Parallelism as a First Class Citizen in C and C++, the time has come

by James Reinders

It is time to make Parallelism a full First Class Citizen in C and C++.  Hardware is once again ahead of software, and we need to close the gap so that application development is better able to utilize the hardware without low level programming.

The time has come for high level constructs for task and data parallelism to be explicitly added to C and C++.  This will enable Parallel Programming in C and C++ to be fully portable, easily intelligible, and consistently decipherable by a compiler.

Language solutions are superior to library solutions. Library solutions provide fertile grounds for exploration. Intel Threading Building Blocks (TBB) has been the most popular library solution for parallelism for C++. For more than five years, TBB has grown and proven itself. It is time to take it to the next level, and move to language solutions to support what we can be confident is needed in C and C++.

We need mechanisms for parallelism that have strong space-time guarantees, simple program understanding, and serialization semantics – all things we do not have in C and C++ today.

We should tackle task and data parallelism both, and as an industry we know how.

  • Task parallelism: the goal is to abstract this enough that a programmer is not explicitly mapping work to individual processor cores. Mapping should be the job of tools (including run time schedulers), not explicit programming. Hence, the goal is to shift all programming to tasks, not threads. This has many benefits that have been demonstrated often. A simple fork/join support is fundamental (spawn/sync in Cilk Plus terminology). Looping with a “parallel for” avoids looping to spawn iterations serially and thereby expresses parallelism well.
  • Data parallelism: the goal is to abstract this enough that a programmer is not explicitly mapping work to SIMD instructions vs. multiple processor cores vs. attached computing (GPUs or co-processors). Mapping should be the job of tools, not explicit programming. Hence, the goal is to shift all programming back to mathematical expressions, not intrinsics or explicitly parallel algorithm decompositions.

The solutions that show the most promise are documented in the Cilk™ Plus open specification (http://cilkplus.org).  They are as follows:

For data parallelism:

  • Explicit array syntax, to eliminate the need for explicit looping in a program to loop (serially) across elements to do the same operation multiple times
  • Elemental functions, to eliminate the need for authors of functions to worry about explicitly writing anything other than the simple, single wide, version of functions. This leaves a compiler to create wider versions for efficiency by coalescing operations in order to match the width of SIMD instructions.
  • Support for reduction operations in a way that makes semantic sense. For instance, this can be done via an explicit way to have private copies of data to shadow global data that needs to be used in parallel tasks that are created (the number of which is not known to the program explicitly). Perhaps this starts to overlap with task parallelism…

For task parallelism:

  • spawn/join to spawn a function, and to wait for spawns to complete
  • parallel for, to specifically avoid the need to serially spawn individual loop body instances – and make it very clear that all iterations are ready to spawn concurrently.

Like other popular programming languages, neither C nor C++ was designed as parallel programming languages. Parallelism is always hidden from a compiler and needs “discovery.” Compilers are not good at “complex discovery” – they are much better at optimizing and packaging up things that are explicit. Explicit constructs for parallelism solve this and make compiler support more likely. The constructs do not need to be numerous, just enough for other constructs to build upon… fewer is better!

For something as involved, or complex, as parallelism, incorporating parallelism semantics into the programming language improves both the expressability of the language, as well as the efficiency by which the compiler can implement  parallelism.

Years of investigation and experimentation have had some great results. Compiler writers have found they can offer substantial benefits for ease of programming, performance, debugging and portability.  These have appeared in a variety of papers and talks over the years, and could be the topic of future blogs.

continue reading…

Sharing Options: