Parallel Computing

Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently (“in parallel”). There are several different forms of parallel computing: bit-level, instruction level, data, and task parallelism. Parallelism has been employed for many years, mainly in high-performance computing, but interest in it has grown lately due to the physical constraints preventing frequency scaling. As power consumption (and consequently heat generation) by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multicore processors.

Parallelism can be broadly classified in three types:

  1. Instruction-level
  2. Data Parallel
  3. Task Parallel

Instruction-Level Parallelism (ILP)

Instruction-level Parallelism is a family of processor and compiler design techniques that speeds up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force in computer design.

Three of the most common techniques for ILP are:

Superscalar Execution

A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU such as an arithmetic logic unit, a bit shifter, or a multiplier.

Please refer Chapter 4 from Inside The Machine by Jon Stokes for excellent coverage of superscalar architecture.

The major advantage of the Superscalar/Out-of-Order Execution lies with the software developer, as the hardware extracts the parallelism from programmer’s code automatically without any  efforts from programmer. On the other hand it requires a substantial area on the CPU die to maintain dependence information and queues of instructions.

VLIW (very large instruction word)

Very long instruction word or VLIW refers to a CPU architecture designed to take advantage of instruction level parallelism (ILP). A processor that executes every instruction one after the other (i.e. a non-pipelined scalar architecture) may use processor resources inefficiently, potentially leading to poor performance. The performance can be improved by executing different sub-steps of sequential instructions simultaneously (this is pipelining), or even executing multiple instructions entirely simultaneously as in superscalar architectures.

Refer these two articles from Jon Stokes of ArsTechnica for excellent coverage of Pipelining in Microprocessors:

  1. Understanding the Microprocessor Part-1
  2. Understanding the Microprocessor Part-2

VLIW primarily depends on Compiler to generate interleaved and efficient code for parallel execution.

SIMD/Vector

A vector processor, or array processor, is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors. This is in contrast to a scalar processor, whose instructions operate on single data items.

Vector processing techniques have been added to almost all modern CPU designs, although they are typically referred to as SIMD (Single Instruction Multiple Data). In these implementations, the vector unit runs beside the main scalar CPU, and is fed data from programs that know it is there.

SIMD tackles extraction of parallelism in different way as compared to Superscalar and VLIW ILP, while both of these extract independent ILP from unrelated instructions, SIMD allows the hardware instructions to target data parallel executions. The bigger benefit comes from fact that SIMD instructions are generated by Compiler/Developer rather than Processor figuring it out during execution.

SIMD code is also generated automatically by various compilers, also known as loop-unrolling or vectorization, as part of optimization. See here for some details on how VS 2011 generates AVX (Advanced Vector Extensions) instructions for computation acceleration.

SIMD is very commonly used in media and scientific applications for computation acceleration, as the arithmetic density is quite high as compared to control instructions.

Data Parallelism

Data parallelism (also known as loop-level parallelism) is a form of parallelization of computing across multiple processors in parallel computing environments. Data parallelism focuses on distributing the data across different parallel computing nodes.

Data parallelism is abstraction for doing certain operation on different data. It encompasses SIMD and SPMD (Single Program Multiple Data). It is heavily used in graphics rendering or solving large linear systems as these problems can be very easily expressed in data parallel model.

The data-parallel programming model is both higher level and more restrictive than the task model. It is higher level in sense that the programmer is not required to specify communication structures explicitly: these are derived by a compiler from the domain decomposition specified by the programmer. It is more restrictive because not all algorithms can be specified in data-parallel terms. For these reasons, data parallelism, although important, is not a universal parallel programming paradigm.

Source

Task Parallelism

Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing execution processes (threads) across different parallel computing nodes. It contrasts to data parallelism as another form of parallelism.

The task-parallel programming model is less restrictive than the data-parallel model. It is lower level in sense that the programmer is required to specify communication structures explicitly, in case of task based libraries, like TBB or PPL, communication and control logic is mostly taken care by concurrency runtime. It can model almost all kinds of parallel programming models.

In a multiprocessor system, task parallelism is achieved when each processor executes a different thread (or process) on the same or different data. The threads may execute the same or different code. In the general case, different execution threads communicate with one another as they work. Communication takes place usually to pass data from one thread to the next as part of a workflow.

Summary

In this post I have tried to categorize Parallel Computing in terms of models used to represent the work being done. ILP falls mostly in the domain of hardware and compilers, developers can also write SIMD code but it is better to leave it to Compilers. ILP defines very fine grained level of parallelism and is mostly an implementation technique. While Data and Task parallelism are more of architectural models, they define granularity of parallel work, and at times entire workflow, for your application. Parallelism defined by Data & Task Parallel models is at much coarser level as compared to ILP.

Clearly there is no one model that is universal, in terms of both expressing parallelism and extracting superior performance. Task Parallelism is suitable for expressing concurrency, where workdone is of different nature and/or operates on different data, as it give more control over the entire workflow. Data Parallelism is an excellent choice for doing tons of calculations on a single set of data, e.g., GPU programming, where arithmetic instruction density is very high as compared to control instructions. ILP is great for extracting out the last bit of performance from executable code. However in most of the real life applications you will get the better results by incorporating synergistic combination of these three models.