auto vectorization

Loops are one of the most fundamental concepts in all of the contemporary programming languages. It is fairly common to do computation intensive work in the for loop, especially with media and scientific software. It would be really helpful if we could somehow speed up the calculations done in these for loops. Well several optimizations have been in place like loop unrolling and doing operations on multiple data in a single instruction, popularly know as SIMD.

] SIMD, Single Instruction Multiple Data – By Jon Stokes (ArsTechnica)

Now you need to understand that SIMD is good for your application due to two aspects:

  1. It can operate on multiple chunks of data in one single instruction, so you will definitely see good speed boost
  2. It utilizes Cache well, there by reducing cache miss and increasing performance. Well this is more pronounced in multi-core processors

You can read all about SIMD in an article, written by Jon Stokes of ArsTechnica, and about how caches are laid out and their implications, a very informative article by James Bottomley

Visual Studio 2011 does the job of converting your for loops, appropriately, into vectorized functions.

Let’s look at simple for loop:

void SumArray(double* a, double* b, size_t NumElements, double* c)
{
    for( auto idx = 0U; idx < NumElements; ++idx )
    {
        c[idx] = a[idx] + b[idx];
    }
}

Here it would be possible to just add several doubles together in single add operation, and this is one of the optimizations in Visual Studio 2011.

The code generated is something like this, conceptually:

void SumArray(double* a, double* b, size_t NumElements, double* c)
{
    size_t multiplesOf8 = NumElements/8;
    size_t remaining = NumElements % 8;

    for( auto idx = 0U; idx < multiplesOf8; ++idx )
    {
        //  fetch 8 doubles from a
        //  fetch 8 doubles from b
        //  add a + b in single instruction
        //  store 8 sums in c
        c[idx] = a[idx] + b[idx];
    }

    //  remaining elements that were not added in
    //
    a = a + multiplesOf8*8;
    b = b + multiplesOf8*8;
    c = c + multiplesOf8*8;
    for( auto idx = 0U; idx < remaining; ++idx )
    {
        c[idx] = a[idx] + b[idx];
    }
}