It’s been a while since I last wrote about parallel programming and C++, and recently I’ve been seeing quite a lot of articles on the subject. C++ seems to become more relevant to multi-core and parallel computing by the day, and with the new standard on its way, the renewed interest is visible even in online discussions.

Today I found in my reading list an article on Dr. Dobb’s Journal about the auto-vectorization available in recent Intel C++ compilers. It’s interesting how much the compiler can apparently do for you between checking that your code is syntactically correct and translating it to machine code.

The Intel compiler has a feature that can make some applications run much faster — auto-vectorisation. With a flick of the switch, some code can be sped up significantly. A number of times I have seen programs run much faster just by using this option without changing a line of code. With vectorisation, the compiler uses a set of advanced Single Instruction Multiple Data (SIMD) instructions, which ornate most modern CPUs.

from “I’ve Fallen In Love With the Vectoriser”.

I can personally vouch for the effectiveness of leveraging the processor’s vector capabilities in mission-critical, performance-sensitive applications. I, however, have been using the tree vectorization optimization that’s been present in GCC since version 4.1.x (through the -ftree-vectorize option).

Vectorization is the technique of transforming code to run on a vector processor, that is, a processor that can apply a single instruction to multiple data elements (SIMD); this is where the SSE instruction set gets its name (SSE stands for Streaming SIMD Extensions). It lets repetitive mathematical operations that can be performed in parallel complete in fewer instructions than they would take if done in sequence.
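To make this concrete, here is a minimal sketch of the kind of loop an auto-vectorizer typically looks for: a simple counted loop whose iterations don’t depend on each other. The function name `add_arrays` is made up for illustration, and whether the compiler really emits SIMD instructions for it depends on the compiler version, the target, and the flags used (with GCC, something like `g++ -O2 -ftree-vectorize`).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: element-wise sum of two equally sized arrays.
// Each iteration reads b[i] and c[i] and writes a[i], with no dependency
// on any other iteration, so several elements can be processed at once.
std::vector<float> add_arrays(const std::vector<float>& b,
                              const std::vector<float>& c) {
    assert(b.size() == c.size());
    std::vector<float> a(b.size());

    // Plain pointers and a fixed trip count keep the loop easy to analyze.
    const float* pb = b.data();
    const float* pc = c.data();
    float* pa = a.data();
    const std::size_t n = a.size();

    for (std::size_t i = 0; i < n; ++i) {
        pa[i] = pb[i] + pc[i];
    }
    return a;
}
```

Done one element at a time, this takes one addition per element; with 128-bit SSE registers the same work can be done four floats per instruction, which is where the speedup comes from.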

Those who have read about parallel computing from its early days (around the ’70s or ’80s) will know that there is a wealth of research on how to do parallel programming on many different types of computers. Recently these techniques seem to be coming up again in the midst of the move to parallel programming, and this is no surprise. The evolution from sequential processors (von Neumann architecture) to parallel processors (systolic arrays, hypercubes, and meshes) apparently carries over even to the commodity processors of today.

Take vector processors: it might surprise some to find out that the GPUs in modern graphics cards are actually SIMD processors and that they leverage research done in the parallel computing field. This is why they can perform blazingly fast computations on the fly to generate life-like 3D images at high frame rates. It is also why the cheapest way to get your hands on a supercomputer today, for repetitive mathematical computations on large data sets, is to put a handful of these GPUs together and write programs that utilize them.

Because of the popularity of these SIMD instruction sets, modern multi-core processors actually have a small vector unit embedded that can do practically the same thing, albeit at a smaller scale. This trend actually started earlier, in the higher-performance single-core processors (as early as the Pentium III). The first application was improving processor performance for games that need these sorts of instructions to make certain things feasible.

In the case of the hypercube architecture, you can look at the modern multi-core processor as a simple case of a 1-dimensional hypercube of processors (dual core) or a 2-dimensional hypercube of processors (quad core). In the early days of the hypercube architecture, each processor had dedicated communication channels to its neighbors and did not share data. Today, you can look at multi-core processors as an evolution of this early hypercube architecture, but with a shared memory space. Starting with the current and next generation of many-core processors, though, you’ll see NUMA, where each processor is responsible for managing certain memory modules.
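The hypercube topology is easy to sketch in code if you number the processors in binary: in a d-dimensional hypercube there are 2^d nodes, and two nodes are directly linked exactly when their numbers differ in one bit. The helper below, `hypercube_neighbors`, is a made-up name for illustration and not part of any library.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: lists the nodes directly linked to `node` in a
// hypercube with the given number of dimensions. Flipping each of the
// low `dimensions` bits of the node number yields one neighbor per link.
std::vector<unsigned> hypercube_neighbors(unsigned node, unsigned dimensions) {
    assert(dimensions < 32);            // keep the shifts well-defined
    assert(node < (1u << dimensions));  // node must exist in this hypercube

    std::vector<unsigned> neighbors;
    neighbors.reserve(dimensions);
    for (unsigned bit = 0; bit < dimensions; ++bit) {
        neighbors.push_back(node ^ (1u << bit));
    }
    return neighbors;
}
```

With one dimension, node 0 has the single neighbor 1 (the dual-core picture); with two dimensions, node 0 is linked to nodes 1 and 2 but not to node 3 (the quad-core picture).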

This is all interesting from the hardware architecture perspective, but how does it concern C++? For one, compilers that have greater insight into the target architecture of the machine your code will run on can leverage the available instruction sets and generate more efficient code. Knowing the architecture of the processors also gives programmers greater insight into how to use the cache, data layout, and memory management characteristics of the system, so they can write effective high-level code in a guided and informed manner. Certain decisions and predictions can be made through deliberate examination of the technology and tools at your disposal as a C++ programmer (threads, synchronization primitives, memory usage, etc.).
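As a small illustration of what layout awareness can mean in practice, here is a sketch that sums the same matrix in two traversal orders. Both functions (the names are invented for this example) return the same result; the difference is only in how they walk memory. On typical cache hierarchies the contiguous, row-by-row walk tends to be friendlier to the cache for large matrices, but that is an assumption you should measure on your own hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The matrix is stored row-major in a flat vector: element (r, c) lives
// at index r * cols + c.

// Walks memory contiguously: consecutive iterations touch adjacent ints.
long long sum_row_order(const std::vector<int>& m,
                        std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Walks memory with a stride of `cols` elements between iterations.
long long sum_column_order(const std::vector<int>& m,
                           std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

The row-order loop is also the shape a vectorizer generally handles more easily, since adjacent elements can be loaded into one vector register.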

There’s more to come in the months and years ahead as we become more and more immersed in the reality that multi-core and parallel programming is the future. I forget where I read or heard it before, but this is a classic case where “Resistance is Futile” definitely applies.


  1. Allister

    Great post, Dean. Thanks.

  2. grey wolf

    You mention a “Hypercube” as a parallel computing paradigm, but the link you give is to the Wikipedia article on the geometric concept.

    A former coworker of mine described to me the concept of a “hypercube operating system,” but I’m still quite unclear on it. Do you have a different link that might help?

  3. Dean Michael

    Hi Grey Wolf,

    Actually, Hypercube talks about the interconnection between the compute nodes in a parallel computer or the topology of the processors. If you understand the notion of a three-dimensional representation of a hypercube (where there are N links emanating from each vertex of the graph) you can imagine a parallel computer where you have at each vertex a processor and each link is a dedicated communication channel.

    This link on parallel computing should give you a better idea about the different interconnect topologies used in parallel computing systems. I hope this helps!

  4. grey wolf

    That makes a good deal more sense, yes. I hadn’t considered such a model because it doesn’t seem to be a very typical hardware model. :) (AMD Opterons notwithstanding.)

    I suppose what confused me was how this coworker described sort of a speculative processing approach, where results that may be needed are calculated on a free thread and are aborted and discarded if not needed. I recalled that he called it a hypercube, but I may be mistaken.

    Thank you for the clarification.



