How we get improved performance on a single core - Part 1
RapidMind is all about achieving the full performance potential of modern multi-core processors. Generally, total performance is a combination of two factors:
total performance = scalability across cores × per-core performance
It’s probably no surprise that RapidMind aims to provide excellent scalability across cores. What might surprise you is that we also put significant effort into single-core performance, and RapidMind-enabled apps often significantly outperform C/C++ code written without RapidMind. Some of our customer case studies show improvements like “25x faster than non-RapidMind-code on 8 cores”. That 25x number is made up of perfect scaling across 8 cores, combined with a roughly 3x performance advantage even on a single core.
How’s that possible? I’m going to explore this in my next few blog posts. Read on for the first reason why we get such good performance:
Reason 1: Our programming model
RapidMind is built entirely inside C++. When you write RapidMind code, you’re writing C++, nothing else. The code you write, however, uses our primitives (“Values, Arrays and Programs”) to express its computations and data operations. We design these primitives very carefully so that they don’t make it difficult for our platform to optimize your code, while still giving you all the flexibility you need to express your computations and without adding a significant burden on your part.
For example, computations expressed in RapidMind don’t have “plain pointers” like C/C++, where a pointer could (potentially) point to any particular area in memory. Instead, you have pointers to the beginning of any given array (these are called Accessors and ArrayRefs in RapidMind parlance) and you combine these with indices to look up data in an array. This makes it much easier for the platform to determine when two references to memory might refer to the same location (this is known as alias analysis) which in turn makes it much easier for our backends to optimize your computations.
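To make the aliasing point concrete, here is a minimal sketch in plain C++. It does not use the RapidMind API: the function `scale_with_pointers`, the `ElementRef` struct, and `may_alias` are hypothetical names invented for illustration. The first function shows why raw pointers force a compiler to be conservative; the second part shows how an "array plus index" reference turns the aliasing question into a simple comparison.

```cpp
#include <cstddef>
#include <vector>

// With raw pointers, `scale` may point into `out`. The compiler must assume
// that each store to out[i] could change *scale, so it cannot keep *scale in
// a register for the whole loop without further proof.
void scale_with_pointers(float* out, const float* in, const float* scale,
                         std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * *scale;
    }
}

// Hypothetical "array plus index" reference (an illustration only, not a
// RapidMind type): it names an array and a position inside that array.
struct ElementRef {
    const std::vector<float>* array;  // which array is referenced
    std::size_t index;                // which element of that array
};

// Two such references can only touch the same memory if they name the same
// array and the same index, so the check is a pair of comparisons.
bool may_alias(const ElementRef& a, const ElementRef& b) {
    return a.array == b.array && a.index == b.index;
}
```

If `scale` is passed as a pointer to `out[0]`, the first store changes the scale factor used by every later iteration, which is exactly the case a C/C++ compiler has to allow for. With references shaped like `ElementRef`, distinct arrays can be ruled out immediately.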
Our programming model is also much more explicit about locality of computations and data than pure C++. This helps us allocate memory efficiently, optimize the right chunks of code together, perform cache optimizations, and much more. Good expression of locality is critical to achieving good scalability, but it’s also an invaluable aid in getting good single-core performance.
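As a small illustration of why locality matters even on one core, here is a sketch in plain C++ (again not RapidMind code; the function names are hypothetical). Both functions compute the same sum over a matrix stored row by row in a flat array, but they visit memory in different orders.

```cpp
#include <cstddef>
#include <vector>

// The matrix is stored row-major: element (r, c) lives at data[r * cols + c].

// Visits elements in the order they are laid out in memory, so consecutive
// iterations touch neighbouring addresses.
double sum_row_order(const std::vector<double>& data, std::size_t rows,
                     std::size_t cols) {
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            total += data[r * cols + c];
        }
    }
    return total;
}

// Visits the same elements column by column, so consecutive iterations are
// `cols` elements apart in memory.
double sum_column_order(const std::vector<double>& data, std::size_t rows,
                        std::size_t cols) {
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            total += data[r * cols + c];
        }
    }
    return total;
}
```

On large matrices the row-order version typically makes better use of each cache line it loads, while the column-order version tends to skip across memory. When a programming model states explicitly which data a computation touches, a platform has more room to choose the friendlier traversal for you.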
By designing our programming model around what’s needed to generate efficient code on modern architectures, we are able to get a performance advantage over plain C/C++, which are burdened by language choices that can make it difficult for compilers and processors to do their jobs well. Luckily, it turns out that a lot of the choices that allow better scalability also allow better single-core performance, and vice versa. Additionally, we do this without sacrificing the syntax, modularity concepts (such as functions, classes, and templates) and tools available to C++ developers.
Stay tuned for more reasons we can achieve such high performance per core in my next blog post!
