How we get improved performance on a single core - Part 2

In my last post I blogged about the fact that RapidMind-enabled C++ code often gets improved performance even on a single core. I gave one reason for this, our programming model. In this post I’d like to address another reason: runtime program generation. Read on for part two of this series on single core performance with RapidMind.

RapidMind-enabled C++ code is still C++ code. To be precise, RapidMind-enabled code is C++ code which, at runtime, builds RapidMind programs. Let me give you an example. With RapidMind, you might write the following (extremely trivial) program:

Program my_add = RM_BEGIN {
  In<Value1f> a, b;
  Out<Value1f> c;
  c = a + b;
} RM_END;

In case you’ve never seen RapidMind code before, this snippet declares a RapidMind program (like a C++ function, but more powerful) called “my_add”. This program is defined to take two inputs (a and b), and produce a single output (c), all of which are scalar floating point values. It produces c by adding together the values in a and b, and that’s it. This is all valid C++ code – names like Program, In, Out, Value1f, RM_BEGIN and RM_END are types or macros provided by the RapidMind libraries and headers. Later on, you can take my_add and apply it to arrays of numbers, and RapidMind will generate machine code and manage its execution (in parallel!) for you.

Those lines of C++ code compile (through your C++ compiler) into a dialogue between your application and RapidMind. If we imagined this being played out on a stage, it might look something like this:

Program my_add

Application: Hey, RapidMind! I want to make a new program! I want to call it my_add.
RapidMind: OK. Gotcha. Here you go, here’s your new program. It’s empty right now.

my_add = RM_BEGIN

Application: And hey, I’d like to define a new body of code, and store it in my_add. OK?
RapidMind: Yup. Let me just hang on to a copy of this my_add program, so I know where to put anything you might want to define next.

In<Value1f> a, b;

Application: That program you’re building up for me? I’d like to add two inputs to it, please. Call them a and b.
RapidMind: Ah, good thing I’m holding onto this my_add program for you. I’ll just add those two inputs to it. Here they are, do with them as you please.

Out<Value1f> c;

Application: Also, I’d like to add an output to that program I’m building. Call it c.
RapidMind: No worries. Let me just mark that down. One output for my_add, it’s called c. Here you go!

c = a + b; 

Application: OK, now, remember those inputs from earlier, a and b? At this point in the program, I’d like you to remember to add them together, and store them in c.
RapidMind: Got it. I’m still holding on to my_add, and I’ll just append to it that you’d like to add a to b, and store the result in c.

RM_END

Application: Alright. That’s it for now. my_add is done, and I’m not going to add anything else to it anymore.
RapidMind: You’re the boss. Let me just polish my_add a little for you, and here you go. One completed program. Let me know if you want to do something with it later.

 

Yeah, perhaps parallel programs would not make the most interesting of plays. But hopefully the point comes across: when “a” and “b” are “added” in the above example, your application isn’t telling RapidMind to add them together right then and there. It’s telling RapidMind to remember to add them together, when you use my_add later on. And this dialogue is going on at runtime, when those lines of C++ are executing. If “a” and “b” were regular C++ floats, your app would be having a similar dialogue with your C++ compiler at compile time, and at runtime, those lines really would mean “add a to b right now”.
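If you’d like to see the “remember to add them later” idea in plain C++, here is a minimal sketch. To be clear, this is not the RapidMind API or how RapidMind is implemented: the names Program, Var, Sum, AddOp and make_my_add below are all made up for illustration, and the toy only knows how to record additions over numbered slots.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy recorder, NOT the RapidMind API.
// One recorded step: slots[dst] = slots[lhs] + slots[rhs].
struct AddOp { std::size_t dst, lhs, rhs; };

// What "x + y" evaluates to while recording: a description, not a number.
struct Sum { std::size_t lhs, rhs; };

struct Program {
    std::size_t num_slots = 0;
    std::vector<AddOp> ops;

    // Only here, later on, do the additions actually happen.
    std::vector<float> run(std::vector<float> slots) const {
        assert(slots.size() == num_slots);
        for (const AddOp& op : ops) {
            slots[op.dst] = slots[op.lhs] + slots[op.rhs];
        }
        return slots;
    }
};

// A handle to one slot of the program being built.
struct Var {
    Program* prog;
    std::size_t slot;
    explicit Var(Program& p) : prog(&p), slot(p.num_slots++) {}

    // Assigning a Sum computes nothing; it appends a step to the program.
    Var& operator=(const Sum& s) {
        prog->ops.push_back({slot, s.lhs, s.rhs});
        return *this;
    }
};

inline Sum operator+(const Var& x, const Var& y) { return {x.slot, y.slot}; }

// Runs at runtime and builds the program, much like the dialogue above.
Program make_my_add() {
    Program p;
    Var a(p), b(p), c(p);  // slots 0, 1 and 2
    c = a + b;             // recorded for later, not executed now
    return p;
}
```

Calling make_my_add() gives you a program holding exactly one recorded step; handing it concrete values through run({1.5f, 2.0f, 0.0f}) is what finally performs the addition and leaves the result in slot 2.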

So, what’s so special about this? Well, this is a pretty powerful mechanism! By delaying the construction of these programs until runtime, we can start doing some very natural tricks, like the following:

for (int i = 0; i < 10; ++i) { c = c + a; }

Think for a moment about this code. It looks pretty plain. “Add a to c ten times”. However, what this code is really saying is “tell RapidMind to remember to add a to c” ten times in a row. It’s like the following dialogue:

Application: Hey, remember c and a? Could you remember to add them together and update the result in c?
RapidMind: Sure thing. I’ll append that to the list of things you’d like this program to do.
Application: Hey, remember c and a? Could you remember to add them together and update the result in c?
RapidMind: Uh, yeah, sure thing. I’ll append that to the list of things you’d like this program to do.
Application: Hey, remember c and a? Could you remember to add them together and update the result in c?
RapidMind: Umm OK… Once again, I’ll append that to the list of things you’d like this program to do. Boy, this app is pretty boring!
…and so forth, 7 more times… 

In compiler terminology, this is called unrolling, and it can be a pretty handy optimization for avoiding the branching that regular loops cause. Compilers aren’t always very good at unrolling for you, because they often can’t tell how many times a loop will run or whether unrolling it will pay off. Of course you can still write regular (non-unrolled) loops in RapidMind too, but you can just as easily write unrolled loops. If you’ve tried to unroll loops by hand in C or C++ code, you know that it usually leads to a lot of code repetition, ugly C macros, external scripts, or something along those lines.
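Here is a small, self-contained sketch of that unrolling effect. It uses a made-up toy recorder (Program, Var, Sum, AddOp and make_accumulate are hypothetical names, not RapidMind types), and it only supports addition. The point to watch is that the for loop runs while the program is being built, so the recorded program ends up with one add per iteration and no loop at all.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy recorder, NOT the RapidMind API.
// One recorded step: slots[dst] = slots[lhs] + slots[rhs].
struct AddOp { std::size_t dst, lhs, rhs; };
struct Sum { std::size_t lhs, rhs; };

struct Program {
    std::size_t num_slots = 0;
    std::vector<AddOp> ops;

    // Straight-line execution of whatever was recorded.
    std::vector<float> run(std::vector<float> slots) const {
        assert(slots.size() == num_slots);
        for (const AddOp& op : ops) {
            slots[op.dst] = slots[op.lhs] + slots[op.rhs];
        }
        return slots;
    }
};

struct Var {
    Program* prog;
    std::size_t slot;
    explicit Var(Program& p) : prog(&p), slot(p.num_slots++) {}

    // Appends a step instead of computing a value.
    Var& operator=(const Sum& s) {
        prog->ops.push_back({slot, s.lhs, s.rhs});
        return *this;
    }
};

inline Sum operator+(const Var& x, const Var& y) { return {x.slot, y.slot}; }

// `times` can be any value known when the program is built.
Program make_accumulate(int times) {
    Program p;
    Var a(p), c(p);  // slot 0 is a, slot 1 is c
    for (int i = 0; i < times; ++i) {
        c = c + a;   // the host loop records one add per iteration
    }
    return p;
}
```

With make_accumulate(10), the ops list holds ten separate adds. The loop counter and the loop branch exist only in the code that builds the program, never in the program itself.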

This is just one example of how runtime program generation can be used to implement an optimization. You can also do things like generate different specialized versions of a program for different tasks, optimize a computation for a particular piece of data you only know at runtime, try out different ways of implementing a program and choose between them at runtime (at no cost!), etc. The list goes on and on. There’s also some inherent benefit to doing this – for example, modularity constructs like classes in C++ don’t end up adding any overhead to your computations, but can still be used to structure your code cleanly.
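To make “optimize a computation for a particular piece of data you only know at runtime” a bit more concrete, here is a hedged sketch in plain C++. It is a toy of my own for this post, not RapidMind code: DotProgram, MulAdd and specialize are hypothetical names. It builds a weighted-sum program for a set of weights that only show up at runtime, and simply leaves out every term whose weight is zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy, NOT the RapidMind API.
// One recorded step: acc += weight * in[index].
struct MulAdd { std::size_t index; float weight; };

struct DotProgram {
    std::size_t num_inputs = 0;
    std::vector<MulAdd> steps;

    // Runs only the steps that survived specialization.
    float run(const std::vector<float>& in) const {
        assert(in.size() == num_inputs);
        float acc = 0.0f;
        for (const MulAdd& s : steps) {
            acc += s.weight * in[s.index];
        }
        return acc;
    }
};

// Build a program specialized for weights known only at runtime.
// Assumes finite inputs, so dropping a zero-weight term cannot change the result.
DotProgram specialize(const std::vector<float>& weights) {
    DotProgram p;
    p.num_inputs = weights.size();
    for (std::size_t i = 0; i < weights.size(); ++i) {
        if (weights[i] == 0.0f) {
            continue;  // decided once while building, never tested while running
        }
        p.steps.push_back({i, weights[i]});
    }
    return p;
}
```

The zero test lives in the builder, so the specialized program never pays for it. In the same way, any classes or helper functions you use to organize the builder do their work at build time and leave no trace in the recorded steps.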

These kinds of things are particularly useful when you start tuning parameters, which brings me back to the high-level topic of this post: how C++ code can get better performance on a single core with RapidMind than without it. We often implement algorithms that have all sorts of “tuning knobs.” When we do so, we typically write them in a very generic way – we don’t fix any of these knobs, we just make them available in the code. Then we go and tune away (by hand or with tools we’ve developed for this purpose). At the end, we end up with some specific parameter settings that give optimal performance, without sacrificing flexibility or code quality!
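The knob-tuning workflow can be sketched in a few lines of standard C++. This is only a stand-in under my own assumptions: each “variant” here is an ordinary closure rather than a generated RapidMind program, and Kernel, make_sum and pick_block are hypothetical names. The knob is the block size of a blocked summation, and the tuner times each candidate on sample data and keeps the fastest.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

using Kernel = std::function<long long(const std::vector<int>&)>;

// Build one variant of a summation kernel; `block` is the tuning knob.
Kernel make_sum(std::size_t block) {
    assert(block > 0);
    return [block](const std::vector<int>& v) {
        long long total = 0;
        for (std::size_t start = 0; start < v.size(); start += block) {
            std::size_t end = start + block < v.size() ? start + block : v.size();
            long long partial = 0;  // per-block partial sum
            for (std::size_t i = start; i < end; ++i) {
                partial += v[i];
            }
            total += partial;
        }
        return total;
    };
}

// Time every candidate knob value once on the sample and return the fastest.
std::size_t pick_block(const std::vector<std::size_t>& candidates,
                       const std::vector<int>& sample) {
    assert(!candidates.empty());
    std::size_t best = candidates.front();
    auto best_time = std::chrono::steady_clock::duration::max();
    for (std::size_t block : candidates) {
        Kernel kernel = make_sum(block);
        auto t0 = std::chrono::steady_clock::now();
        volatile long long sink = kernel(sample);  // keep the call from being optimized away
        (void)sink;
        auto elapsed = std::chrono::steady_clock::now() - t0;
        if (elapsed < best_time) {
            best_time = elapsed;
            best = block;
        }
    }
    return best;
}
```

A single timing per candidate is noisy, so a real tuner would repeat each measurement and use realistic data. The shape is the part that matters: the generic code keeps the knob as a parameter, and the chosen value gets fixed only when the final variant is built.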

Thus, runtime program generation is a very powerful tool to have at your disposal when it comes to writing high performance code. And it’s just one of the neat tools you get when you use RapidMind to express your applications!

I’ll be discussing another, related, reason for RapidMind’s high performance even without multiple cores in my next post. Stay tuned!
