Archive for June, 2008

Teraflops, Petaflops, and Turning Hours into Minutes, and Minutes into Seconds

Thursday, June 19th, 2008
Posted By Dr. Michael McCool

It’s been a busy couple of weeks. First, we just came back from SIFMA, where we demonstrated an approximately 55x speedup on an important (and non-trivial) financial option pricing algorithm, using AMD hardware. (Since we’re using RapidMind, we can also run the same code on all our other hardware targets.) This coincided with an announcement of our support for AMD’s FireStream product line.

Second, the ISC conference (http://www.supercomp.de/isc08/content/) is going on right now and there have been a number of new hardware announcements. AMD announced a new FireStream card (FireStream 9250), and NVIDIA also announced a new line of GPUs (Tesla 10P). Both of these products are capable of teraflop performance, which is of course great news for people using RapidMind. At the other end of the scale, the largest Cell BE installation in the world, the LANL Roadrunner (http://www.lanl.gov/roadrunner/), broke the petaflop barrier using an actual benchmark, Linpack. Since RapidMind also targets the Cell BE, this is also good news, as it demonstrates clearly the power of this architecture and its ability to scale in large installations.

Getting back to what we did at SIFMA, we demonstrated a 55x speedup on something called a binomial option pricer. I will talk about this at greater length in an upcoming article, but will mention some interesting points here.

First, the binomial pricer, unlike the Monte Carlo pricers in our previous results, is very memory-intensive. It’s similar to iterated convolution and explicit PDE solvers. As we also demonstrated great scalability on Barcelona processors and on the FireStream, this application shows how RapidMind can be used to tackle memory-bound applications as well as compute-bound applications such as Monte Carlo.

Second, option pricing is considered a “fundamental primitive” in computational finance. In particular, risk evaluation requires a large number of such evaluations and is an important workload used in day-to-day practice by many financial institutions. The speedup factor we have demonstrated has the potential to reduce such calculations from hours to minutes: at 55x, a three-hour run would finish in a little over three minutes.

As in the other application areas that we target, this can potentially transform the workflow practices where these computations are used. If a computation takes hours, you basically have to run it in batch mode, possibly overnight. If it takes minutes, you can run it many times during the day, as part of an interactive workflow, and use up-to-date inputs, enabling completely new ways to do business. This kind of transformation can create incredible new opportunities for our clients. And as demonstrated by the recent hardware announcements I’ve noted above, multi-core and in particular heterogeneous-core processors and their deployments are in a definite growth phase, so even better results can be expected in the near future.
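To make the “iterated convolution” comparison concrete, here is a minimal sketch of a binomial pricer in plain C++. It is not the code we demonstrated and does not use the RapidMind API. It assumes a European call and the Cox-Ross-Rubinstein lattice parameterization, and the function and parameter names are purely illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only: European call priced on a Cox-Ross-Rubinstein lattice.
// The CRR parameterization and every name here are assumptions of this sketch.
double binomial_call(double spot, double strike, double rate,
                     double volatility, double maturity, int steps) {
    assert(steps > 0 && maturity > 0.0 && volatility > 0.0);
    const double dt = maturity / steps;
    const double up = std::exp(volatility * std::sqrt(dt));
    const double down = 1.0 / up;
    const double growth = std::exp(rate * dt);
    const double p = (growth - down) / (up - down);  // risk-neutral up probability
    const double discount = 1.0 / growth;

    // Payoffs at expiry: node j has seen j up-moves and (steps - j) down-moves.
    std::vector<double> values(static_cast<std::size_t>(steps) + 1);
    for (int j = 0; j <= steps; ++j) {
        const double price = spot * std::pow(up, j) * std::pow(down, steps - j);
        values[j] = std::max(price - strike, 0.0);
    }

    // Backward induction: every level is a two-point weighted sum of the level
    // after it, so the whole array is re-read and re-written `steps` times.
    for (int level = steps; level > 0; --level) {
        for (int j = 0; j < level; ++j) {
            values[j] = discount * (p * values[j + 1] + (1.0 - p) * values[j]);
        }
    }
    return values[0];
}
```

Each pass of the inner loop does only a couple of multiplies and adds per element it loads and stores, which is why this kind of computation tends to be limited by memory traffic rather than arithmetic. Calling `binomial_call(100.0, 100.0, 0.05, 0.2, 1.0, 1000)` prices an at-the-money one-year call. The in-place update suits a sequential loop; a parallel version would more likely alternate between two buffers.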

30,000 Bees and the Multi-Core Adrenaline Rush

Tuesday, June 17th, 2008
Posted By Dr. Michael McCool

I recently installed 30,000 bees in two hives. About 3 pounds of bees per hive. I wasn’t stung once.

A week later I went back to check the hives. One colony was a little smaller, not quite 3 pounds, so I wanted to make sure the queen was alive and well and reigning over her brood. I took apart the hive boxes, found the queen in one frame and again, I wasn’t stung. Such friendly bees.

But it’s intimidating when you first open the box and see a bee-covered frame.

You take a deep breath and gather your courage because bees can smell the adrenaline coursing through your veins.

I was reading recently about an informal study (“Survey measures readiness to adopt multicore technology”) in which some 200 embedded developers explained why they are slow to adopt embedded multi-core technology. To them, it must feel like 60 pounds of bees have swarmed their desks.

The problem looks overwhelming, but the reigning messages are wrong. We have the thread-locking bee, the single-processor-bias bee (that one took a lot of royal jelly to rear into a queen), and I’ve just spotted the familiar lack-of-determinism bee. This is a messy hive.

Nitrogen Narcosis - Part II: The Serious Drawbacks of Explicit Multi-Threading

Thursday, June 5th, 2008
Posted By Dr. Michael McCool

In my last posting, I mentioned that explicit multi-threading has serious drawbacks:

· Multi-threaded applications are more difficult to test than single-threaded applications and are hard to debug.

The main, underlying problem with multi-threading is non-determinism. Multiple threads running simultaneously don’t run in lockstep unless you explicitly synchronize them.  However, you want to minimize synchronization because it has a negative impact on performance.

Because these threads don’t run in lockstep, and because they can access data structures and devices simultaneously (for example, two threads writing to the same memory location), the result is a class of bugs, race conditions, that is very difficult to find, reproduce, and fix. Even inserting synchronization constructs doesn’t always make things easier. Mistakes in explicit synchronization are what lead to deadlocks, where two threads wait for each other and thus never continue.
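To illustrate both hazards with standard C++ threads (not RapidMind), here is a minimal sketch. The `Account` type and function names are hypothetical. Two threads move money in opposite directions; if each took its two locks one after the other in argument order, the threads could end up holding one lock each and waiting forever. The sketch instead acquires both mutexes with `std::lock`, which is specified to avoid deadlock.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical account type, used only for this illustration.
struct Account {
    std::mutex guard;
    long balance = 0;
};

// Move `amount` between two distinct accounts. std::lock acquires both
// mutexes without deadlocking, whatever order the arguments arrive in;
// the lock_guards then take ownership so both are released on return.
void transfer(Account& from, Account& to, long amount) {
    assert(&from != &to);
    std::lock(from.guard, to.guard);
    std::lock_guard<std::mutex> hold_from(from.guard, std::adopt_lock);
    std::lock_guard<std::mutex> hold_to(to.guard, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

// Two threads transfer in opposite directions at the same time.
// Returns the combined balance, which the locking keeps constant.
long run_opposing_transfers(int rounds) {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;  // safe to read: both threads have joined
}
```

Without the mutexes, the two threads would update the balances concurrently and the total could drift from run to run. In C++17, `std::scoped_lock lock(from.guard, to.guard);` is shorthand for the three locking lines. Note how much care even this tiny example needs, which is the point of the paragraph above.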

The exact timing of all of this is determined by many factors, so you can never be sure that your well-tested product won’t break after it ships because the timing has changed ever so slightly.

RapidMind solves this because threading, synchronization and more are handled by the platform, so that your application is deadlock-free, race-condition-free, and deterministic!

Another concern:

· Explicit multi-threading doesn’t scale well as the number of cores increases.

Multi-threading’s explicit “threads” suggest that you use task parallelism as a model. But task parallelism doesn’t scale well: if you have only K threads, where K is some constant, you will never get a speedup from more than K cores.

The threading model is built around the concept of a task, where every task has a separate sequence of control. RapidMind does not use task parallelism but uses data parallelism. Data parallelism is based on the fact that applications often operate on collections of data, and units of work are often associated with separate elements of such collections. The RapidMind platform uses an SPMD (Single Program, Multiple Data) stream programming model. Using an SPMD data-parallel model allows you to work with familiar concepts like functions and arrays but also directly express parallel algorithms.
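The difference can be sketched in plain C++. This is not the RapidMind API, and the names `parallel_for_each` and `default_workers` are hypothetical. A single function is applied to every element of an array, and the number of workers is taken from the machine instead of being fixed at some constant K, so the same code can use more cores when they are available.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Apply the same function to every element (one program, many data elements).
// The index range is split into contiguous chunks, one per worker. Workers
// touch disjoint elements, so no locks are needed.
template <typename Fn>
void parallel_for_each(std::vector<double>& data, unsigned workers, Fn fn) {
    assert(workers > 0);
    const std::size_t n = data.size();
    const std::size_t chunk = (n + workers - 1) / workers;  // ceiling division
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(n, w * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // more workers than elements
        pool.emplace_back([&data, fn, begin, end] {
            for (std::size_t i = begin; i < end; ++i) data[i] = fn(data[i]);
        });
    }
    for (std::thread& t : pool) t.join();
}

// Pick the worker count from the hardware rather than hard-coding K.
unsigned default_workers() {
    const unsigned reported = std::thread::hardware_concurrency();
    return reported == 0 ? 1u : reported;  // 0 means "could not be determined"
}
```

A call such as `parallel_for_each(prices, default_workers(), [](double x) { return x * 1.05; });` expresses the work per element and leaves the degree of parallelism to be chosen at run time. A real platform would also handle scheduling, vectorization, and load balancing, none of which this sketch attempts.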

And a final issue:

· Explicit multi-threading can’t leverage the use of accelerators.

Accelerators are parallel machines, and multi-threading is a way to express parallelism in your code. However, the OS threading APIs only target the main CPU cores on which the OS and applications themselves are running. Taking advantage of accelerators requires using further, accelerator-specific, APIs. Furthermore, languages such as C, C++, Java, and almost all other programming languages in common use, assume a shared-memory programming model, where all memory is accessible equally by all computational devices. This isn’t the case for GPUs, for example, where the GPU has a separate memory that it accesses, and data must be explicitly transferred between main memory and GPU memory.

RapidMind automatically manages synchronization between the host and the accelerator, so that you don’t have to.

Multi-Threading: The Nitrogen Narcosis of Programming

Thursday, June 5th, 2008
Posted By Dr. Michael McCool

Explicit multi-threading is like the delusional thinking one has diving at a depth of 100 feet with too much nitrogen in the bloodstream. You stumble, you misread your gauges, everything takes longer to do. It can happen even to experienced divers.

The remedy is to ascend, and the delusional thinking goes away immediately.

It’s not the bends, but nitrogen narcosis is dangerous and completely avoidable. And so is multi-threading.

You can manually multi-thread an application to take advantage of multiple cores, but these projects are time-consuming and error-prone. Multi-threaded applications are more difficult to develop and test than single-threaded applications. Even experienced software developers find multi-threading challenging and multi-threaded applications hard to debug. Did you misread your gauge again? Are you really sure you have air left in your tank?

Like a nitrogen-inebriated diver, you need help to ascend. Otherwise you’ll be stuck at the bottom thinking everything is just fine. Meanwhile, the truth is that multi-threading doesn’t scale well as the number of cores increases. And it can’t leverage the use of accelerators. More importantly, it doesn’t address the issue of memory management (a crucial factor in performance).

RapidMind helps you ascend. The RapidMind platform manages memory, does per-core parallelization as well as multi-core parallelization, transparently enables the use of accelerators (if present), and avoids the debugging challenges of multi-threading.

As a developer you work with a single thread of execution. You write code in standard C++ and use your existing skills, tools and processes. The RapidMind platform then parallelizes the application across multiple cores and manages its execution.

When you ascend, suddenly everything is manageable again.

In the next posts, I’ll discuss some of the specific problems with multi-threading and the advantages of using the RapidMind platform.