Parallel Pattern 4: Gather

June 24th, 2009
Posted By Dr. Michael McCool

In this blog series, I am reviewing a set of structured parallel patterns. Compositions of these patterns can be used to implement a wide variety of algorithms, and can be used as the basis for a parallel programming platform.

In this post, I am going to discuss a very simple pattern: gather. A gather operation is a parallel random read of a set of elements from memory. It is often described as a “collective” operation, where an array and a set of indices are provided as input and the output is another array which is the result of all the reads. Gather can also be seen as the combination of a simple “serial” memory read combined with the map pattern discussed previously. Using a random read inside a function used in a map results in a gather. As with other maps, the order in which elements are read in a gather should not affect the outcome of the gather.

Gather is simple but there are a few variants which are worth thinking about, since some of the variants allow for specific optimizations.

First, are the addresses static or dynamic? If the addresses are (relatively) static, then it may be possible to do some analysis to reorder memory accesses to improve performance. For example, when performing many versions of the FFT (Fast Fourier Transform) a “scramble” operation is required that reorders the inputs or outputs. This reordering is given by taking the index of every element in the array and computing a new location by bit-reversing that index. For example, in an 8-point scramble with 3-bit indices, 0 maps to 0, 1 maps to 4, 2 maps to 2, 3 maps to 6, 4 maps to 1, 5 maps to 5, 6 maps to 3, and 7 maps to 7. When performing a map operation, we have some flexibility in the order in which elements of the map are processed. Modern memory systems are most efficient when data is accessed in large, coherent blocks. So we may want to reorder processing so that once a block is read into on-chip memory, we gather all the elements from it that we need before moving onto the next. Of course, a map operation may have multiple gather operations, and also has to write its outputs somewhere. Reordering the elements of a map needs to consider the effect on the efficiency of all inputs and the output. However, if the inputs are static AND the platform knows they are static, there may be significant opportunities for optimization.

There are also some patterns of gather that are regular. For instance, a filtering operation may want to access certain neighbours of a pixel, but for every output pixel needs the same relative neighbourhood. This is a very common pattern in imaging and simulation, and we cover it later under the “stencil” pattern. Another regular pattern, which we call the “partition” pattern, results when input is explicitly broken into a set of coherent regions. Both of these can benefit from specific optimizations.

In general, though, addresses for a gather are computed dynamically, and are irregular. For instance, such a pattern may be needed when traversing a data structure for a ray-tracer. In this case, it is still possible to improve performance by using “extra” parallelism to hide latency. A memory system may have a high bandwidth, but may take a certain amount of time to satisfy each memory request. However, it is often possible to have many outstanding memory requests in flight at the same time. To hide the latency of memory access, we start one thread until it tries to access memory, then while that thread is waiting for memory we start another when, and when it reaches a point where it is accessing memory we start another one, and so forth until the memory request for the first thread can be satisfied. This approach may be supported in hardware: “hyperthreading”, or simultaneous multithreading, is mostly useful for hiding memory access latency. However, the technique can also be implemented in software by unrolling loops and using software pipelining. These optimizations can and should be optimized by a code generator, however, since they are complex and make code hard to understand and maintain. Also, unrolling loops also increases the pressure on other resources, such as cache and registers, so a balance has to be struck.

In summary, the simple operation of gather has a couple of variants with associated optimizations. If a gather is static, we may be able to reorder it for higher performance. If it is dynamic, we may be able to use latency hiding, as long as we do not exceed other resource limits. And finally, if the gather satisfies certain regular patterns, we may be able to apply yet another set of optimizations: I will discuss the stencil and partition patterns in particular in future posts.

Parallel Patterns 3: Map

June 10th, 2009
Posted By Dr. Michael McCool

Structured parallel programming can be based around the composition of common patterns. Over a series of blog postings I plan to discuss each of a small set of (mostly) deterministic patterns for parallel computing. A structured, safe-by-default parallel programming model can be based on the composition of these patterns.

In my last two blog postings, I discussed two particular parallel patterns: superscalar sequence and speculative selection. These are closely related to serial computing; in fact, they represent ways that “implied” parallelism in a serial program can be extracted, if suitable restrictions are placed on the “serial” programming model.  Drawing an analogy with sequential programming, you would expect the next pattern to be iteration. However, while iteration is used in serial programming to express many computations that are potentially parallelizable, it is not itself a parallel pattern. This is because in serial programming, each iteration can depend in arbitrary ways on previous iterations, so in general some loops are parallelizable while others, which can look very similar, are not. For parallel programming, we want explicit constructs that we are sure can be executed in parallel.

If we disallow dependencies between loop iterations, we do get a type of computation that can be parallelized easily: the map pattern. In the map pattern, a set of loop indices are generated and independent computations are performed for every unique index. These computations are not allowed to communicate with one another. At the end of the map, however, there is an implied barrier, and then other operations can depend on the output of the map operation as a whole.

Often data access is associated with the map. In particular, the map may be generated by applying a function to an array and replicating that function over every element of the array. In this case the index set is generated from the set of allowable array indices. In the same way, the outputs of a map may be collected into a contiguous output array. This relationship between processing and data input and output can be diagrammed as follows:

These forms of data access are very coherent in nature and are ideal targets for optimization.  For example, a compiler or parallel programming platform may want to use knowledge of a map’s coherence to generate code to prefetch data into a cache and evict data from cache, or to otherwise pre-schedule data movement. In addition, the map can be hierarchically “strip-mined” to exploit multiple levels of parallelism and multiple parallelism mechanisms. Parallel mechanisms in processors include superscalar instruction issue, pipelining, and parallel memory/compute operations in addition to the more obvious mechanisms such as multiple cores and vector instructions.  Starting from the very simple parallelism structured represented by the map, it is possible for an automated system to generate an efficient implementation that can exploit multiple mechanisms at the same time.  For example, the above map may be implemented on a dual-core processor with 4-way vector units and 4-cycle pipeline latency as follows:

Iteration gives rise to other patterns besides map.  It is possible to parallelize loops in certain cases even when there are dependencies between loop bodies.  However, the strong assertion of independence between iterations provided by the map gives the most opportunity for parallel execution.

It is also common to see the map operation combined with other operations, such as random read or write.   It is also possible to elaborate on the input and output data access patterns, to allow, for example, strided access, multiple inputs and outputs, and neighbourhood operations.   Some of these variants are important enough that I will discuss them in later postings.

Parallel Pattern 2: Speculative Selection

June 3rd, 2009
Posted By Dr. Michael McCool

Over the last couple of weeks, I introduced the idea of structured patterns as a basis for high-level parallel programming, with the notion of blogging about each of about a dozen patterns in turn. Together these patterns constitute a minimal yet broadly applicable set of components for deterministic parallel programming. In this series, this is second such pattern.

The next pattern is selection, which chooses the results from one of two alternative tasks based on a condition. This can be used for speculative parallel execution, where both alternative tasks (as well as the condition) are executed in parallel while the condition is being evaluated. This pattern is most useful for complex conditions that take time to evaluate. It’s also an example of a parallel pattern that may lead to more work overall than serial execution.

Once the condition is evaluated one of the two alternative tasks needs to be cancelled. This means stopping its consumption of resources as soon as possible. It also means ignoring its output. Being able to stop a task before completion is very useful in this context, and can lead to significant savings in execution time.In particular, with cancellation if an infinite number of processors are available execution only takes as long as the “right” path would have taken. In practice, computers don’t have an infinite number of processors, but speculation can still be used if there are processors that would otherwise be idle.

Depending on the programming model, “ignoring the output” can also be challenging. Ideally the tasks are pure functions and so ignoring the return values of the functions representing the tasks is all that is needed. If tasks are allowed to have side effects, these effects need to be rolled back, which can be difficult or impossible to implement. Therefore, in order to expose opportunities for speculative parallelism, pure functions need to be used, or more generally, permanent modifications to global state need to be deferred until the condition has been evaluated.

Speculative execution is used at a very low level in some processor architectures. For example, the destination instruction for a branch might be fetched before the branch condition is evaluated. There is a possibility that some work might be wasted. In fact, the last pattern we discussed, superscalar sequence, is also frequently used in processor designs for out-of-order execution of independent instructions. Many of the other patterns we will present, such as map, are also common in low-level processor design. We are interested in applying patterns to larger tasks than single instructions, but one of the characteristics of a pattern is that it recurs at different scales.

Why we chose LLVM

May 27th, 2009
Posted By Stefanus Du Toit

In a recent post I mentioned that we are transitioning to using the LLVM project for code generation on our x86 backend in our upcoming release. I wanted to take some time to talk about why we chose LLVM.

First, a bit of background. Our x86 backend has been shipping since our 3.0 release around the beginning of 2008, allowing any RapidMind program to take advantage of multi-core x86-family processors. In particular, we support any x86-compatible CPUs that have at least SSE2 support (though if the processor supports it, we will use newer instruction set extensions such as SSE3 as well).

RapidMind backends have three major functions: they manage the execution of tasks, handle allocation and transfer of data, and perform code generation. For the x86 backend, code generation involves compiling the intermediate representation in which RapidMind programs are represented down to x86 machine code. Like all RapidMind backends, the x86 backend performs this compilation step at runtime (typically, when a RapidMind-enabled application starts up).

In 2007, we developed a code generation library called CHAIR. CHAIR is a C++ library (using policy-based design) that includes a retargetable representation for low-level machine code, and code generation algorithms such as a register allocator, an instruction scheduler, and analysis and optimization passes. Today, CHAIR is used in our currently shipping x86 backend and our Cell backend. It is also used in our new CUDA backend that will be shipping with our upcoming release.

When we first started building our x86 backend we decided to focus primarily on good generation of vector code using the various SSE extensions to the x86 instruction set. RapidMind provides a very natural way to express vector code since our primitive Value types are not restricted to be scalar, but can hold any number of components. Our primarily data-parallel model also lends itself really well to auto-vectorization.

While we were quite happy with the results of our vector code generation, we knew that our scalar code generation could be improved. After releasing our x86 backend we found ourselves spending an increasing amount of timing working on scalar code generation issues, as well as general compilation issues such as register allocation (which on x86 microarchitectures can be of huge importance to getting good performance). This was taking time away from working on higher-level optimizations and better parallel expressiveness. So, after re-evaluating the LLVM project in mid-2008 we decided to shift gears and switch to using LLVM instead of CHAIR for x86 code generation.

We chose LLVM primarily for the following reasons:

  1. LLVM is distributed under the Illinois Open Source License, which is nearly identical to the BSD license and is very open for use both by open-source and proprietary projects. Some other projects, such as Open64, are licensed under more restrictive licenses like the GPL, which were not compatible with our own licensing model.
  2. LLVM has good scalar x86 code generation support. We did some tests to compare code generated by our existing backend, gcc, icc, and llvm. In general, llvm did well and generated code comparable with gcc and icc. In some cases, llvm’s code outperformed gcc’s code. After some prototyping we were satisfied with the quality of scalar x86 code generated by LLVM.
  3. LLVM has vector support in its intermediate representation (IR). In the RapidMind IR, vectors are first-class objects - pretty much any instruction can operate on vectors just as it can on scalars. We did not want to compromise on the quality of our vector code. We reported some inconsistencies between scalar and vector operations in the LLVM IR to the mailing list and after some discussions at the 2008 LLVM dev meeting, the LLVM developers fixed the few remaining inconsistencies. Today the LLVM IR is very well-suited for representing vector instructions.
  4. LLVM has a very active developer community. If you subscribe to the llvm-commits mailing list you’ll notice quickly that there are a number of people actively enhancing LLVM, from the open source community as well as commercial contributors from companies such as Apple and Cray. LLVM is rapidly developing on all fronts, and by using it within our product we benefit from these ongoing improvements. Additionally, the LLVM developer community is very responsive. Bugs reported by us during our initial development were usually fixed within a day or two, and when we’d set out to fix small bugs we came across, we often found that the LLVM trunk already had a bug fix and we only needed to update our SVN checkouts.
  5. LLVM’s IR is very well-designed, and its core API is an example of a stellar C++ interface. The intermediate representations of CHAIR and LLVM both embrace Static Single Assignment (SSA) form, which made the principles of the LLVM IR familiar and desirable to us. LLVM’s runtime C++ representation of programs is very cleverly designed, in terms of being easy to use and in terms of being space and time efficient. Several wishlist items for CHAIR’s IR representation were already implemented in LLVM.
  6. LLVM’s IR is designed to be self-contained and serializable to a human-readable (or at least compiler-developer-readable :) format that can be processed by a variety of command-line tools it supplies. This means we can dump the IR generated for a RapidMind program and perform experiments such as testing optimization passes on it very conveniently. LLVM’s bugpoint tool is a huge debugging aid as well, greatly simplifying the usually tedious and error-prone task of reducing code generation bug test cases.
  7. LLVM is designed as a set of libraries, and runtime compilation is a normal use-case for it. Since we perform compilation at runtime, and the RapidMind platform is provided as a C++ library, we needed a code generation project that did not rely on static compilation or using the filesystem for storage of data between compilation passes. Furthermore, passes in LLVM such as its register allocator are designed with the performance demands of runtime compilation in mind.

In addition to these points, LLVM provides us with the potential to more easily build backends for the other targets it supports and makes it very simple for us to call out to existing C/C++ functions from a code generation perspective.

There are a few things we’d love to see improved in LLVM. Windows (in particular, Visual Studio) support is lagging behind some of the other platforms, although this has improved considerably recently and been helped by the introduction of a CMake-based build system. While the LLVM intermediate representation supports vector operations well, the backends have not quite caught up yet and we’ve had to implement a number of vector operations using a pass transforming from generic LLVM IR to x86/SSE intrinsics. That said, developers in the LLVM community more familiar with its backend architecture (one of the areas of LLVM we’ve found slightly harder to work on) have been making great strides in this area, and we hope to eventually turn our intrinsics work into full support for these instructions in the LLVM x86 code generator. Right now, we’re busy putting the finishing touches on our llvm-based x86 backend and are excited to release it.

We’re grateful to the LLVM community for all the support they’ve provided to us, and above all, for an excellent code base to use to solve a difficult and time-consuming problem. We’re happy to spend less time on low-level code generation, and are gearing up to tackle some interesting higher-level challenges once again.

Parallel Pattern 1: Superscalar Sequences and Task Graphs

May 26th, 2009
Posted By Dr. Michael McCool

In my last post I introduced the idea of structured patterns as a basis for high-level parallel programming. Over the coming weeks I will be blogging about each of about a dozen patterns in turn. Together these patterns constitute a minimal yet broadly applicable set of components for deterministic parallel programming.

The first pattern I will present is one that at first glance is serial: the sequence. However, a sequence of operations in a program, if it can be reordered while respecting data dependencies, can in fact express fairly general forms of task parallelism.

Suppose the following sequence of functions is invoked. Capital letters denote variables and lower-case letters denote (pure) functions:

B = f(A)
C = g(B)
E = f(C)
F = h(C)
G = g(E,F)
P = p(B)
Q = q(B)
R = r(G,P,Q)

The data dependencies in this code generate the following graph:

Diagram 1

Although the input code is conceptually serial, the data dependencies in the graph potentially allow several operations to execute in parallel, such as p and q and f and h. In fact, the entire subgraph consisting of {g, f, h, g} can also execute in parallel with the subgraph given by {p,q}. If the interface allows the system to clearly identify data dependencies, then a platform can automatically extract any latent parallelism in this pattern.

Note, however, the need for pure, side-effect free functions. In particular, data access needs to be limited to clearly defined inputs and outputs. Certain language features, such as global pointers able to access anything in memory, would create global data dependencies. These need to be avoided in order to make automatic transformation of such sequences into parallel task graphs possible. Also, graphs containing loops are not possible. I will discuss other patterns that can be composed with sequences in later posts to express the concepts of recursion and looping (iteration).

The “superscalar sequence” pattern is related to the idea of “futures”. Futures are objects that explicitly represent in-progress asynchronous computations and typically provide the ability to check for completion (with or without blocking). Futures may also provide operations to check completion percentages of computations and to cancel operations in progress. Futures are also occasionally useful for fine-grained control of asynchronous operations, but in general future-like controls can also be built into the data objects of the sequence pattern and only invoked when necessary.

C++0x and parallelism

May 20th, 2009
Posted By Stefanus Du Toit

C++ is evolving. The next version of the C++ standard, currently referred to as “C++0x” (the x being a variable depending on when the new standard is completed), is going to provide a number of interesting new features and language changes. I thought I’d take some time to talk about changes in C++0x that are likely to be relevant to parallelism and programming multi-core machines.

Threading and the memory model

The most obvious parallelism-related changes in C++0x are the new thread-aware memory model as well as the addition of threads and related concurrency primitives to the standard library. The N2857 draft (PDF link) includes:

  • A memory model that specifies how multiple interacting threads affect shared memory and can be appropriately synchronized
  • A thread class to allow an application to spawn and manage threads
  • A library of classes and functions to access atomic instructions such as compare-and-exchange
  • Classes for mutual exclusion and condition variables
  • Classes to manage futures, handles for results which are not yet available that will typically be computed asynchronously in other threads
  • The thread_local keyword to indicate variables with thread storage duration

These new facilities should be really useful to those of us writing explicitly threaded code. Explicit multi-threading is not a particularly useful way to express parallelism (see these blog entries for a few reasons why this is the case). However, explicit threading definitely has its uses in concurrency (for example in user interface programming, where multiple long-running tasks are typically occurring simultaneously) and in implementing the low-level portions of parallel programming platforms such as ours. Today we rely on (typically mutually incompatible) operating system facilities to access these features; C++0x will allow us to do so from within portable, standard C++ code.

The new memory model defines what expectations a program can reasonably have when multiple threads are interacting. This allows tools like compilers to reason about the behaviour of a multi-threaded program. Without this model, it is easy for a compiler to introduce optimizations that are safe in serial code, but become unsafe in multi-threaded code (see this paper by Hans Boehm for concrete examples). Today explicitly-threaded programs deal with these issues with inappropriately restrictive features like the “volatile” keyword or overly aggressive locking.

I’m excited about these concurrency-related changes in C++0x, but I think they will be more directly applicable to concurrent applications than parallel computation. Having language-level interfaces for concepts such as locks will hopefully also help improve interoperability between threaded applications.

Lambda functions

C++0x introduces a new core language feature, the ability to define lambda functions. This allows applications to define functions directly in places where expressions are used. For example, in C++0x it will be valid to write the following:

// Warning: untested code! :)
typedef pair<int, int> IntPair;
std::vector<IntPair> v;

// Sort v by the second element of the pair
std::sort(v.begin(), v.end(),
          [](const IntPair& a, const IntPair& b) { return a.second < b.second; });

The funky-looking construct in the last parameter is a lambda expression. In this particular case we are passing a comparison function to the standard library sort() function that compares only the second elements of two pairs of integers. In C++ today we would have to define this as a function or functor outside of the body of code calling sort().

The reason this relates to parallelism is that platforms and libraries for parallelism often require the developer to pass in functions representing tasks or other kinds of computation. Having the syntax of lambdas available makes it possible to use these kinds of interfaces with a lot less boilerplate code. For example, a parallelism library might contain a parallel “for each” function that one might use with lambdas as follows:

float a[1000], b[1000], c[1000];

// In parallel, sum all of the elements from a and b into c.
parallel_for_each(1000, [&](int index) { c[index] = a[index] + b[index]; });

Our hypothetical parallel_for_each function takes a number of instances and a function to execute. In this case we ask for 1000 instances, and pass in a function that runs for each index. Note that we’re specifying more than just a simple function here: we’re actually passing a closure to parallel_for_each, as our lambda expression contains references to local variables. The “&” in the square braces that start the lambda expression specify that these are to be captured by reference at the time the expression is evaluated.

Concepts

Concepts are probably the largest addition to C++0x. Concepts provide core language facilities for generic programming, a technique that has gained a lot of ground in the C++ community. For a nice introduction to concepts, see this talk by Doug Gregor.

A concept is a description of a set of requirements that classes might satisfy. The C++0x draft provides a number of predefined concepts (e.g. LessThanComparable and FloatingPointLike) and allows you to define your own concepts. Concepts can be marked “auto”, which means that the compiler will automatically consider a class as satisfying a particular concept as long as the class meets the requirements. If a concept is not marked as auto, a concept map can be used to state that a class satisfies a particular concept and exactly how it does so. It’s worth noting that the concept map can be separate from the class declaration, so you can retrofit legacy code with concepts easily.

When you write a template, you can specify that certain template parameters must satisfy certain concept requirements. If you attempt to pass in a type that does not satisfy those requirements, your compiler can provide a meaningful error message, telling you exactly which requirements are not satisfied.

Concepts have fairly little relation to parallelism by themselves, but they are of great interest to us at RapidMind. RapidMind’s current C++ frontend consists mostly of a number of class and function templates. We’re not doing anything out of the ordinary, like template metaprogramming, but because we provide basic types for things like containers of floating point and integer data, our core types must be templated fairly extensively to work with any kind of data. We’ve spent a lot of effort to make these types work as naturally as possible within regular C++ code, but this hasn’t always been trivial and compiler errors tend to generate gargantuan verbose error messages, as is often the case with template code. Concepts will allow us to more easily extend our frontend and provide better error messages to our users.

Beyond C++0x

Most other changes in C++0x are unrelated to parallelism. Some changes are generally performance-oriented, such as move constructors and constexpr functions. The new time-related facilities such as classes for clocks and durations will be helpful for timing measurements. Numerous changes, e.g. the new auto type specifier, should make any C++ developer’s life easier. Overall we’re excited at RapidMind about the changes in C++0x and think it’s heading in a great direction.

What’s going to happen after C++0x? It’s clear that the C++ standard could go much further in supporting high-level parallel programming. We’ve seen a strong interest amongst software developers in a high-level, open, standards-based approach to targeting multi-core processors and accelerators. One obvious avenue for such a development is language standards themselves. For this reason I recently joined the C++ working group to see whether it makes sense to consider high-level parallel computing after C++0x. Of course there are other avenues too, and we are equally interested in possibilities such as extending OpenMP to map to accelerators, or providing a higher-level layer within OpenCL.

There are a number of interesting options for extending C++ with parallelism. We’ve thought about ways in which an approach similar to that of the RapidMind platform, but with closer integration to the language itself, might look. The idea of ranges instead of iterators (PDF link) as proposed by Andrei Alexandrescu would likely open up the standard library to a lot more parallelism. There have been numerous other proposals for providing higher-level parallel constructs in C++.

Personally, I’m excited about the improvements C++0x will bring to the language. But I’m even more excited about the possibility of evolving C++ to make parallel computing easier and more performant than ever before. As C++0x comes to completion, we can hopefully start having those discussions and see where they lead us.

References

For more information on C++0x, have a look at:

Structured Patterns: A Basis for a High-Level Parallel Programming Standard

May 15th, 2009
Posted By Dr. Michael McCool

A high-level standard is needed for parallel programming that addresses the needs of large-scale software development.   Existing standards and standard proposals do not provide the required levels of reliability and maintainability; an alternative approach is required.   A top-down approach based on the composition of structured, deterministic patterns of parallelism can satisfy the needs of large-scale software development while still providing high levels of scalability and performance.

Recently there has been a move towards standards for many-core computing.   However, almost all systems and standards in parallel computing to date have been very mechanism-oriented.   A mechanism-oriented standard is designed bottom-up, putting a thin layer of abstraction over features that are motivated primarily by existing hardware architectures.

In contrast, our philosophy has been application-oriented.  An application-oriented system considers the needs of application algorithms first and the natural ways for these to be expressed.  Requirements that large-scale applications have that are not being addressed by mechanism-oriented system and standards proposals are the need for safety, structure, and determinism.

We have demonstrated that it is possible to satisfy these needs in a high-performance system.  Our approach has been based on the identification of structured patterns of computation in application workloads that can be reliably mapped onto high-performance implementations by an automated system. Many of these patterns happen also to be deterministic, and their composition results in a safe and deterministic but high-performance implementation.

Programming models and platforms can be based on these patterns, and can provide a level of automated support for efficient implementation and optimization. Programming platform interfaces that can allow structured patterns of parallel computation to be expressed also allow the programming system to better capture the intent of the software developer, especially with regard to memory coherence.

This more application-oriented, structured approach to designing and implementing parallel algorithms via a supporting platform is particularly relevant for many-core processors with a large amount of parallelism. While improving the productivity of experts, specific patterns and fused combinations of patterns can also guide relatively inexperienced users to developing efficient algorithm implementations that have good scalability.

This approach to parallelism can include a unified approach to developing parallel software.  Deterministic patterns can include both collective “data-parallel” patterns such as map and reduce as well as structured “task-parallel” patterns such as superscalar task graphs. The structured pattern based approach, like data-parallel models, addresses issues of both data access and parallel task distribution in a common framework.  Optimization of data access is important for both many-core processors with shared memory systems and accelerators with their own memories not directly attached to the host processor.

Ultimately, the industry needs to see new or existing standards evolve to address the parallelism challenge from the application point of view.  With this approach, both data and task parallel algorithms can be mapped onto patterns that can then be executed efficiently on both multi-core processors and accelerators.

We have identified on the order of a dozen patterns that are useful for deterministic parallel computing. Considering the complexity and importance of the problem, this is not actually a large set, but is too much for a since post.   I will therefore be posting about each of these in turn over the coming weeks, along with examples of their usage.  I will also be discussing how these simple patterns can form the basis for a solid high-level standard for highly productive and modular parallel computing.

Medical Imaging Webinar on May 13th

May 4th, 2009
Posted By Dr. Michael McCool

I’ll be giving a webinar on medical imaging on May 13th at 2:00 PM ET.   The announcement is below; see the bottom for a registration link.   Due to how we’re hosting this, space really IS limited, so if you want to attend, please register soon.

Rapidmind has been doing a lot lately in the medical imaging area and are also going to be launching a suite of volume processing components.   These components are designed to solve specfic problems such as registration using pre-packaged, high-performance implementations.  I’ll be talking about these components and our performance results.   I will also discussing some other case studies and how to use the RapidMind platform itself to develop your own high-performance medical imaging algorithm implementations, using algebraic reconstruction as an example.

Michael McCool


Are you ready to process higher volumes of data fast enough to meet your customers’ expectations?

> Do you need to deliver advanced medical image processing but you are being limited by performance on larger volumes?
> Do you need  to focus on key differentiators in your product, not on the complexities of programming multi-core processors and GPUs?

Our customers are telling us:

> GPUs and multi-core CPUs provide tremendous performance but they are difficult to program
> GPU performance results are unpredictable
> Software organizations need to limit features or data sizes in order to achieve performance
> Development of new features on multi-core processors and accelerators is taking too long
> Applications need to have deployment flexibility and not be locked into one hardware target
> Development effort needs to be future-proofed and scalable
> They don’t want to maintain multiple code bases in order to support both x86 and GPU

Attend this webinar and find out how some medical imaging companies are delivering state of the art performance on medical imaging applications including segmentation, registration, and rendering. This presentation will discuss the issues facing Medical Imaging applications and explore various solutions to simplify application development with:

> Software components and libraries that solve some of the challenges limiting functionality in the volume processing pipeline
> Accelerated components that provide core building blocks for Medical Imaging applications
> A flexible platform that allows your own algorithms to be expressed in a portable and performant fashion

Make plans for this free webinar—Space is limited!

May 13 at 2:00 PM ET
Join Dr. Michael McCool, Chief Scientist at RapidMind as he explores the trends in medical imaging and next generation parallel programming techniques and algorithms.
Registration: https://www2.gotomeeting.com/register/885782906

Upcoming platform release

April 20th, 2009
Posted By Stefanus Du Toit

It’s been a while since my last post! Rest assured that we have been busy at RapidMind. I thought I’d take a moment to talk about a few things we’ve been working on that will be appearing in an upcoming release.

Our next product release will include several major backend-related changes. One was already announced in November of last year: we have adopted the LLVM project to perform code generation in our x86 backend. Previously we were using our own proprietary code generation stack on x86. In order to spend less time working on backend code generation and leverage the great work done in LLVM, we have switched to using LLVM for our low-level x86 code generation. For our users, this will be a mostly invisible change, since this is very much an under-the-covers improvement. RapidMind applications, in particular ones that use a lot of local arrays or perform many scalar computations (e.g. address calculations), should see a performance boost from LLVM if they use the x86 backend.

We’re also adding an entirely new backend to our product. Our next release will include a backend for NVIDIA’s CUDA architecture. This will allow us to take better advantage of NVIDIA GPUs and accelerators. While we already support NVIDIA GPUs via OpenGL, using CUDA allows us to get access to the GPU hardware in a way that is optimized for general-purpose computations rather than graphics. This leads to less overhead in launching computations, more opportunities for optimization, and access to “compute-focused” features like double-precision floating point computations.

Our Cell backend will also include some major improvements in the next release. Apart from a number of performance optimizations and bug fixes, we are adding full support for double-precision floating point computations. We have also added support for the PowerXCell 8i version of the Cell Broadband Engine, which provides enhanced double-precision performance.

In addition to these backend improvements we have added a few new frontend features, improved performance of the platform itself, and of course fixed a number of bugs. Our documentation has been overhauled as well. We’ll be making announcements about availability of the new release within the next few months.

It’s been a busy year already, but I’m really excited about a few of the other things we have planned for the rest of this year that I’m not ready to talk about yet. Watch this space!

SC08 Review

November 28th, 2008
Posted By Stefanus Du Toit

As mentioned in my last blog post, RapidMind was at SC08, 2008’s Supercomputing conference and exhibition, last week. We had a beautiful booth on the showfloor (pictured below) and sent a number of people to the event. Read on for my impressions of the week.

With every conference we’ve attended over the past few years, there’s a clear trend: the industry is becoming more and more aware of the need to tackle multi-core computing, and the potential of using many-core accelerators such as Graphics Processing Units (GPUs) and the Cell BE processor to improve performance. This has never been so clear as at this year’s SC08. We had a constant stream of people coming to our booth, and it seemed like almost all of them had an immediate need to tackle multi-core and many-core computing.

Read the rest of this entry »