Archive for the ‘General’ Category

Performance: What’s it For?

Thursday, May 1st, 2008
Posted By Dr. Michael McCool

It almost seems like a silly question: what good is higher performance? The answer depends on the context. Supercomputers, of course, are built for high performance, often expressly to run application workloads that demand it. But what good is high performance on ordinary servers and desktops, and for what kinds of applications?

This is an important question for RapidMind, since we help developers squeeze every drop of performance out of all the processors in their computers. Fortunately, many applications need or want all the performance they can get.

(more…)

Why the Future isn’t Flat

Wednesday, April 23rd, 2008
Posted By Dr. Michael McCool

For millennia people thought the earth was flat. This was a convenient illusion, and sufficient at small scales. However, at larger scales it is necessary to abandon this illusion, and professional sailors and pilots have to use spherical geometry for long-range navigation, or risk getting lost.

Similar convenient illusions exist in programming. The two most important ones are serial execution and constant memory access time.  Of course multi-core processors shatter the serial execution illusion, but in this post I’d like to focus on the second illusion: constant-time memory access.  In addition to parallelism, professional software developers have to face the reality that scalable memory systems have variable access times that depend strongly on locality and isolation.  To build successful, scalable software, the mental models of computation used by software developers have to grow to include this concept.

The constant-access-time “flat” memory illusion is as important as the serial illusion. In a flat memory, every location in memory can be accessed with exactly the same cost, independent of the order of access. For decades, this simple cost model has been used as an implicit assumption in the design of computer programs. Even theoretical computer science, which analyzes the best-case asymptotic complexity of algorithms for solving various abstract problems, is primarily based on this illusion. Unfortunately, it is just an illusion, although computer architects have developed many clever mechanisms to maintain it. The underlying problem is that the latency of accessing a random word in external main memory (DRAM) is very high relative to the processor’s cycle time, by two orders of magnitude or more. A computer using a memory system consisting only of DRAM would be intolerably slow, so modern machines instead have a memory hierarchy, where copies of certain parts of the memory space are kept in faster, smaller cache memories. If the most frequently accessed data can be kept in the fastest cache memories, then on average a low access cost can be achieved. The memory hierarchy exploits the fact that typical programs exhibit spatial and temporal locality. This mechanism has been reasonably successful at maintaining the flat memory illusion in serial computers. However, even in serial computers significant performance gains are possible by designing programs using a more realistic memory model. For example, it is worthwhile to use data structures and algorithms that are designed to have high levels of spatial and temporal locality.
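To make the effect of spatial locality concrete, here is a minimal sketch that counts misses in a toy direct-mapped cache for two traversal orders of the same matrix. The cache geometry (64 lines of 8 words), the matrix size, and the function name count_misses are all illustrative assumptions, not a model of any particular processor.

```python
def count_misses(addresses, num_lines=64, line_words=8):
    """Count misses in a toy direct-mapped cache (sizes are illustrative assumptions)."""
    tags = [None] * num_lines          # which memory block each cache line currently holds
    misses = 0
    for addr in addresses:
        block = addr // line_words     # memory block containing this word
        slot = block % num_lines       # direct-mapped: each block has exactly one slot
        if tags[slot] != block:        # block not resident: fetch it and evict the old one
            tags[slot] = block
            misses += 1
    return misses

ROWS, COLS = 256, 256                  # a matrix stored row by row in memory

# Same set of addresses, visited in two different orders.
row_major = [r * COLS + c for r in range(ROWS) for c in range(COLS)]
col_major = [r * COLS + c for c in range(COLS) for r in range(ROWS)]

row_misses = count_misses(row_major)
col_misses = count_misses(col_major)
print(row_misses, col_misses)
```

Under these assumptions, walking the matrix in storage order misses once per eight-word line, while walking it column by column misses on every access, even though both loops touch exactly the same data and do the same amount of "work" in the flat model.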

In parallel computers, including those based on multi-core processors, the situation becomes much more complex. Memory systems for parallel computers are often divided into two types: shared memory and distributed memory. In reality, though, shared memory is yet another convenient illusion. Hardware is distributed by nature, and scalable shared memory has to be simulated by the computer architect on top of a physically distributed memory. A naive implementation of shared memory based on a single shared physical resource simply cannot scale beyond a small number of cores. Instead, memory resources need to be partitioned into banks and access times will vary depending on how “far away” a memory bank is from a core and how many other cores are trying to access it at the same time.  This is usually called NUMA, or non-uniform memory access. In addition, hierarchical memory systems that maintain copies of data in different places (for example, in multiple caches) now have to ensure that those copies remain consistent. The coherency protocols for maintaining this consistency require significant interprocessor communication.

Memory systems are complex, but the net effect of these considerations is that in order to scale, programs have to be written with high degrees of data locality and data isolation. The more local data can be reused, the better the memory hierarchy can be exploited, and the less chance there is for conflicts when accessing a shared off-chip resource. Likewise, when computations can isolate their data, then unnecessary efforts to maintain consistency can be avoided. Data isolation and data locality also make it easier to parallelize computations, since then they can be reordered without conflict.
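As a rough illustration of data isolation, the sketch below splits an array into disjoint chunks, lets each worker reduce only its own chunk into a private result, and combines the partial results once at the end. The helper names (chunked, partial_sum, isolated_sum) are hypothetical, and Python threads are used only to show the structure; in CPython they would not be expected to deliver a speedup for this kind of loop.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(data, n_chunks):
    """Split data into at most n_chunks contiguous, non-overlapping slices."""
    size = max(1, (len(data) + n_chunks - 1) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum(chunk):
    """Reduce one chunk using only local state: no shared variable is written."""
    total = 0
    for x in chunk:
        total += x
    return total

def isolated_sum(data, n_workers=4):
    """Sum data by giving each worker its own chunk and combining results once."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunked(data, n_workers)))
    return sum(partials)               # the only point where results meet

print(isolated_sum(list(range(1000))))
```

Because no worker writes to data that another worker reads, the chunks can be processed in any order, and the only communication is the final combination of one number per worker.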

Unfortunately, many parallel programming models focus on the computations, without considering the necessary interactions of these computations with data. In fact, to a first approximation, scalable parallel programs should be designed around their flows of data (locality, isolation, and dependencies), not around their computations. Compared to the cost of data movement and the scalability problems of poor data isolation, actual computation is practically free.

The world is not flat. It would be more convenient if it were, but it’s not. The illusion of constant-time memory access needs to be replaced with more accurate conceptual models in programmers’ minds. Fortunately, it is not necessary (or productive) for a programmer to deal with all the low-level details of the hardware. However, a conceptual model that includes data locality and data isolation is essential for getting the best out of today’s processors and those of the future.

The difference between multi-core and multi-processing

Thursday, April 17th, 2008
Posted By Stefanus Du Toit

When discussing the shift to multi-core, I often hear people ask why multi-core, which is relatively new, is so different from multi-processing, which has been with us for decades.

First, let’s start with the basics. Multi-processing simply means putting multiple processors in one system. Symmetric Multi-Processing, or SMP, implies that all of these processors are identical, also known as a homogeneous system. SMP systems have been around in the x86 world for a very long time, and there are software systems that take advantage of SMP well.

From a technical standpoint, the difference between multi-core and SMP is relatively minor. In an SMP system, each processor plugs into a different socket, and multiple processors are connected through some kind of bus. In a multi-core processor, the “core” logic of a processor is replicated multiple times on the same chip. Multiple cores may share data through some on-chip logic or shared caches. Multiple cores are presented to applications at the OS level exactly the same way as multiple processors in an SMP system. Furthermore, you can mix the two together, e.g. by having an 8-core system with two processors, each containing four cores.
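A small sketch of what this looks like from an application's point of view, using Python's standard os.cpu_count(). The variable name logical_cpus is just for illustration.

```python
import os

# os.cpu_count() may return None if the count cannot be determined,
# so fall back to 1 in that case.
logical_cpus = os.cpu_count() or 1
print(f"The OS reports {logical_cpus} logical CPU(s)")
```

The single number returned here does not say whether the CPUs are separate sockets, cores on one chip, or a mix of the two, which is the sense in which the two kinds of system look the same to an application.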

So, why is programming multi-core considered so much more of a problem than programming SMP systems? It’s not because of some fundamental technical difference between multi-core and SMP. It’s because of the reason why these technologies exist.

(more…)

Welcome everyone, to the official RapidMind blog site

Monday, April 14th, 2008
Posted By Dr. Michael McCool

Welcome, everyone, to the official RapidMind blog site. Stefanus Du Toit and I, co-founders of RapidMind, will be using this venue to comment on technology trends and events in the areas of multi-core processors, accelerators, and parallel programming. There’s a revolution going on in software development, and we are an active part of it. We look forward to being able to share some of our experiences from the front.

Multicore Expo 2008: Power Management and the Trouble with Debugging

Monday, April 14th, 2008
Posted By Dr. Michael McCool

I just came back from the Multicore Expo, a small, focused conference held in Santa Clara from April 1st to 3rd. This conference targeted multicore architectures and programming, with a particular emphasis on the embedded space. A lot was covered in this conference, and I can’t go into it all in one post. However, I will touch on two topics: the use of multicore for power management in the embedded space, and various aspects of the multicore software ecosystem, in particular, the problem of debugging.

By small, I mean the attendance was on the order of 500. In this case small is good, though, since the technical density was quite high. Also, there were representatives at this conference from hardware vendors, software vendors, and application developers, so it was possible to have useful discussions involving the entire stack. These discussions were not without controversy, and on the very first day of the conference a fistfight actually broke out among the members of a panel representing each of these different groups. The martial arts expertise shown by the Wind River representative in particular was outstanding, although just a bit suspicious. The AMD panelist held his own, although he took a beating from Intel for a while. Of course it was just an April Fool’s gag, involving some black-belt ringers. Most of the sparring taking place during the rest of the conference was verbal.

On the hardware side, the embedded space uses a variety of processors, including Power (which is dominant in the networking appliance space and was represented by Freescale), ARM (dominant in mobile but also with a strong presence in many other embedded spaces), Texas Instruments (TI) (whose DSP processors are widely used in telephony infrastructure), and of course Intel. Embedded processors have gone multicore, but for slightly different reasons than on the desktop. In the embedded space, the focus is on power, and multicore architectures give embedded processors a variety of mechanisms to trade off performance for power. Due to the non-linear relationship between power and frequency, with power increasing as the cube of the frequency in the worst case, it is often more power-efficient overall to run an application on a number of down-clocked cores than to run it on a single core at full frequency. Alternatively, cores can be turned off completely when their performance is not required. A TI representative made an interesting comment: in the embedded space they don’t have a power “wall”, but rather a “bonfire”. In the desktop space, frequency has been cranked up so much that it is impossible to go further, and multicore is forced. In the embedded space, they switched to multicore far in advance of the wall: you only run into a wall if you run towards it at full speed, close your eyes, and have a strong ability to ignore the obvious. Instead, in order to maximize efficiency, power/frequency scaling should be treated more like a bonfire: it’s uncomfortable if you get too close, so if you’re smart you find ways to keep an appropriate distance.
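The arithmetic behind the down-clocking argument can be sketched in a few lines. This is a back-of-the-envelope model only: it assumes power scales as frequency raised to a fixed exponent (3 in the worst case mentioned above), that the workload parallelizes perfectly, and that running n cores at 1/n of the frequency gives the same throughput as one core at full frequency. The function name relative_power is hypothetical.

```python
def relative_power(n_cores, exponent=3.0):
    """Total power of n_cores, each clocked at 1/n_cores of the base frequency,
    relative to a single core at the base frequency.

    Assumes power ~ frequency**exponent and perfect parallel scaling,
    so both configurations deliver the same throughput.
    """
    per_core = (1.0 / n_cores) ** exponent   # each core runs slower and cooler
    return n_cores * per_core                # but there are n_cores of them

for n in (1, 2, 4):
    print(n, relative_power(n))
```

With the cubic exponent the total works out to 1/n² of the single-core power, so two half-speed cores would use a quarter of the power in this idealized model. Real workloads and real silicon will fall short of this, since the cubic relationship is a worst case and parallel speedup is rarely perfect.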

There was also a lot of discussion around programming models and tools, with several software vendors tackling different aspects of the problem. Operating system vendors were showcasing multicore-enabled versions of their systems with enhanced mechanisms for core affinity. Virtualization tools were presented as a way to consolidate multiple-processor applications that use different operating systems (such as a real-time OS data plane and a Linux control plane) onto a single multicore processor. However, it was still recognized that in many cases it is important to improve the performance of a single task in a system via internal parallelization of that task. Parallelization among tasks doesn’t really scale and is also not useful if the application requires only one task at a time.

Tools vendors were focused on the debugging problem: how do you view multiple contexts at once and track down the original source of an error that occurs in parallel, due perhaps to a race condition? Unfortunately, the traditional breakpoint approach runs into a fundamental limitation: multiple cores cannot all be stopped at the same moment that one of those cores hits a breakpoint. Due to wire delay it can take hundreds of cycles to stop other cores, resulting in significant “slew” that can vary from run to run. TI proposed a solution where a trace could be stored in an on-chip buffer, but this trace still needs to be analyzed to figure out the alignment and determine the root cause of the problem, although at least printf-like functionality is possible without impacting timing. Even with this, trying to manually track down bugs using traditional mechanisms clearly will not scale to more than four cores, and debugging race conditions and other “new” forms of bugs is extremely difficult even at that scale.
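To show why a race condition is so sensitive to timing, here is a small deterministic sketch that replays two hand-written schedules of the same load/add/store steps from two threads incrementing a shared counter. Nothing here runs in parallel; the schedules and the run function are illustrative assumptions used to make the interleaving visible.

```python
def run(schedule):
    """Replay a schedule of (thread_id, op) steps against one shared counter."""
    shared = 0
    local = {}                          # each thread's private register
    for tid, op in schedule:
        if op == "load":
            local[tid] = shared         # read the shared value
        elif op == "add":
            local[tid] += 1             # modify the private copy
        elif op == "store":
            shared = local[tid]         # write the private copy back
    return shared

# Thread A finishes its increment before thread B starts.
serial = [("A", "load"), ("A", "add"), ("A", "store"),
          ("B", "load"), ("B", "add"), ("B", "store")]

# Both threads load the old value before either one stores.
interleaved = [("A", "load"), ("B", "load"), ("A", "add"),
               ("B", "add"), ("A", "store"), ("B", "store")]

print(run(serial), run(interleaved))
```

The first schedule ends with the counter at 2, the second at 1, because one update is lost. On real hardware the schedule is chosen by timing, so anything that perturbs timing, including the slew introduced when a debugger halts the cores, can change which outcome is observed.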

However, there was a light at the end of the tunnel: several other vendors, ourselves included, presented languages and platforms supporting programming models that simply avoid parallelism-specific bugs by construction, effectively making them impossible. This approach means that long and possibly indeterminate debug cycles can be avoided. In fact, the conference opened with an analyst from Venture Development Corporation (VDC) presenting the results of a survey that showed that better multicore programming models were at the top of the list of priorities for embedded application developers.