Multicore Expo 2008: Power Management and the Trouble with Debugging

I just came back from the Multicore Expo, a small, focused conference held in Santa Clara from April 1st to 3rd. The conference targeted multicore architectures and programming, with a particular emphasis on the embedded space. It covered a lot of ground, and I can’t go into it all in one post. However, I will touch on two topics: the use of multicore for power management in the embedded space, and various aspects of the multicore software ecosystem, in particular the problem of debugging.

By small, I mean the attendance was on the order of 500. In this case small is good, though, since the technical density was quite high. Also, there were representatives at this conference from hardware vendors, software vendors, and application developers, so it was possible to have useful discussions involving the entire stack. These discussions were not without controversy, and on the very first day of the conference a fistfight actually broke out among the members of a panel representing each of these different groups. The martial arts expertise shown by the Wind River representative in particular was outstanding, although just a bit suspicious. The AMD panelist held his own, although he took a beating from Intel for a while. Of course it was just an April Fool’s gag, involving some black-belt ringers. Most of the sparring taking place during the rest of the conference was verbal.

On the hardware side, the embedded space uses a variety of processors, including Power (which is dominant in the networking appliance space and was represented by Freescale), ARM (dominant in mobile but also with a strong presence in many other embedded spaces), Texas Instruments (TI) DSPs (widely used in telephony infrastructure), and of course Intel. Embedded processors have gone multicore, but for slightly different reasons than on the desktop. In the embedded space, the focus is on power, and multicore architectures give embedded processors a variety of mechanisms to trade off performance for power. Due to the non-linear relationship between power and frequency, with power increasing as the cube of the frequency in the worst case, it is often more power-efficient overall to run an application on a number of down-clocked cores than on a single core at full clock speed. Alternatively, cores can be turned off completely when their performance is not required. A TI representative made an interesting comment: in the embedded space they don’t have a power “wall”, but rather a “bonfire”. In the desktop space, frequency has been cranked up so much that it is impossible to go further, and the move to multicore was forced. In the embedded space, they switched to multicore far in advance of the wall: you only run into a wall if you run towards it at full speed, close your eyes, and have a strong ability to ignore the obvious. Instead, in order to maximize efficiency, power/frequency scaling should be treated more like a bonfire: it’s uncomfortable if you get too close, so if you’re smart you find ways to keep an appropriate distance.
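To make the power argument concrete, here is a rough back-of-the-envelope sketch. It assumes the worst-case cubic relationship mentioned above, perfect parallel scaling of the workload, and no static (leakage) power; none of these hold exactly in practice, and the function names are mine, purely for illustration.

```python
def relative_power(cores: int, freq_scale: float, exponent: float = 3.0) -> float:
    """Total dynamic power relative to one core at full clock.

    Assumes power per core grows as freq_scale ** exponent (cubic in the
    worst case) and ignores leakage power entirely.
    """
    return cores * freq_scale ** exponent


def relative_throughput(cores: int, freq_scale: float) -> float:
    """Throughput relative to one core at full clock.

    Assumes the workload parallelizes perfectly, which is optimistic.
    """
    return cores * freq_scale


if __name__ == "__main__":
    # Compare one full-speed core against down-clocked multicore setups
    # that deliver the same idealized throughput.
    for cores, scale in [(1, 1.0), (2, 0.5), (4, 0.25)]:
        print(cores, "core(s) at", scale, "x clock:",
              "throughput", relative_throughput(cores, scale),
              "power", relative_power(cores, scale))
```

Under these idealized assumptions, two cores at half the clock match the throughput of one full-speed core at a quarter of the power. Real savings will be smaller once leakage and imperfect parallel scaling are taken into account.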

There was also a lot of discussion around programming models and tools, with several software vendors tackling different aspects of the problem. Operating system vendors were showcasing multicore-enabled versions of their systems with enhanced mechanisms for core affinity. Virtualization tools were presented as a way to consolidate multiple-processor applications that use different operating systems (such as a real-time OS data plane and a Linux control plane) onto a single multicore processor. However, it was still recognized that in many cases it is important to improve the performance of a single task in a system via internal parallelization of that task. Parallelization among tasks doesn’t really scale and is also not useful if the application requires only one task at a time.

Tools vendors were focused on the debugging problem: how do you view multiple contexts at once and track down the original source of an error that occurs in parallel, due, perhaps, to a race condition? Unfortunately, the traditional breakpoint approach runs into a fundamental limitation: multiple cores cannot all be stopped at the same moment that one of those cores hits a breakpoint. Due to wire delay it can take hundreds of cycles to stop other cores, resulting in significant “slew” that can vary from run to run. TI proposed a solution where a trace could be stored in an on-chip buffer, but then this trace still needs to be analyzed to figure out the alignment and determine the root cause of the problem - although at least printf-like functionality is possible without impacting timing. Even with this, trying to manually track down bugs using traditional mechanisms clearly will not scale to more than four cores, and debugging race conditions and other “new” forms of bugs is extremely difficult even at that scale.
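To give a feel for the trace alignment step, here is a minimal sketch of merging per-core trace buffers into a single timeline. It is not TI's mechanism, which I don't have details on. It assumes every event carries a timestamp from a counter shared by all cores, which is exactly the hard part in practice; the `TraceEvent` type and `merge_traces` function are hypothetical names for illustration.

```python
import heapq
from typing import Iterable, List, NamedTuple


class TraceEvent(NamedTuple):
    timestamp: int  # cycles on an ASSUMED counter shared by all cores
    core: int
    message: str


def merge_traces(*per_core: Iterable[TraceEvent]) -> List[TraceEvent]:
    """Merge per-core traces into one timeline ordered by timestamp.

    Each input trace must already be sorted by timestamp, as a buffer
    filled in program order on a single core would be.
    """
    return list(heapq.merge(*per_core, key=lambda event: event.timestamp))


if __name__ == "__main__":
    core0 = [TraceEvent(100, 0, "write x=1"), TraceEvent(340, 0, "read y")]
    core1 = [TraceEvent(220, 1, "write y=2"), TraceEvent(300, 1, "read x")]
    for event in merge_traces(core0, core1):
        print(event.timestamp, "core", event.core, event.message)
```

Even with a clean merged timeline like this, someone still has to read it and work out which interleaving caused the bug, which is why this approach gets painful as the core count grows.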

However, there was a light at the end of the tunnel: several other vendors, ourselves included, presented languages and platforms supporting programming models that simply avoid parallelism-specific bugs by construction, effectively making them impossible. This approach means that long and possibly indeterminate debug cycles can be avoided. In fact, the conference opened with an analyst from Venture Development Corporation (VDC) presenting the results of a survey that showed that better multicore programming models were at the top of the list of priorities for embedded application developers.
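As a generic illustration of the "by construction" idea, and not a depiction of any vendor's platform, consider a data-parallel map over a pure function. Because each call depends only on its own argument and writes to no shared state, there is nothing to race on, and the result matches the sequential version regardless of scheduling. The names `scale_sample` and `parallel_map` are made up for this sketch, and Python itself does not enforce purity; a platform of the kind described above would enforce such restrictions in the language or runtime.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def scale_sample(sample: int) -> int:
    """Pure function: the result depends only on the argument,
    and no shared state is read or written."""
    return sample * sample + 1


def parallel_map(func: Callable[[int], int],
                 items: Iterable[int],
                 workers: int = 4) -> List[int]:
    """Apply func to every item on a thread pool.

    Executor.map returns results in input order, so the output is the
    same as a sequential map no matter how the work is scheduled.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


if __name__ == "__main__":
    data = list(range(8))
    # The parallel result equals the sequential one by construction.
    print(parallel_map(scale_sample, data) == [scale_sample(x) for x in data])
```

The restriction is the point: by giving up arbitrary shared mutable state, the programmer gets deterministic results without having to debug interleavings.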
