Meaningful Benchmarks

Lately, I’ve been thinking about apples and oranges, as in the “comparing of.” In other words, benchmarks. The RapidMind platform enables the development of high-performance software, so we often need to quantify the performance improvements our technology makes possible. However, a benchmark is only meaningful if it is set up with care. Here I want to discuss a few of the issues involved and our philosophy for setting up good benchmarks.

Benchmarks involve comparing the performance of (at least) two things. For example, we may run the same program on two different processors to compare the performance of these processors, or we may compare two different implementations of the same algorithm on the same processor to compare implementation strategies, or we may compare the performance of two different algorithms for solving the same problem.

A benchmark is an experiment. To get meaningful results from an experiment, we should, whenever possible, vary one thing at a time and control the other variables. Unfortunately, this is not always possible. When comparing a serial implementation to a parallel implementation, we may have to change the algorithm, since not all serial algorithms are parallelizable. When moving between processors, an implementation of an algorithm that is optimal for one processor may not be optimal for another. Finally, we often have to change many other things as well, such as the operating system or the compiler.

Since it may be necessary to change multiple variables when moving from one implementation to another, we have to compare the best possible performance available on each side of a benchmark. Comparing the peak performance of tuned implementations is more reliable than comparing untuned implementations.
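As a rough illustration of what comparing two sides of a benchmark looks like in practice, here is a minimal Python sketch of a timing harness. It is not RapidMind code: the names best_time and speedup and the two toy implementations are hypothetical, chosen only for this example. It checks that both implementations compute the same result, takes the fastest of several runs for each side, and reports the ratio.

```python
import time

def best_time(fn, *args, repeats=5):
    """Return the fastest wall-clock time (in seconds) over several runs of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)

def speedup(baseline_seconds, candidate_seconds):
    """Speedup of the candidate relative to the baseline (> 1 means faster)."""
    return baseline_seconds / candidate_seconds

# Two hypothetical implementations of the same computation.
def sum_squares_loop(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def sum_squares_builtin(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    n = 200_000
    # Both sides must solve the same problem before timing means anything.
    assert sum_squares_loop(n) == sum_squares_builtin(n)
    t_base = best_time(sum_squares_loop, n)
    t_cand = best_time(sum_squares_builtin, n)
    print(f"speedup: {speedup(t_base, t_cand):.2f}x")
```

Taking the minimum over repeated runs is one common way to reduce noise from other activity on the machine. Depending on the workload, a median or a full distribution may be a better summary.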

We use two strategies to achieve peak performance on either side of a benchmark when comparing a RapidMind implementation to a non-RapidMind implementation.

First, on the RapidMind side we can use autotuning. RapidMind implementations of algorithms can be parameterized, and an automatic tuning process can then choose the parameter values that give the best-performing implementation of that algorithm on a given processor. This is one of the ways that the RapidMind platform supports portable high-performance implementations of algorithms.
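The general idea behind autotuning can be sketched in a few lines of Python. This is an illustration of the technique only, not the RapidMind autotuner or its API: the autotune function, the blocked_sum kernel, and the block_size parameter are all hypothetical. The sketch runs a parameterized kernel once per candidate value, times each, and keeps the fastest.

```python
import time

def blocked_sum(data, block_size):
    """Sum a list in chunks of block_size; the result does not depend on block_size."""
    total = 0
    for start in range(0, len(data), block_size):
        total += sum(data[start:start + block_size])
    return total

def autotune(kernel, data, candidates, repeats=3):
    """Time kernel(data, value) for each candidate and return (best_value, timings)."""
    timings = {}
    for value in candidates:
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(data, value)
            runs.append(time.perf_counter() - start)
        timings[value] = min(runs)  # fastest run for this parameter value
    best = min(timings, key=timings.get)
    return best, timings

if __name__ == "__main__":
    data = list(range(100_000))
    best, timings = autotune(blocked_sum, data, [16, 256, 4096])
    print(f"fastest block size on this machine: {best}")
```

The chosen value will differ from machine to machine, which is the point: the same parameterized source is retuned for each target rather than rewritten.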

Second, we need to seek an independently developed and tuned baseline for any benchmark. For example, in 2006 we did a financial benchmark in cooperation with HP’s High Performance Computing Division (HPCD). We both started from the same unoptimized baseline code. Then we wrote a RapidMind implementation, and they tuned the baseline. In the end, after tuning and using appropriate compiler flags with Intel’s icc (an excellent optimizing compiler), the researchers at HPCD achieved a speedup of more than 3x over the original, unoptimized baseline code. However, the RapidMind implementation was over 32x faster when we ran it on an NVIDIA 7800 GPU. Recently, we added x86 CPU support to RapidMind, and we find that we still get a 17x speedup over this tuned (single-core) baseline when running on a dual Intel quad-core machine. In other words, RapidMind’s implementation is more than 2x as fast per core on the same processor. In addition, the RapidMind implementation is now over 140x faster than the baseline code when running on newer NVIDIA G80 GPUs.
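The per-core figure follows from simple arithmetic, shown here as a short Python sketch. It assumes that the dual quad-core machine has 2 x 4 = 8 cores, that the RapidMind run used all of them, and that the tuned baseline used exactly one. The function name per_core_speedup is made up for this example.

```python
def per_core_speedup(total_speedup, cores):
    """Speedup attributable to each core, assuming the baseline ran on one core."""
    return total_speedup / cores

cores = 2 * 4  # dual quad-core machine, assumed fully used
print(per_core_speedup(17, cores))  # prints 2.125
```

Dividing by the core count separates the gain from using more cores from the gain in how well each core is used.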

Since then, we have done a number of other benchmarks, using baselines that were either independently developed (for example, in public libraries) or tuned for us, at our request, by independent partners. Because we use tuned baselines, we can be confident that the speedups our benchmarks report are realistic and, in fact, conservative.

Of course, benchmarks do not tell the entire story, since productivity and portability are also important considerations. However, if performance is the primary goal of moving to multi-core processors and accelerators, then we have to be able to measure it properly.
