Why the Future isn’t Flat
For millennia people thought the earth was flat. This was a convenient illusion, and sufficient at small scales. However, at larger scales it is necessary to abandon this illusion, and professional sailors and pilots have to use spherical geometry for long-range navigation, or risk getting lost.
Similar convenient illusions exist in programming. The two most important ones are serial execution and constant memory access time. Of course multi-core processors shatter the serial execution illusion, but in this post I’d like to focus on the second illusion: constant-time memory access. In addition to parallelism, professional software developers have to face the reality that scalable memory systems have variable access times that depend strongly on locality and isolation. To build successful, scalable software, the mental models of computation used by software developers have to grow to include this concept.
The constant-access-time “flat” memory illusion is as important as the serial illusion. In a flat memory, every location in memory can be accessed with exactly the same cost, independent of the order of access. For decades, this simple cost model has been used as an implicit assumption in the design of computer programs. Even theoretical computer science, which analyzes the best-case asymptotic complexity of algorithms for solving various abstract problems, is primarily based on this illusion. Unfortunately, it is just an illusion, although computer architects have been able to develop many clever mechanisms to maintain it. However, the latency of accessing a random word in external main memory (DRAM) is quite slow compared to processor speed, by two orders of magnitude or more. A computer using a memory system consisting only of DRAM would be intolerably slow, so modern machines instead have a memory hierarchy, where copies of certain parts of the memory space are kept in faster, smaller cache memories. If the most frequently accessed data can be kept in the fastest cache memories, then on average a low access cost can be achieved. The memory hierarchy exploits the fact that typical programs exhibit spatial and temporal locality. This mechanism has been reasonably successful at maintaining the flat memory illusion in serial computers. However, even in serial computers significant performance gains are possible by designing programs using a more realistic memory model. For example, it is worthwhile to use data structures and algorithms that are designed to have high levels of spatial and temporal locality.
In parallel computers, including those based on multi-core processors, the situation becomes much more complex. Memory systems for parallel computers are often divided into two types: shared memory and distributed memory. In reality, though, shared memory is yet another convenient illusion. Hardware is distributed by nature, and scalable shared memory has to be simulated by the computer architect on top of a physically distributed memory. A naive implementation of shared memory based on a single shared physical resource simply cannot scale beyond a small number of cores. Instead, memory resources need to be partitioned into banks and access times will vary depending on how “far away” a memory bank is from a core and how many other cores are trying to access it at the same time. This is usually called NUMA, or non-uniform memory access. In addition, hierarchical memory systems that maintain copies of data in different places (for example, in multiple caches) now have to ensure that those copies remain consistent. The coherency protocols for maintaining this consistency require significant interprocessor communication.
Memory systems are complex, but the net effect of these considerations is that in order to scale, programs have to be written with high degrees of data locality and data isolation. The more local data can be reused, the better the memory hierarchy can be exploited, and the less chance there is for conflicts when accessing a shared off-chip resource. Likewise, when computations can isolate their data, then unnecessary efforts to maintain consistency can be avoided. Data isolation and data locality also make it easier to parallelize computations, since then they can be reordered without conflict.
Unfortunately, many parallel programming models focus on the computations, without considering the necessary interactions of these computations with data. In fact, to a first approximation, scalable parallel programs should be designed around their flows of data (locality, isolation, and dependencies), not around their computations. Compared to the cost of data movement and the scalability problems of poor data isolation, actual computation is practically free.
The world is not flat. It would be more convenient if it was, but it’s not. The illusion of constant-time memory access needs to be replaced with more accurate conceptual models in programmers’ minds. Fortunately, it is not necessary (or productive) for a programmer to deal with all the low-level details of the hardware. However, a conceptual model that includes data locality and data isolation is essential for getting the best out of today’s processors and those of the future.

April 30th, 2008 at 5:05 pm
Interesting post. Looking forward to other similar posts. I’ve already bookmarked the page ;)