<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>The Official RapidMind Blog</title>
	<atom:link href="http://blogs.rapidmind.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blogs.rapidmind.com</link>
	<description>The vision of parallel programming for multi-core architectures.</description>
	<pubDate>Tue, 15 Jul 2008 20:51:07 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5</generator>
	<language>en</language>
			<item>
		<title>How we get improved performance on a single core - Part 3</title>
		<link>http://blogs.rapidmind.com/2008/07/15/how-we-get-improved-performance-on-a-single-core-part-3/</link>
		<comments>http://blogs.rapidmind.com/2008/07/15/how-we-get-improved-performance-on-a-single-core-part-3/#comments</comments>
		<pubDate>Tue, 15 Jul 2008 20:51:07 +0000</pubDate>
		<dc:creator>Stefanus Du Toit</dc:creator>
		
		<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://blogs.rapidmind.com/?p=34</guid>
		<description><![CDATA[This is the third, and last, post in a series of posts about how we can achieve improved performance over regular C++ code even when running on a single core. I&#8217;ve talked about our programming model and runtime program generation previously. In this post, I&#8217;ll discuss our runtime code generation mechanism.
As mentioned in my last post, RapidMind [...]]]></description>
			<content:encoded><![CDATA[<p>This is the third, and last, post in a series of posts about how we can achieve improved performance over regular C++ code even when running on a single core. I&#8217;ve talked about <a href="http://blogs.rapidmind.com/2008/05/15/how-we-get-improved-performance-on-a-single-core-part-1/">our programming model</a> and <a href="http://blogs.rapidmind.com/2008/05/27/how-we-get-improved-performance-on-a-single-core-part-2/">runtime program generation</a> previously. In this post, I&#8217;ll discuss our runtime <em>code generation</em> mechanism.</p>
<p>As mentioned in my last post, RapidMind generates machine code at runtime. This is similar to just-in-time compilation, but is done at very specific (and controllable) points in an application&#8217;s lifetime – typically during application initialization. The responsibility of generating machine code for a specific hardware target belongs to RapidMind&#8217;s <em>backends</em>. Each backend includes code generation support for any targets it supports. For example, the OpenGL backend for GPUs generates OpenGL shading language programs corresponding to a user&#8217;s computations. Backends like the x86 and <a href="http://en.wikipedia.org/wiki/Cell_%28microprocessor%29" target="_blank">Cell</a> backends generate machine code for those architectures using a custom code generation stack, including a backend optimizer, scheduler, register allocator, etc.</p>
<p><span id="more-34"></span>The x86 backend is particularly interesting. Our x86 backend targets x86-based CPUs such as those from AMD and Intel. The <a href="http://en.wikipedia.org/wiki/X86_architecture" target="_blank">x86 instruction set</a> is very old, and modern x86 processors act more as a translation layer from x86 instructions to internal operations specific to each processor design (its <em>microarchitecture</em>) than as a direct implementation of the x86 instructions. As a result, one x86 processor is not like another: microarchitectural differences between vendors, and even between processor generations from the same vendor, are vast. This means that a single x86 binary compiled with just one particular microarchitecture in mind may not perform optimally on another microarchitecture. Even though modern CPUs have features like out-of-order scheduling that help &#8220;generic&#8221; code execute well, we&#8217;ve found there is often still plenty of performance to be obtained by performing microarchitecture-specific optimizations.</p>
<p>This problem (or opportunity?) is further compounded by extensions to the x86 instruction set. Starting with extensions like MMX and 3DNow!, processor vendors have been providing new instructions that are only implemented in newer hardware. These instruction set extensions are generally aimed at accelerating particular types of computations, e.g. by providing vector operations that can compute multiple instances of the same operation at once. Today the prevalent family of extensions is the &#8220;SSE&#8221; (<a href="http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions" target="_blank">Streaming SIMD Extensions</a>) family. Many different SSE extensions are implemented in hardware shipping today: SSE, SSE2, SSE3, SSSE3, SSE4a, SSE4.1, and SSE4.2. By 2010 processors will be shipping with support for SSE5 (from AMD) and AVX (a new instruction set extension by Intel). These extensions provide a lot of opportunity for improving performance, but code generated for a particular extension will not run on older processors that do not support it. Traditional software development thus has to either use some lowest common denominator (e.g. SSE2, which is supported in most processors shipping over the last 4 years or so) or provide many different binaries of the same code and pick one at runtime. These compatibility issues have really hampered adoption of these extensions.</p>
<p>Both of these issues - microarchitectural differences and new instruction sets - are addressed by our backend and code generation design. Since our platform generates code for performance-critical pieces of applications at runtime, we can check to see <em>exactly</em> which CPU we are running on, and generate code optimized for that CPU. Even though we only require SSE2 support to run, we will generate code that makes use of other SSE extensions if they&#8217;re available. We schedule instructions very differently on AMD processors than we do on Intel processors, because of differences in how these processors execute code. Taking advantage of the specific microarchitecture we&#8217;re on can yield anywhere from a 10% improvement to a doubling in performance!</p>
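<p>To make the idea of checking the CPU once and then choosing code for it more concrete, here is a minimal C++ sketch. It is not RapidMind's implementation: the names <strong>add_baseline</strong>, <strong>add_unrolled4</strong>, <strong>Selected</strong>, <strong>select_add_kernel</strong> and <strong>add_arrays</strong> are hypothetical, and a real backend would generate machine code at runtime rather than choose between precompiled functions. The feature test assumes the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports on an x86 target; anywhere else the sketch simply falls back to the baseline.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical kernel signature: element-wise add of two float arrays.
using AddKernel = void (*)(const float*, const float*, float*, std::size_t);

// Baseline variant: stands in for code that needs only the minimum feature set.
void add_baseline(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// Stand-in for a variant tuned for newer hardware (here just unrolled by four).
void add_unrolled4(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; ++i) c[i] = a[i] + b[i];  // leftover elements
}

struct Selected {
    AddKernel fn;
    std::string name;
};

// Query the CPU once (for example during application initialization)
// and pick the variant that matches what the hardware supports.
Selected select_add_kernel() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.1")) return {add_unrolled4, "sse4.1"};
#endif
    return {add_baseline, "baseline"};
}

// Run whichever variant was selected.
std::vector<float> add_arrays(const Selected& k, const std::vector<float>& a,
                              const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    k.fn(a.data(), b.data(), c.data(), c.size());
    return c;
}
```

<p>An application would call select_add_kernel once at startup and reuse the result for every later call to add_arrays, so the cost of the hardware check is paid a single time.</p>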
<p>We use the same mechanism to optimize generated code based on other factors known at runtime, such as the alignment of arrays in memory or knowledge of certain values being constant, but unknown until the application is actually running (e.g. constants read from a data file during application initialization). Unlike a traditional just-in-time (JIT) compiler, our code generation happens at very specific, predictable, and controllable times. We never interpret code, and we don&#8217;t have to profile at runtime to find hot spots. Portions of an application not expressed with RapidMind, such as UI or data-handling code, are not subject to this mechanism. The overhead of this runtime work is therefore minimal, and it pays for itself almost immediately.</p>
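<p>The following C++ sketch shows, in a very reduced form, what specializing on runtime-only knowledge can look like. It is an illustration under stated assumptions, not our code generator: <strong>make_scale_kernel</strong> and <strong>is_aligned16</strong> are hypothetical helpers, and the specialization is done by picking a closure rather than by emitting machine code.</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>

using ScaleKernel = std::function<void(const float*, float*, std::size_t)>;

// Build a kernel once a runtime constant is known, for example after
// `scale` has been read from a data file during initialization.
ScaleKernel make_scale_kernel(float scale) {
    assert(std::isfinite(scale));
    if (scale == 1.0f) {
        // Multiplying by one changes nothing, so this variant only copies.
        return [](const float* in, float* out, std::size_t n) {
            for (std::size_t i = 0; i < n; ++i) out[i] = in[i];
        };
    }
    // General variant: the constant is captured in the closure.
    return [scale](const float* in, float* out, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * scale;
    };
}

// True when p sits on a 16-byte boundary, which is what aligned SSE
// loads and stores require. A code generator could test this once and
// choose aligned or unaligned memory instructions accordingly.
bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

<p>The decision is taken when the kernel is built, not inside the inner loop, which mirrors the point above: the work happens at a known time and the loop that runs afterwards carries no extra checks.</p>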
<p>This concludes my set of articles on why RapidMind can often deliver not only improved scalability across cores, but also better per-core performance than code written without RapidMind. I hope you found it interesting! As always, feel free to leave comments if you have any further questions, and I&#8217;ll do my best to answer them.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.rapidmind.com/2008/07/15/how-we-get-improved-performance-on-a-single-core-part-3/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Teraflops, Petaflops, and Turning Hours into Minutes, and Minutes into Seconds</title>
		<link>http://blogs.rapidmind.com/2008/06/19/teraflops-petaflops-and-turning-hours-into-minutes-and-minutes-into-seconds/</link>
		<comments>http://blogs.rapidmind.com/2008/06/19/teraflops-petaflops-and-turning-hours-into-minutes-and-minutes-into-seconds/#comments</comments>
		<pubDate>Thu, 19 Jun 2008 18:57:13 +0000</pubDate>
		<dc:creator>Dr. Michael McCool</dc:creator>
		
		<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://blogs.rapidmind.com/?p=33</guid>
		<description><![CDATA[It&#8217;s been a busy couple of weeks. First, we just came back from SIFMA where we demonstrated an approximately 55x speedup on an important (and non-trivial) financial option pricing algorithm, using AMD hardware. (Since we’re using RapidMind, we can also run the same code on all our other hardware targets.) This coincided with an announcement [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been a busy couple of weeks. First, we just came back from SIFMA where we demonstrated an approximately 55x speedup on an important (and non-trivial) financial option pricing algorithm, using AMD hardware. (Since we’re using RapidMind, we can also run the same code on all our other hardware targets.) This coincided with an announcement of our support for AMD&#8217;s FireStream product line.</p>
<p>Second, the ISC conference (<a href="http://www.supercomp.de/isc08/content/">http://www.supercomp.de/isc08/content/</a>) is going on right now and there have been a number of new hardware announcements. AMD announced a new FireStream card (FireStream 9250), and NVIDIA also announced a new line of GPUs (Tesla 10P). Both of these products are capable of teraflop performance, which is of course great news for people using RapidMind. At the other end of the scale, the largest Cell BE installation in the world, the LANL Roadrunner (<a href="http://www.lanl.gov/roadrunner/">http://www.lanl.gov/roadrunner/</a>), broke the petaflop barrier using an actual benchmark, Linpack. Since RapidMind also targets the Cell BE, this is also good news, as it demonstrates clearly the power of this architecture and its ability to scale in large installations.</p>
<p>Getting back to what we did at SIFMA, we demonstrated a 55x speedup on something called a binomial option pricer. I will talk about this at greater length in an upcoming article, but will mention some interesting points here. First, the binomial pricer, unlike our previous results on Monte-Carlo pricers, is very memory-intensive. It&#8217;s similar to iterated convolution and explicit PDE solvers. As we also demonstrated great scalability on Barcelona processors and on the FireStream, this application shows how RapidMind can be used to tackle memory-bound applications as well as compute-bound applications such as Monte Carlo. Second, option pricing is considered a “fundamental primitive” in computational finance. In particular, risk evaluation requires a large number of such evaluations and is an important workload used in day-to-day practice by many financial institutions. The speedup factor we have demonstrated has the potential to reduce such calculations from hours to minutes. As in the other application areas that we target, this can potentially transform the workflow practices where these computations are used. If a computation takes hours, you basically have to run it in batch mode, possibly overnight. If it takes minutes, you can run it many times during the day, as part of an interactive workflow, and use up-to-date inputs, enabling completely new ways to do business. This kind of transformation can create incredible new opportunities for our clients. And as demonstrated by the recent hardware announcements I&#8217;ve noted above, multi-core and in particular heterogeneous-core processors and their deployments are in a definite growth phase, so even better results can be expected in the near future.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.rapidmind.com/2008/06/19/teraflops-petaflops-and-turning-hours-into-minutes-and-minutes-into-seconds/feed/</wfw:commentRss>
		</item>
		<item>
		<title>30,000 Bees and the Multi-Core Adrenaline Rush</title>
		<link>http://blogs.rapidmind.com/2008/06/17/30000-bees-and-the-multi-core-adrenaline-rush/</link>
		<comments>http://blogs.rapidmind.com/2008/06/17/30000-bees-and-the-multi-core-adrenaline-rush/#comments</comments>
		<pubDate>Tue, 17 Jun 2008 20:36:56 +0000</pubDate>
		<dc:creator>Yvonne Chypchar - Contributing Editor</dc:creator>
		
		<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://blogs.rapidmind.com/?p=27</guid>
		<description><![CDATA[I recently installed 30,000 bees in two hives. About 3 pounds of bees per hive. I wasn&#8217;t stung once.
 

A week later I went back to check the hives&#8211;one colony was a little smaller, not quite 3 pounds, so I wanted to make sure the queen was alive and well and reigning over her brood. [...]]]></description>
			<content:encoded><![CDATA[<p>I recently installed 30,000 bees in two hives. About 3 pounds of bees per hive. I wasn&#8217;t stung once.</p>
<p> </p>
<p class="MsoNormal" style="margin: 0in 0in 0pt; line-height: 15.6pt; text-align: center;"><a href="http://blogs.rapidmind.com/wp-content/uploads/2008/06/beehive1.jpg"><img class="alignnone size-medium wp-image-31" title="beehive1" src="http://blogs.rapidmind.com/wp-content/uploads/2008/06/beehive1-300x225.jpg" alt="" width="152" height="114" /></a></p>
<p>A week later I went back to check the hives&#8211;one colony was a little smaller, not quite 3 pounds, so I wanted to make sure the queen was alive and well and reigning over her brood. I took apart the hive boxes, found the queen in one frame and again, I wasn&#8217;t stung. Such friendly bees.</p>
<p style="text-align: center;"><a href="http://blogs.rapidmind.com/wp-content/uploads/2008/06/bees-on-frame.jpg"><img class="alignnone size-medium wp-image-32" title="bees-on-frame" src="http://blogs.rapidmind.com/wp-content/uploads/2008/06/bees-on-frame-300x225.jpg" alt="" width="139" height="105" /></a></p>
<p>But it’s intimidating when you first open the box and see a bee-covered frame.</p>
<p style="text-align: center;"><img class="alignnone size-medium wp-image-30" title="beekeeper" src="http://blogs.rapidmind.com/wp-content/uploads/2008/06/beekeeper.jpg" alt="" width="143" height="107" /></p>
<p>You take a deep breath and gather your courage because <strong>bees can smell the adrenaline coursing through your veins</strong>. I recently read about an informal study in which some 200 embedded developers explained why they are slow to adopt embedded multi-core technology. (<strong><span style="font-size: 10pt; color: #0000ff;"><a href="http://dataweek.co.za/article.aspx?pklArticleId=5243&amp;pklCategoryId=31" target="_blank">Survey measures readiness to adopt multicore technology</a></span></strong>.) To them, it must feel like 60 pounds of bees have swarmed their desks.</p>
<p>The problem looks overwhelming, but the reigning messages are wrong. We have the <strong>thread-locking</strong> bee, the <strong>single-processor-bias</strong> bee (that one took a lot of royal jelly to rear into a queen), and I&#8217;ve just spotted the familiar <strong>lack-of-determinism</strong> bee. This is a messy hive.</p>
<p><span id="more-27"></span></p>
<h2>A New Hive and a New Reign</h2>
<h5>RapidMind Fixes the Single Processor Bias Problem</h5>
<p>The <strong>RapidMind</strong> parallel programming model is portable to a wide range of parallel hardware architectures, including vector and stream machines, such as GPUs, as well as distributed memory machines, such as the Cell BE. The system provides a strong execution and data abstraction that is simultaneously modular, portable, and efficient.</p>
<p>The RapidMind platform provides a set of backends. Each manages the execution of RapidMind programs on a particular processor. The RapidMind platform manages communication and data flow between the host processor and target device(s). It handles memory transfers and load balancing, leaving you free to focus on high-level programming. The dynamic runtime-compiler and processor support modules compile RapidMind programs optimally for the specific processor in use.</p>
<ul>
<li>The GLSL backend executes RapidMind programs on Graphics Processing Units (GPUs).</li>
<li>The Cell BE backend executes RapidMind programs on the Cell Broadband Engine (Cell BE).</li>
<li>The x86 backend executes RapidMind programs on AMD and Intel processors.</li>
<li>The Debug backend executes RapidMind programs on the host processor, compiling programs with a C++ compiler.</li>
</ul>
<h5>RapidMind Has Advanced Debug Support</h5>
<p>The Debug backend executes RapidMind Programs on the host processor, compiling the RapidMind Programs with a C++ compiler. Debug information is generated for the compiled programs, allowing them to be debugged in a debugger using techniques such as setting breakpoints, inspecting values and stepping through code. This allows the RapidMind Programs to be debugged, line by line, within a debugger or IDE.</p>
<p>The RapidMind Inspector allows you to view how data in a RapidMind-enabled application is modified as the application is executed. (It is an optional package available in RapidMind Multi-Core Platform Tools.) The RapidMind Inspector provides graphical views that present not just the Program values at a given iteration of the Program but also the entire data set bound to those values. This allows you to inspect the contents of an array bound to an input value or view the contents of an array bound to an output value. Moreover, the RapidMind Inspector allows you to control the execution of the Program so that you can watch how the data is modified from one iteration to the next.</p>
<h5>RapidMind Programs Are Deadlock-Free and Deterministic</h5>
<p>The main, underlying problem with multi-threading is non-determinism. Multiple threads running simultaneously do not run in lockstep unless you explicitly synchronize them. However, you want to minimize synchronization because it has a negative impact on performance.</p>
<p>Because these threads do not run in lockstep, and because they can access data structures and devices simultaneously (for example, two threads writing to the same memory location), the result is a very difficult class of bugs to find, reproduce, and solve. Even inserting synchronization constructs does not always make things easier — mistakes in explicit synchronization are what lead to deadlocks, where two threads are waiting for each other and thus never continue.</p>
<p>The exact timing of all of this is determined by many factors, so you can never be sure that your well-tested product won’t break once it ships because the timing has changed ever so slightly. <strong>RapidMind solves this because threading, synchronization and more are handled by the platform such that your application is deadlock-free, race-condition-free, and deterministic. </strong>Programs written with our platform cannot suffer from deadlock, read-write hazards, or synchronization errors. The platform uses a bulk synchronization model that supports a conceptual single thread of control, making debugging straightforward. The structure of the language makes parallelism explicit, however, encouraging the development and use of efficient and scalable parallel algorithms.</p>
<h5>RapidMind Allows for Performance Tuning</h5>
<p>You can trace important performance events by using the <strong>RapidMind</strong> platform performance log. This log contains messages generated whenever the platform notices something relevant to performance, such as the inability to perform a certain optimization, or a feature that is being used inefficiently. Most messages are generated during runtime compilation of the program. However, some important messages (that is, transfers to host memory) are generated when preparing to execute a compiled program. The performance log features several different levels of verbosity. See also “<a href="http://blogs.rapidmind.com/2008/05/15/how-we-get-improved-performance-on-a-single-core-part-1/" target="_blank">How we get improved performance [using RapidMind] on a single core</a>”.</p>
<p>So here are the makings of a fully-functioning hive. The bees know what they&#8217;re supposed to do because the right framework has been set. Your work is done here.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.rapidmind.com/2008/06/17/30000-bees-and-the-multi-core-adrenaline-rush/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Nitrogen Narcosis - Part II: The Serious Drawbacks of Explicit Multi-Threading</title>
		<link>http://blogs.rapidmind.com/2008/06/05/nitrogen-narcosis-part-ii-the-serious-drawbacks-of-explicit-multi-threading/</link>
		<comments>http://blogs.rapidmind.com/2008/06/05/nitrogen-narcosis-part-ii-the-serious-drawbacks-of-explicit-multi-threading/#comments</comments>
		<pubDate>Thu, 05 Jun 2008 19:52:46 +0000</pubDate>
		<dc:creator>Yvonne Chypchar - Contributing Editor</dc:creator>
		
		<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://blogs.rapidmind.com/?p=20</guid>
		<description><![CDATA[In my last posting, I mentioned that explicit multi-threading has serious drawbacks:
·         Multi-threaded applications are more difficult to test than single-threaded applications and are hard to debug.
The main, underlying problem with multi-threading is non-determinism. Multiple threads running simultaneously don&#8217;t run in lockstep unless you explicitly synchronize them.  However, you want to minimize synchronization because it [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="font-size: small; font-family: Calibri;">In my last posting, I mentioned that <em>explicit multi-threading</em> has serious drawbacks:</span></p>
<p class="MsoListParagraph" style="margin: 0in 0in 10pt 0.5in; text-indent: -0.25in; mso-list: l0 level1 lfo1;"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore;"><span style="font-size: small;">·</span><span style="font: 7pt ">         </span></span></span><span style="font-size: small; font-family: Calibri;">Multi-threaded applications are more difficult to test than single-threaded applications and are hard to debug.</span></p>
<p class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="font-size: small; font-family: Calibri;">The main, underlying problem with multi-threading is non-determinism. Multiple threads running simultaneously don&#8217;t run in lockstep unless you explicitly synchronize them.<span style="mso-spacerun: yes;">  </span>However, you want to minimize synchronization because it has a negative impact on performance.</span></p>
<p class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="font-size: small; font-family: Calibri;">Because these threads don&#8217;t run in lockstep, and because they can access data structures and devices simultaneously (for example, two threads writing to the same memory location), the result is a very difficult class of bugs to find, reproduce, and solve.<span style="mso-spacerun: yes;">  </span>Even inserting synchronization constructs doesn&#8217;t always make things easier &#8212; mistakes in explicit synchronization are what lead to deadlocks, where two threads are waiting for each other and thus never continue.</span></p>
<p class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="font-size: small; font-family: Calibri;">The exact timing of all of this is determined by many factors, thus making it impossible to know that when you ship your well-tested product, it&#8217;s not going to break instantly because timing has changed ever so slightly.</span></p>
<p class="MsoNormal" style="margin: 0in 0in 10pt;"><strong><span style="font-size: small;"><span style="font-family: Calibri;">RapidMind solves this because threading, synchronization and more are handled by the platform such that your application is deadlock-free, race condition free, and deterministic!</span></span></strong></p>
<p class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="font-size: small; font-family: Calibri;">Another concern:</span></p>
<p class="MsoListParagraph" style="margin: 0in 0in 10pt 0.5in; text-indent: -0.25in; mso-list: l0 level1 lfo1;"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore;"><span style="font-size: small;">·</span><span style="font: 7pt ">         </span></span></span><span style="font-size: small; font-family: Calibri;">Explicit multi-threading doesn&#8217;t scale well as the number of cores increases. </span></p>
<p class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="font-size: small; font-family: Calibri;">Multi-threading&#8217;s explicit &#8220;threads&#8221; suggest that you use <a title="task parallelism" href="http://en.wikipedia.org/wiki/Task_parallelism" target="_blank">task parallelism</a> as a model. But task parallelism doesn’t scale well: if you have only K threads, where K is some constant, you’re never going to get a speedup beyond K cores.</span></p>
<p class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="font-size: small;"><span style="font-family: Calibri;">The threading model is built around the concept of a task, where every task has a separate sequence of control. RapidMind does not use task parallelism but uses <strong>data parallelism.</strong> Data parallelism is based on the fact that applications often operate on collections of data, and units of work are often associated with separate elements of such collections. <strong>The RapidMind platform uses an SPMD (Single Program, Multiple Data) stream programming model. Using an SPMD data-parallel model allows you to work with familiar concepts like functions and arrays while also expressing parallel algorithms directly.</strong></span></span></p>
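<p class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="font-size: small; font-family: Calibri;">A small C++ sketch may help show why data parallelism scales with the data rather than with a fixed number of tasks. This is plain C++, not the RapidMind API, and the names <strong>Range</strong>, <strong>split</strong> and <strong>square_all</strong> are made up for illustration. To keep the sketch free of threading code, the ranges run one after another here; the point is only that they are independent, so each could be handed to a separate core.</span></p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Range {
    std::size_t begin;
    std::size_t end;  // one past the last element
};

// Split [0, n) into at most `workers` contiguous, non-overlapping ranges.
// The split depends only on the amount of data and the number of workers,
// so it adapts to however many cores happen to be available.
std::vector<Range> split(std::size_t n, std::size_t workers) {
    assert(workers > 0);
    std::vector<Range> ranges;
    std::size_t chunk = (n + workers - 1) / workers;  // ceiling division
    for (std::size_t b = 0; b < n; b += chunk)
        ranges.push_back({b, std::min(n, b + chunk)});
    return ranges;
}

// Single program, multiple data: the same operation is applied to every
// element, and out[i] depends only on in[i].
std::vector<float> square_all(const std::vector<float>& in, std::size_t workers) {
    std::vector<float> out(in.size());
    for (const Range& r : split(in.size(), workers)) {
        // Each range writes a disjoint slice of `out`, so no range can
        // interfere with another, whatever order they run in.
        for (std::size_t i = r.begin; i < r.end; ++i) out[i] = in[i] * in[i];
    }
    return out;
}
```

<p class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="font-size: small; font-family: Calibri;">Calling square_all with 2 workers or with 8 gives the same result, because the answer never depends on how the elements were grouped. That independence is what lets the work spread over more cores as they become available.</span></p>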
<p class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="font-size: small; font-family: Calibri;">And a final issue:</span></p>
<p class="MsoListParagraph" style="margin: 0in 0in 10pt 0.5in; text-indent: -0.25in; mso-list: l0 level1 lfo1;"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore;"><span style="font-size: small;">·</span><span style="font: 7pt ">         </span></span></span><span style="font-size: small; font-family: Calibri;">Explicit multi-threading can&#8217;t leverage the use of accelerators.</span></p>
<p class="MsoNormal" style="margin: 0in 0in 10pt;"><span style="font-size: small; font-family: Calibri;">Accelerators are parallel machines, and multi-threading is a way to express parallelism in your code. However, the OS threading APIs only target the main CPU cores on which the OS and applications themselves are running. Taking advantage of accelerators requires using further, accelerator-specific, APIs. Furthermore, C, C++, Java, and almost all other programming languages in common use assume a shared-memory programming model, where all memory is equally accessible to all computational devices. This isn&#8217;t the case for GPUs, for example: the GPU has its own separate memory, and data must be explicitly transferred between main memory and GPU memory.</span></p>
<p class="MsoNormal" style="margin: 0in 0in 10pt;"><strong><span style="font-size: small;"><span style="font-family: Calibri;">RapidMind automatically manages synchronization between the host and the accelerator, so that you don&#8217;t have to.</span></span></strong></p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.rapidmind.com/2008/06/05/nitrogen-narcosis-part-ii-the-serious-drawbacks-of-explicit-multi-threading/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Multi-Threading: The Nitrogen Narcosis of Programming</title>
		<link>http://blogs.rapidmind.com/2008/06/05/multi-threading-the-nitrogen-narcosis-of-programming/</link>
		<comments>http://blogs.rapidmind.com/2008/06/05/multi-threading-the-nitrogen-narcosis-of-programming/#comments</comments>
		<pubDate>Thu, 05 Jun 2008 19:52:32 +0000</pubDate>
		<dc:creator>Yvonne Chypchar - Contributing Editor</dc:creator>
		
		<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://blogs.rapidmind.com/?p=19</guid>
		<description><![CDATA[Explicit multi-threading is like the delusional thinking one has diving at a depth of 100 feet with too much nitrogen in the bloodstream. You stumble, you misread your gauges, everything takes longer to do. It can happen even to experienced divers.
The remedy is to ascend and the delusional thinking goes away immediately.

It&#8217;s not the [...]]]></description>
			<content:encoded><![CDATA[<p>Explicit multi-threading is like the delusional thinking one has diving at a depth of 100 feet with too much nitrogen in the bloodstream. You stumble, you misread your gauges, everything takes longer to do. It can happen even to experienced divers.</p>
<p>The remedy is to ascend and the delusional thinking goes away immediately.</p>
<p style="text-align: center;"><a href="http://blogs.rapidmind.com/wp-content/uploads/2008/05/diver1.jpg"><img class="alignnone size-medium wp-image-26" title="diver1" src="http://blogs.rapidmind.com/wp-content/uploads/2008/05/diver1.jpg" alt="" width="150" height="113" /></a></p>
<p>It&#8217;s not the bends, but <strong><a title="nitrogen narcosis" href="http://en.wikipedia.org/wiki/Nitrogen_narcosis" target="_blank">nitrogen narcosis</a> is dangerous and completely avoidable. And so is multi-threading</strong>.</p>
<p>You can manually multi-thread an application to take advantage of multiple cores, but these projects are time-consuming and error-prone. Multi-threaded applications are more difficult to develop and test than single-threaded applications. Even experienced software developers find multi-threading challenging and multi-threaded applications hard to debug. <em>Did you misread your gauge again? Are you really sure you have air left in your tank?</em></p>
<p>Like a nitrogen-inebriated diver, you need help to ascend. Otherwise you&#8217;ll be stuck at the bottom thinking everything is just fine. Meanwhile, the truth is that multi-threading doesn&#8217;t scale well as the number of cores increases. And it can&#8217;t leverage the use of accelerators. More importantly, it doesn&#8217;t address the issue of memory management (a crucial factor in performance).</p>
<p>RapidMind helps you ascend. The RapidMind platform manages memory, does per-core parallelization as well as multi-core parallelization, transparently enables the use of accelerators (if present), and avoids the debugging challenges of multi-threading.</p>
<p>As a developer, you work with a single thread of execution. You write code in standard C++ and use your existing skills, tools and processes. The RapidMind platform then parallelizes the application across multiple cores and manages its execution.<br />
When you ascend, suddenly everything is manageable again.</p>
<p><em>In the next posts, I&#8217;ll discuss some of the specific problems with multi-threading and the advantages of using the RapidMind platform.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.rapidmind.com/2008/06/05/multi-threading-the-nitrogen-narcosis-of-programming/feed/</wfw:commentRss>
		</item>
		<item>
		<title>How we get improved performance on a single core - Part 2</title>
		<link>http://blogs.rapidmind.com/2008/05/27/how-we-get-improved-performance-on-a-single-core-part-2/</link>
		<comments>http://blogs.rapidmind.com/2008/05/27/how-we-get-improved-performance-on-a-single-core-part-2/#comments</comments>
		<pubDate>Tue, 27 May 2008 14:57:11 +0000</pubDate>
		<dc:creator>Stefanus Du Toit</dc:creator>
		
		<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://blogs.rapidmind.com/?p=24</guid>
		<description><![CDATA[In my last post I blogged about the fact that RapidMind-enabled C++ code often gets improved performance even on a single core. I gave one reason for this, our programming model. In this post I&#8217;d like to address another reason: runtime program generation. Read on for part two of this series on single core performance with RapidMind.
RapidMind-enabled [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://blogs.rapidmind.com/2008/05/15/how-we-get-improved-performance-on-a-single-core-part-1/">last post</a> I blogged about the fact that RapidMind-enabled C++ code often gets improved performance even on a single core. I gave one reason for this, our programming model. In this post I&#8217;d like to address another reason: <strong>runtime program generation</strong>. Read on for part two of this series on single core performance with RapidMind.</p>
<p><span id="more-24"></span>RapidMind-enabled C++ code is still C++ code. To be precise, RapidMind-enabled code is C++ code which, at runtime, builds RapidMind programs. Let me give you an example. With RapidMind, you might write the following (extremely trivial) program:</p>
<pre><span style="color: #333399;"><strong>Program</strong></span> my_add = <span style="color: #333399;"><strong>RM_BEGIN</strong></span> {
<span style="color: #333399;"><strong>In</strong></span>&lt;<span style="color: #333399;"><strong>Value1f</strong></span>&gt; a, b;
<span style="color: #333399;"><strong>Out</strong></span>&lt;<span style="color: #333399;"><strong>Value1f</strong></span>&gt; c;
c = a + b; 
} <span style="color: #333399;"><strong>RM_END</strong></span>;</pre>
<p>In case you&#8217;ve never seen RapidMind code before, this snippet declares a RapidMind program (like a C++ function, but more powerful) called &#8220;my_add&#8221;. This program is defined to take two inputs (<strong>a</strong> and <strong>b</strong>), and produce a single output (<strong>c</strong>), all of which are scalar floating-point values. It produces <strong>c</strong> by adding together the values in <strong>a</strong> and <strong>b</strong>, and that&#8217;s it. This is all valid C++ code – all the words you see in blue are types or macros provided by the RapidMind libraries and headers. Later on, you can take <strong>my_add</strong> and apply it to arrays of numbers, and RapidMind will generate machine code and manage its execution (in parallel!) for you.</p>
<p>Those lines of C++ code compile (through your C++ compiler) into a dialogue between your application and RapidMind. If we imagined this being played out on a stage, it might look something like this:</p>
<pre>Program my_add</pre>
<p><strong>Application:</strong> Hey, RapidMind! I want to make a new program! I want to call it <strong>my_add</strong>.<br />
<strong>RapidMind:</strong> OK. Gotcha. Here you go, here&#8217;s your new program. It&#8217;s empty right now.</p>
<pre>my_add = RM_BEGIN</pre>
<p><strong>Application:</strong> And hey, I&#8217;d like to define a new body of code, and store it in <strong>my_add</strong>. OK?<br />
<strong>RapidMind:</strong> Yup. Let me just hang on to a copy of this <strong>my_add</strong> program, so I know where to put anything you might want to define next.</p>
<pre>In&lt;Value1f&gt; a, b;</pre>
<div>
<p><strong>Application:</strong> That program you&#8217;re building up for me? I&#8217;d like to add two inputs to it, please. Call them <strong>a</strong> and <strong>b</strong>.<br />
<strong>RapidMind:</strong> Ah, good thing I&#8217;m holding onto this <strong>my_add</strong> program for you. I&#8217;ll just add those two inputs to it. Here they are, do with them as you please.</p>
<div>
<pre>Out&lt;Value1f&gt; c;</pre>
<p><strong>Application:</strong> Also, I&#8217;d like to add an output to that program I&#8217;m building. Call it <strong>c</strong>.<br />
<strong>RapidMind:</strong> No worries. Let me just mark that down. One output for <strong>my_add</strong>, it&#8217;s called <strong>c</strong>. Here you go!</p>
<pre>c = a + b; </pre>
<p><strong>Application:</strong> OK, now, remember those inputs from earlier, <strong>a</strong> and <strong>b</strong>? At this point in the program, I&#8217;d like you to remember to add them together, and store the result in <strong>c</strong>.<br />
<strong>RapidMind:</strong> Got it. I&#8217;m still holding on to <strong>my_add</strong>, and I&#8217;ll just append to it that you&#8217;d like to add <strong>a</strong> to <strong>b</strong>, and store the result in <strong>c</strong>.</p>
<pre><span style="color: #333399;">RM_END</span></pre>
<p><strong>Application:</strong> Alright. That&#8217;s it for now. <strong>my_add</strong> is done, and I&#8217;m not going to add anything else to it anymore.<br />
<strong>RapidMind:</strong> You&#8217;re the boss. Let me just polish <strong>my_add</strong> a little for you, and here you go. One completed program. Let me know if you want to do something with it later.</p>
<p> </p>
<p>Yeah, perhaps parallel programs would not make the most interesting of plays. But hopefully the point comes across: when &#8220;a&#8221; and &#8220;b&#8221; are &#8220;added&#8221; in the above example, your application isn&#8217;t telling RapidMind to add them together right then and there. It&#8217;s telling RapidMind to <em>remember</em> to add them together, when you use <strong>my_add</strong> later on. And this dialogue is going on at runtime, when those lines of C++ are executing. If &#8220;a&#8221; and &#8220;b&#8221; were regular C++ <em>floats</em>, your app would be having a similar dialogue with your C++ compiler at <em>compile</em> time, and at runtime, those lines really would mean &#8220;add a to b <em>right now</em>&#8221;.</p>
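<p>To make the &#8220;remember to add&#8221; idea concrete, here is a minimal sketch of how operator overloading in plain C++ can record an addition instead of performing it. This is a toy under stated assumptions, not RapidMind&#8217;s implementation: the names <strong>Recorder</strong>, <strong>Instr</strong>, <strong>Value</strong> and <strong>run</strong> are made up for illustration.</p>

```cpp
#include <cstddef>
#include <vector>

// Toy illustration only: these types are hypothetical and are NOT the RapidMind API.
// One recorded instruction, meaning: slot[dst] = slot[lhs] + slot[rhs].
struct Instr { int dst, lhs, rhs; };

// The Recorder plays the "RapidMind" role in the dialogue: it remembers what to do later.
struct Recorder {
    std::vector<Instr> instrs;
    int next_slot = 0;
    int new_slot() { return next_slot++; }
};

// A Value is only a handle to a slot in the program being recorded.
struct Value {
    Recorder* rec;
    int slot;
};

// operator+ adds nothing right now. It appends an instruction and returns
// a handle to the slot that will hold the result when the program runs.
inline Value operator+(const Value& a, const Value& b) {
    int dst = a.rec->new_slot();
    a.rec->instrs.push_back({dst, a.slot, b.slot});
    return Value{a.rec, dst};
}

// "Later on": replay the recorded instructions on actual numbers.
// The caller supplies values for the input slots; the rest start at zero.
inline float run(const Recorder& rec, std::vector<float> slots, int result_slot) {
    slots.resize(static_cast<std::size_t>(rec.next_slot), 0.0f);
    for (const Instr& in : rec.instrs)
        slots[in.dst] = slots[in.lhs] + slots[in.rhs];
    return slots[result_slot];
}
```

<p>In this sketch, writing <strong>c = a + b</strong> with two <strong>Value</strong> handles only grows the instruction list by one entry; the actual sum is computed when <strong>run</strong> is called with real numbers.</p>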
<p>So, what&#8217;s so special about this? Well, this is a pretty powerful mechanism! By delaying the construction of these programs until runtime, we can start doing some very natural tricks, like the following:</p>
<pre><strong>for</strong> (<strong>int</strong> i = 0; i &lt; 10; ++i) { c = c + a; }</pre>
<p>Think for a moment about this code. It looks pretty plain. &#8220;Add a to c ten times&#8221;. However, what this code is really saying is &#8220;tell RapidMind to remember to add a to c&#8221; ten times in a row. It&#8217;s like the following dialogue:</p>
<p><strong>Application:</strong> Hey, remember <strong>c</strong> and <strong>a</strong>? Could you remember to add them together and update the result in <strong>c</strong>?<br />
<strong>RapidMind:</strong> Sure thing. I&#8217;ll append that to the list of things you&#8217;d like this program to do.<br />
<strong>Application:</strong> Hey, remember <strong>c</strong> and <strong>a</strong>? Could you remember to add them together and update the result in <strong>c</strong>?<br />
<strong>RapidMind:</strong> Uh, yeah, sure thing. I&#8217;ll append that to the list of things you&#8217;d like this program to do.<br />
<strong>Application:</strong> Hey, remember <strong>c</strong> and <strong>a</strong>? Could you remember to add them together and update the result in <strong>c</strong>?<br />
<strong>RapidMind:</strong> Umm OK&#8230; Once again, I&#8217;ll append that to the list of things you&#8217;d like this program to do. Boy, this app is pretty boring!<br />
&#8230;and so forth, 7 more times&#8230; </p>
<p>In compiler terminology, this is called <em>unrolling</em>, and it&#8217;s a pretty handy optimization sometimes to avoid lots of branching that regular loops can cause. Compilers really aren&#8217;t always very good at unrolling for you, because they often can&#8217;t tell if it&#8217;s safe to do so. Of course you can still write regular (non-unrolled) loops in RapidMind too, but you can just as easily write unrolled loops. If you&#8217;ve tried to unroll loops by hand in C or C++ code, you know that it usually leads to a lot of code repetition, ugly C macros, external scripts, or something along those lines.</p>
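<p>The effect can be sketched in a few lines of ordinary C++. Again, this is a toy with made-up names (<strong>Op</strong>, <strong>Program</strong>, <strong>build_unrolled</strong>, <strong>run</strong>), not the RapidMind API: the point is only that the <strong>for</strong> loop runs while the program is being built, so what gets recorded is a straight-line list of additions.</p>

```cpp
#include <vector>

// Toy illustration only: hypothetical types, NOT the RapidMind API.
enum class Op { AddAToC };  // the only instruction this toy knows: c = c + a

struct Program {
    std::vector<Op> body;   // straight-line list, no loop counter and no branch on it
};

// This loop executes while the program is being *built*.
// Each iteration appends one more "c = c + a" to the recorded body.
inline Program build_unrolled(int times) {
    Program p;
    for (int i = 0; i < times; ++i)
        p.body.push_back(Op::AddAToC);
    return p;
}

// Replay the recording. The counter i is gone by now; only the sequence
// of additions it left behind remains. (A real backend would emit machine
// code for that sequence; this toy just interprets the list.)
inline float run(const Program& p, float a, float c) {
    for (Op op : p.body)
        if (op == Op::AddAToC) c = c + a;
    return c;
}
```

<p>Calling <strong>build_unrolled(10)</strong> yields a body of ten recorded additions, which is the unrolled form of the loop.</p>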
<p>This is just one example of how runtime program generation can be used to implement some optimization. You can also do things like generate different specialized versions of a program for different tasks, optimize a computation for a particular piece of data you only know at runtime, try out different ways of implementing a program and choose between them at runtime (at no cost!), etc. The list goes on and on. There&#8217;s also some inherent benefit to doing this – for example, modularity constructs like classes in C++ don&#8217;t end up taking any overhead in your computations, but can still be used to structure your code cleanly.</p>
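<p>As an illustration of specializing for data known only at runtime, the sketch below builds a fixed recipe for raising a number to a power once the exponent is known. It uses the standard square-and-multiply technique; the names <strong>Step</strong>, <strong>specialize_power</strong> and <strong>run_power</strong> are hypothetical and have nothing to do with the RapidMind API.</p>

```cpp
#include <vector>

// Toy illustration only: hypothetical types, NOT the RapidMind API.
enum class Step { Square, MultiplyByX };

// Build a straight-line recipe for x^n, for an exponent n >= 1 that is
// only known at runtime. All decisions about n are made here, once.
inline std::vector<Step> specialize_power(unsigned n) {
    std::vector<Step> steps;
    // Find the highest set bit of n.
    unsigned bit = 1;
    while ((bit << 1) != 0 && (bit << 1) <= n) bit <<= 1;
    // Walk the remaining bits from high to low (square-and-multiply).
    for (bit >>= 1; bit != 0; bit >>= 1) {
        steps.push_back(Step::Square);
        if (n & bit) steps.push_back(Step::MultiplyByX);
    }
    return steps;
}

// Run the specialized recipe: no tests on n remain, only the recorded steps.
inline double run_power(const std::vector<Step>& steps, double x) {
    double r = x;
    for (Step s : steps)
        r = (s == Step::Square) ? r * r : r * x;
    return r;
}
```

<p>For an exponent of 5, the recipe is square, square, multiply: three steps instead of four plain multiplications, and the choice was made once, when the recipe was built.</p>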
<p>These kinds of things are particularly useful when you start tuning parameters, which brings me back to the high-level topic of this post: how C++ code can get better performance on a single core with RapidMind than without it. We often implement algorithms that have all sorts of &#8220;tuning knobs.&#8221; When we do so, we typically write them in a very generic way – we don&#8217;t fix any of these knobs, we just make them available in the code. Then we go and tune away (by hand or with tools we&#8217;ve developed for this purpose). At the end, we end up with some specific parameter settings that give <strong>optimal performance, without sacrificing flexibility or code quality</strong>!</p>
<p>Thus, runtime program generation is a very powerful tool to have at your disposal when it comes to writing high performance code. And it&#8217;s just one of the neat tools you get when you use RapidMind to express your applications!</p>
<p>I&#8217;ll be discussing another, related, reason for RapidMind&#8217;s high performance even without multiple cores in my next post. Stay tuned!</p>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blogs.rapidmind.com/2008/05/27/how-we-get-improved-performance-on-a-single-core-part-2/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Required Reading</title>
		<link>http://blogs.rapidmind.com/2008/05/24/required-reading/</link>
		<comments>http://blogs.rapidmind.com/2008/05/24/required-reading/#comments</comments>
		<pubDate>Sat, 24 May 2008 18:42:33 +0000</pubDate>
		<dc:creator>Dr. Michael McCool</dc:creator>
		
		<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://blogs.rapidmind.com/?p=23</guid>
		<description><![CDATA[I frequently get asked to recommend articles on multi-core/many-core software development, GPU architectures, the Cell BE, and parallel programming models. Conveniently, special issues of three major technical publications have just appeared covering these very topics. The March/April ACM Queue discusses the use of GPUs as general-purpose computational engines; the May Proceedings of the IEEE covers [...]]]></description>
			<content:encoded><![CDATA[<p>I frequently get asked to recommend articles on multi-core/many-core software development, GPU architectures, the Cell BE, and parallel programming models. Conveniently, special issues of three major technical publications have just appeared covering these very topics. The March/April ACM Queue discusses the use of GPUs as general-purpose computational engines; the May Proceedings of the IEEE covers commodity multi-core and many-core processors and programming models; and finally, the April IEEE Computer targets “data-intensive computing” and includes a couple of interesting articles that discuss high-performance multi-core processing and also discuss the use of the Cell BE and GPUs for specific applications in database search and pattern analysis.</p>
<p>Of course if you were <em>really</em> serious about the topic of multi-core and many-core development you would read these issues cover to cover, but in this post I’m going to comment on (and recommend) a subset of these articles.<br />
<span id="more-23"></span><br />
The March/April ACM Queue special issue is specifically about using GPUs for general-purpose computation (something we&#8217;ve been doing since at least 1999).  The whole issue has excellent coverage of this topic, but I would especially recommend the article entitled “GPUs: A closer look” by Kayvon Fatahalian and Mike Houston.  Basically, this article describes the architecture of contemporary GPUs in detail.  GPUs have a fairly complex architecture with multiple levels of hardware parallelism, and include multiple cores, massive multithreading, and SIMD tiling, as well as very aggressive mechanisms for latency hiding.  While programming platforms like RapidMind do abstract away much of this complexity, the details are interesting and useful to know when performance tuning.  For example, the article explains why using too much state in a GPU kernel can decrease performance due to resource oversubscription, and why it’s necessary to give a GPU lots (and lots and lots) of parallelism to work with for maximum performance.</p>
<p>The May special issue of the Proceedings of the IEEE on &#8220;Cutting-Edge Computing&#8221; also contains a number of useful articles, including one I wrote surveying scalable programming models, including the SPMD programming model used by RapidMind.   This article compares and contrasts a number of multi-core architectures (including CPUs, the Cell BE, and both NVIDIA and ATI GPUs), discusses both task and data-parallel programming models, and shows how the SPMD model maps efficiently onto all of these architectures.   This issue also includes good articles on the evolution of GPUs, a discussion of mobile GPUs, and a discussion of simulation, recognition, and synthesis workloads.</p>
<p>The final special issue, of IEEE Computer, does not directly target multi-core or many-core processors, but does discuss data-intensive workloads such as data mining and recognition tasks.   The workloads discussed include both image and string data search, and cover applications in GIS, medical imaging, and spam detection. These articles discuss strategies for parallelizing these workloads on many-core processors, including on the Cell BE and on GPUs.</p>
<p>Although we&#8217;ve been involved in this revolution since the beginning, it&#8217;s nice to see that the community is starting to take many-core architectures and programming more seriously.  There are tremendous opportunities here, and a fundamental shift in how computers are built and programmed is underway that will lead to vastly improved performance.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.rapidmind.com/2008/05/24/required-reading/feed/</wfw:commentRss>
		</item>
		<item>
		<title>How we get improved performance on a single core - Part 1</title>
		<link>http://blogs.rapidmind.com/2008/05/15/how-we-get-improved-performance-on-a-single-core-part-1/</link>
		<comments>http://blogs.rapidmind.com/2008/05/15/how-we-get-improved-performance-on-a-single-core-part-1/#comments</comments>
		<pubDate>Thu, 15 May 2008 20:23:58 +0000</pubDate>
		<dc:creator>Stefanus Du Toit</dc:creator>
		
		<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://blogs.rapidmind.com/?p=21</guid>
		<description><![CDATA[RapidMind is all about achieving the full performance potential of modern multi-core processors. Generally, total performance is a combination of two factors:
                          total performance = scalability across cores × per-core performance
It&#8217;s probably no surprise that RapidMind aims to provide excellent scalability across cores. What might surprise you is that we spend significant amounts of effort on [...]]]></description>
			<content:encoded><![CDATA[<p>RapidMind is all about achieving the full performance potential of modern multi-core processors. Generally, total performance is a combination of two factors:</p>
<p style="text-align: left;">                          total performance = <em>scalability across cores</em> × <em>per-core performance</em></p>
<p>It&#8217;s probably no surprise that RapidMind aims to provide excellent scalability across cores. What might surprise you is that we spend significant amounts of effort on single-core performance as well, and RapidMind-enabled apps often significantly outperform code written in C/C++ without RapidMind. Some of our customer case studies show improvements like &#8220;25x faster than non-RapidMind-code on 8 cores&#8221; – that 25x number is made up of perfect scaling across 8 cores, combined with a 3x performance advantage even on a single core.</p>
<p>How&#8217;s that possible? I&#8217;m going to explore this in my next few blog posts. Read on for the first reason why we get such good performance:</p>
<h4>Reason 1: Our programming model</h4>
<p><span id="more-21"></span>RapidMind is built entirely inside of C++. When you write RapidMind code, you&#8217;re writing C++, nothing else. The code you write, however, is written to use our primitives (&#8220;Values, Arrays and Programs&#8221;) to express its computational and data-related operations. We&#8217;re extremely careful when designing these primitives to avoid making decisions that make it difficult for our platform to optimize your code (while providing you with all the flexibility you need to express your computations, and without adding significant burden on your part).</p>
<p>For example, computations expressed in RapidMind don&#8217;t have &#8220;plain pointers&#8221; like C/C++, where a pointer could (potentially) point to any particular area in memory. Instead, you have pointers to the beginning of any given array (these are called Accessors and ArrayRefs in RapidMind parlance) and you combine these with indices to look up data in an array. This makes it much easier for the platform to determine when two references to memory might refer to the same location (this is known as <em>alias analysis</em>) which in turn makes it much easier for our backends to optimize your computations.</p>
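<p>A small C++ sketch shows why this matters. The first function uses plain pointers, so nothing in its signature rules out overlapping arguments; the second part models a reference as an (array, index) pair, where the aliasing question has a trivial answer. <strong>IndexedRef</strong> and <strong>same_location</strong> are hypothetical names for illustration and are not RapidMind&#8217;s Accessor or ArrayRef types.</p>

```cpp
#include <cstddef>

// With plain pointers, dst and src may overlap. If the caller passes
// dst == src + 1, every iteration reads a value the previous one wrote,
// so a compiler cannot freely reorder or vectorize this loop without
// first proving (or checking at runtime) that the ranges are disjoint.
inline void add_one(float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] + 1.0f;
}

// Hypothetical (array, index) reference: which array, and where in it.
struct IndexedRef {
    int array_id;
    std::size_t index;
};

// Two such references name the same memory only if both parts match,
// which makes the "might these alias?" question easy to answer.
inline bool same_location(IndexedRef a, IndexedRef b) {
    return a.array_id == b.array_id && a.index == b.index;
}
```

<p>Running <strong>add_one</strong> on a zero-filled buffer with the destination shifted one element past the source produces 1, 2, 3 rather than 1, 1, 1, which is exactly the kind of hidden dependence alias analysis has to rule out.</p>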
<p>Our programming model is also much more explicit about locality of computations and data than pure C++. This helps us allocate memory efficiently, optimize the right chunks of code together, perform cache optimizations, and much more. Good expression of locality is critical to achieving good scalability, but it&#8217;s also an invaluable aid in getting good single-core performance.</p>
<p>By designing our programming model to keep in mind what&#8217;s needed to generate efficient code on modern architectures, we are able to get a performance advantage over plain C/C++, which are burdened by language choices that can make it difficult for compilers and processors to do their jobs well. Luckily it turns out a lot of the choices that allow better scalability also allow better single-core performance, and vice versa. Additionally we do this without sacrificing the syntax, modularity concepts (such as functions, classes, and templates) and tools available to C++ developers.</p>
<p><em>Stay tuned for more reasons we can achieve such high performance per core in my next blog post!</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.rapidmind.com/2008/05/15/how-we-get-improved-performance-on-a-single-core-part-1/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Meaningful Benchmarks</title>
		<link>http://blogs.rapidmind.com/2008/05/13/meaningful-benchmarks/</link>
		<comments>http://blogs.rapidmind.com/2008/05/13/meaningful-benchmarks/#comments</comments>
		<pubDate>Tue, 13 May 2008 20:35:41 +0000</pubDate>
		<dc:creator>Dr. Michael McCool</dc:creator>
		
		<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://blogs.rapidmind.com/?p=22</guid>
		<description><![CDATA[Lately, I&#8217;ve been thinking about apples and oranges, as in the &#8220;comparing of.&#8221;  In other words, benchmarks. The RapidMind platform enables the development of high-performance software, and so we often need to quantify the performance improvements made possible by our technology. However, benchmarks must be done with care in order to be meaningful, and I [...]]]></description>
			<content:encoded><![CDATA[<p>Lately, I&#8217;ve been thinking about apples and oranges, as in the &#8220;comparing of.&#8221;  In other words, benchmarks. The RapidMind platform enables the development of high-performance software, and so we often need to quantify the performance improvements made possible by our technology. However, benchmarks must be done with care in order to be meaningful, and I want to discuss a few of the issues here and our philosophy in setting up good benchmarks.</p>
<p><span id="more-22"></span>Benchmarks involve comparing the performance of (at least) two things. For example, we may run the same program on two different processors to compare the performance of these processors, or we may compare two different implementations of the same algorithm on the same processor to compare implementation strategies, or we may compare the performance of two different algorithms for solving the same problem.</p>
<p>A benchmark is an experiment. Whenever possible, in order to get meaningful results from an experiment, we should vary one thing at a time, and control the other variables. Unfortunately, when comparing a serial implementation to a parallel implementation, we may have to change the algorithm in order to achieve a parallel implementation, as not all serial algorithms are parallelizable. The other problem is that when moving between processors, an implementation of an algorithm that is optimal for one processor may not be optimal for another. Finally, we often have to change many other things, such as the operating system or the compiler.</p>
<p>Since it may be necessary to change multiple variables when moving from one implementation to another, we have to compare the best possible performance available on either side of a benchmark. It is more reliable to compare the peak performance of tuned implementations than that of untuned implementations.</p>
<p>We use two strategies to achieve peak performance on either side of a benchmark when comparing a RapidMind implementation to a non-RapidMind implementation.</p>
<p>First of all, on the RapidMind side we can use autotuning. RapidMind implementations of algorithms can be parameterized, and these parameters can be chosen using an automatic tuning process to achieve an optimal implementation of that algorithm on a given processor. This is one of the ways that the RapidMind platform supports <em>portable</em> high-performance implementations of algorithms.</p>
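<p>In its simplest form, autotuning is a search over candidate parameter values that keeps the cheapest one. The sketch below is a generic illustration, not RapidMind&#8217;s tuning tool: <strong>autotune</strong> is a hypothetical helper, and the <strong>cost</strong> callback is assumed to build the variant for a given parameter and return its measured running time.</p>

```cpp
#include <functional>
#include <limits>
#include <vector>

// Hypothetical helper, not part of any RapidMind API.
// Tries every candidate parameter value and returns the one with the
// lowest cost. `candidates` must not be empty. In practice `cost` would
// run and time the implementation built with that parameter.
inline int autotune(const std::vector<int>& candidates,
                    const std::function<double(int)>& cost) {
    int best = candidates.front();
    double best_cost = std::numeric_limits<double>::infinity();
    for (int c : candidates) {
        double t = cost(c);
        if (t < best_cost) {
            best_cost = t;
            best = c;
        }
    }
    return best;
}
```

<p>Because the search is rerun on each target processor, the same parameterized source can settle on different settings for different hardware.</p>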
<p>Second, we need to seek an independently developed and tuned baseline for any benchmark. For example, in 2006 we did a financial benchmark in cooperation with HP&#8217;s High Performance Computing Division. We both started from the same unoptimized baseline code. Then, we did a RapidMind implementation and they tuned the baseline. In the end, after tuning and using appropriate compiler flags with Intel&#8217;s icc (an excellent optimizing compiler), the researchers at HPCD achieved over a 3x speedup over the original, unoptimized baseline code. However, the RapidMind implementation was over 32x faster when we ran on an NVIDIA 7800 GPU. Recently, we added x86 CPU support to RapidMind, and find we are still getting a 17x speedup when running against this tuned (single core) baseline on a dual Intel quad-core machine. In other words, RapidMind&#8217;s implementation is more than 2x as fast per core on the same processor. In addition, the RapidMind implementation is now over 140x faster than the baseline code when running on newer NVIDIA G80 GPUs.</p>
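<p>The &#8220;more than 2x per core&#8221; figure follows from dividing the total speedup by the core count, since a dual quad-core machine has 8 cores and the baseline ran on one of them. A tiny sketch of that arithmetic, with the hypothetical helper <strong>per_core_speedup</strong> and the assumption of perfect scaling across cores:</p>

```cpp
// Hypothetical helper: the per-core advantage implied by a total speedup
// over a single-core baseline, assuming perfect scaling across `cores`.
inline double per_core_speedup(double total_speedup, int cores) {
    return total_speedup / cores;
}
```

<p>With the numbers above, 17 / 8 = 2.125. If scaling across the 8 cores were less than perfect, the per-core advantage would be larger still.</p>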
<p>Since then we have done a number of other benchmarks, using baselines that were either independently developed (for example, in public libraries) or that we have requested independent partners to tune for us. It is important to note that by using tuned baselines, we can be confident that the results of our benchmarks are realistic, and in fact are conservative, speedups.</p>
<p>Of course benchmarks do not tell the entire story, since productivity and portability are also important considerations. However, if performance is the primary goal of moving to multi-core processors and accelerators, then we have to be able to measure it properly.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.rapidmind.com/2008/05/13/meaningful-benchmarks/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Performance: What&#8217;s it For?</title>
		<link>http://blogs.rapidmind.com/2008/05/01/performance-whats-it-for/</link>
		<comments>http://blogs.rapidmind.com/2008/05/01/performance-whats-it-for/#comments</comments>
		<pubDate>Thu, 01 May 2008 13:20:43 +0000</pubDate>
		<dc:creator>Dr. Michael McCool</dc:creator>
		
		<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://blogs.rapidmind.com/?p=18</guid>
		<description><![CDATA[It almost seems like a silly question: what good is higher performance?   The answer depends on the context.   Supercomputers, of course, are built for high performance, and are often built expressly to run application workloads that demand it.   But what good is high performance on ordinary servers and desktops, [...]]]></description>
			<content:encoded><![CDATA[<p>It almost seems like a silly question: what good is higher performance?   The answer depends on the context.   Supercomputers, of course, are built for high performance, and are often built expressly to run application workloads that demand it.   But what good is high performance on ordinary servers and desktops, and for what kinds of applications?</p>
<p>This is an important question for RapidMind, since we help developers squeeze every drop of performance out of all the processors in their computers.    Fortunately, many applications need or want all the performance they can get.</p>
<p><span id="more-18"></span>Note the two general categories of performance requirements: &#8220;need&#8221; and &#8220;want.&#8221;</p>
<p>Some applications just <em>need</em> performance.   They may have an absolute requirement, such as the need to meet a certain frame rate, to respond to real-time events within a certain latency, or to complete a financial analysis in time to close a trade.   For these applications, &#8220;slow&#8221; is equivalent to &#8220;broken.&#8221;</p>
<p>Then there are the applications that just &#8220;want&#8221; performance, to a greater or lesser degree.   There can be several reasons for this, but often this desire can be expressed as an improvement in efficiency.   Obtaining higher performance on the same hardware translates into higher space and power efficiency.   This in turn can translate into performing the same computation on fewer machines or with less power, or to tackling larger or more accurate computations within a fixed space and power budget.</p>
<p>For the data center, a desire for efficiency is easy to understand.   For the end consumer, improved power efficiency can mean better battery life for mobile devices, less heat, or a lower carbon footprint.   On the other hand, some computations can also be scaled up; bigger databases can be searched, higher resolution images can be processed, deeper (and more accurate) financial models can be built and processed within the trading window.</p>
<p>Towards the softer end of the want/need spectrum are interactive applications, where performance improves the responsiveness of the user interface.   Below a certain level, the user interface becomes unusable and non-interactive.   Conversely, as performance is improved, new types of interactive applications become feasible.   It is now becoming possible, for example, to deploy interactive applications that use real-time physics simulation, machine learning, or computer vision.</p>
<p>To put this in perspective, RapidMind has been able to achieve one or even two orders of magnitude improvements in performance, relative to applications written using traditional approaches on the same hardware.   We also target multi-core devices, and processors are going to continue scaling up in performance following Moore&#8217;s law by adding more cores.</p>
<p>What can an additional two orders of magnitude do for the efficiency of existing applications, and what new applications will such radical improvements in performance enable? The answer is, improvements in performance will enable completely new kinds of interactive applications to be built.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.rapidmind.com/2008/05/01/performance-whats-it-for/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
