Sunday, April 12, 2009

nVidia talks Larrabee

Here:
http://www.fudzilla.com/index.php?option=com_content&task=view&id=13118&Itemid=1
And here for the full details:
http://bits.blogs.nytimes.com/2009/04/09/hello-dally-nvidia-scientist-breaks-silence-criticizes-intel/

There are interesting bits in there, and there are just plain wrong bits in there.

Clocks, Power, and Heat

Yes, it's true that CPUs have hit a point where going faster on a particular process means an unacceptable increase in power and heat.

No, GPUs are not clocked anywhere near that fast.

For instance, Intel CPUs reached a core clock of 3.8 GHz before this "wall" was identified. And it was not really an impassable wall; it's just the point past which consumer machines cannot accept more power and heat and still be effective for consumers.

GPUs, like the nVidia GTX 295, have a core clock on the order of 600 MHz to 1 GHz. True, the memory clock is faster, but you get the idea.

In terms of power and heat, some generations of GPUs have required massive cooling fans and two power connectors.

So the power and heat issue is clearly one that all vendors have run into.

Many smaller cores versus fewer larger cores

Yes, lots of smaller cores operating in parallel can be a more effective processing unit on certain types of workloads given the right software.

Yes, that is exactly the Larrabee approach, i.e. throw lots of smaller cores together at the problem. Each core isn't as powerful as a current CPU core, but taken in aggregate, and again with the right software, this approach can yield impressive benefits.

Nothing new here, Intel is merely doing something similar to what the GPU vendors have been doing (massively parallel smaller processors) with a twist (much more programmatic freedom).

What Intel has done is recognize a programming paradigm that helps programmers take advantage of many-core processing power by making parallel code easier to write, and then make silicon plans to exploit that paradigm.

Graphics is inherently parallel, which helps GPUs perform so well.

One largely unrecognized reason GPUs perform so well is that the workload they operate on is inherently parallel, and the parallelization happens behind the scenes.

This lets GPUs sidestep the main reason parallel programming for multi-core CPUs is lagging: it's hard for programmers to parallelize their code. If the programmer doesn't have to deal with parallelization and gets it for free, well gee, that's an obvious win.

How do GPUs do that?

The standard graphics APIs, Direct3D and OpenGL, give programmers access to the parallel processing potential of GPUs simply by using the API.

And the GPU vendors have benefited from this "auto-parallelization" in the API, driver, and hardware: they get parallel workloads without graphics programmers (typically games programmers) having to hurt their brains parallelizing their algorithms.

Now, it's arguable that 3D programming is also somewhat mind-bending, but let's not get distracted by that just yet and instead see how this voodoo works.

Vectors are 4-tuples; that is, they contain x, y, z, w components. Vector operations work on each component, and programmers get parallelism essentially for free without having to think about it. This is SIMD: single instruction, multiple data. It happens "behind the API" without programmer intervention. Simply queue up your triangles with their three vertices (vectors) each, and the associated math gets essentially auto-vectorized in the vertex pipeline.
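
To make that concrete, here is a rough sketch of the kind of 4-wide, component-wise math the vertex pipeline performs on every vertex behind the API. I'm writing it in CUDA-flavored C purely for illustration; madd4 is my own made-up helper, not anything from a real driver or API.

// Illustrative only: one "operation" touches x, y, z, and w together.
#include <cuda_runtime.h>
#include <cstdio>

__host__ __device__ float4 madd4(float4 v, float4 scale, float4 offset)
{
    float4 r;
    r.x = v.x * scale.x + offset.x;   // the same multiply-add is applied
    r.y = v.y * scale.y + offset.y;   // to all four components at once;
    r.z = v.z * scale.z + offset.z;   // that is the SIMD idea
    r.w = v.w * scale.w + offset.w;
    return r;
}

int main()
{
    float4 v = make_float4(1.0f, 2.0f, 3.0f, 1.0f);
    float4 r = madd4(v, make_float4(2.0f, 2.0f, 2.0f, 1.0f),
                        make_float4(0.5f, 0.5f, 0.5f, 0.0f));
    printf("%f %f %f %f\n", r.x, r.y, r.z, r.w);   // 2.5 4.5 6.5 1.0
    return 0;
}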

Beyond the vector processing, triangles and pixels can also be parallelized and spread across the many smaller cores in GPUs. Triangles typically have no knowledge of each other and are independent units, which is great for auto-parallelization. For pixels, think of the screen as a rectangular grid; the processing for each pixel can be farmed out to a separate processor when certain types of rendering are being performed.

The API and driver can manipulate these units of work (vectors, triangles, pixels) in a parallel manner without the programmer knowing what's going on. Now, it's true the API does impose some restrictions to make this work (the bit about triangles not knowing about each other, for instance), but you learn to live with those pretty quickly as a graphics programmer.
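
For the pixel case, a sketch helps. This is written as a CUDA-style kernel only because that is the most compact way to show it in C; in the real graphics pipeline the API, driver, and hardware do this scheduling for you, and shade() here is just a hypothetical stand-in for whatever per-pixel work a shader would do.

#include <cuda_runtime.h>

// Stand-in for real per-pixel shading math.
__device__ float4 shade(int x, int y, int width, int height)
{
    float r = (float)x / (float)width;
    float g = (float)y / (float)height;
    return make_float4(r, g, 0.0f, 1.0f);
}

// One thread per pixel; no pixel knows anything about its neighbors,
// so the hardware is free to spread the work across its many cores.
__global__ void shadePixels(float4* image, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        image[y * width + x] = shade(x, y, width, height);
}

int main()
{
    const int width = 640, height = 480;
    float4* image = 0;
    cudaMalloc((void**)&image, width * height * sizeof(float4));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    shadePixels<<<grid, block>>>(image, width, height);
    cudaDeviceSynchronize();

    cudaFree(image);
    return 0;
}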

What about other parallel work?

Good question.

There is this “thing” called GPGPU. General Purpose GPU. The research there is around using the GPU for non-graphics workloads.

This area really started to get “hot” around 2004 because Shader Model 3.0 had enough flexibility to let GPGPU tackle some interesting problems.

The main issue there was that you had to warp your brain around abusing the graphics API to do general purpose computing. You basically rendered one fullscreen rectangle to get a parallel computation per pixel (your array of threads for the calculation) and abused textures as storage arrays (dealing with 0..1 versus -1..1 range issues depending on the texture formats accepted by various operations and parts of the pipeline). If you could stand that brain-warp, you could do GPGPU in your pixel shader.

However, the number of programmers who could perform such brain-warping was limited. GPGPU was mainly a research topic for graduate students at various universities.

So the graphics API had become a limiting factor in extending the success of the GPU beyond graphics.

In 2007, almost coincident with the release of Vista and several months after the launch of the G80 chip (the 8800 family; I was there on stage with a presentation as a TWIMTBP partner representing Flight Simulator), nVidia launched CUDA (Compute Unified Device Architecture), a new API that enables GPGPU-style programming without having to go through the graphics APIs.

CUDA introduces both a new little language (C-like) and a programming API and style that extend the power of the GPU's many smaller cores to workloads other than graphics.
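
To give a flavor of it, here is a minimal toy example of my own (a sketch, not lifted from nVidia's documentation). The little-language additions are the __global__ qualifier, the <<<blocks, threads>>> launch syntax, and the built-in block/thread indices; everything else is plain C.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// One array element per thread; no loop over the data in the kernel.
__global__ void addArrays(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host-side data.
    float* ha = (float*)malloc(bytes);
    float* hb = (float*)malloc(bytes);
    float* hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device-side copies.
    float *da, *db, *dc;
    cudaMalloc((void**)&da, bytes);
    cudaMalloc((void**)&db, bytes);
    cudaMalloc((void**)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover all n elements.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    addArrays<<<blocks, threads>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);   // expect 3.000000

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}

Note that this is the same parallel work per element as the fullscreen-rectangle trick above, but with no graphics API in sight.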

However, there are still restrictions, on both the language and the tools. The C dialect in CUDA is still a "little language". See
http://en.wikipedia.org/wiki/CUDA
for more details on the language, the API, and the limitations. Some issues it does not cover, for instance the lack of great debugger support, are only slowly being addressed. Even the detailed iXBTLabs article
http://ixbtlabs.com/articles3/video/cuda-1-p1.html
does not cover the tools/debugger issue in depth.

Still, even with the language and tool restrictions, this is a huge improvement over the graphics API abuse that was previously the only way to do GPGPU work.

Fast forward to late 2008: both the Direct3D and OpenGL camps have responded to this innovation, Direct3D 11 with Compute Shaders and the Khronos (OpenGL) group with OpenCL.

So we have gone from a desert island to an embarrassing plethora of support in 2+ years; GPGPU programming, which lets programmers access the many cores of the GPU "for free" for non-graphics workloads, is now roughly a peer of graphics programming.

What is a “Little Language”?

Yes, I have now used that term twice without a definition. Shame on me.

The term Little Language was introduced by Jon Bentley of Bell Labs in a Communications of the ACM article and in his Programming Pearls books (highly recommended). I quote from the page I just linked:

One spin-off of the UnixDesignPhilosophy was the realization that it is easier to implement a task-specific language optimized for that task than it is to implement a general-purpose language optimized for all possible uses

Language is a tool that programmers should and do use. The introduction of shader programming in Direct3D was a landmark extension of that tool to graphics programming, and all the shader models are by definition "little languages".

CUDA, ComputeShader, and OpenCL are all “little languages” for general purpose computing.

However, little languages have their issues. They are by design not general, and thus not intended for all possible uses.

And thus we see the rub.

Programmatic Freedom

Programmatic freedom is amongst the enticing things that Larrabee brings to the table.

What do I mean by that?

We are already up to Shader Model 5. And now we have Compute Shader 1.

At what point is the endless "revising" of the "little languages" for graphics and compute, which are by design limited for general purpose workloads, going to run up against the need for full generality? Shader Model 15? Compute Shader 3?

The fact that it will is incontrovertible. The question is then merely when.

Larrabee jumps past all of that and gives programmers full C/C++ language support now.

There is no back-room deal between the IHVs and the API developers that limits what programmers get to a "least common denominator" vanilla flavor that runs great but is less filling. You get it all, and you get it now, before the convergence of Shader Model n and Compute Shader m.

Now, having that generality does provide a lot of rope to potentially hang yourself, but that is nothing new to skilled programmers.

So while Larrabee is a nod to the progress of the GPU, it does leap past the “think little” mentality that has permeated the graphics IHV teams and the graphics API teams. And that’s without considering issues like memory architecture and caches, tool support, and a host of other features for which Larrabee brings goodness to the table.

So to close this post down, everyone in the chip business today (Intel, AMD, ATI, nVidia) has the clocks, power, and heat issue. And everyone wants to reach the nirvana of enabling parallel processing for general purpose workloads without heroic programmer intervention.

Larrabee is a leap ahead in many areas as an attempt to give programmers this functionality. Is it perfect? No. But I do think it's going to impress programmers when they get their hands on it.

Oh, one final thing. It says so on the banner page but let me reinforce it - these are my own opinions, no one else’s, and I speak for no one but myself when I present an analysis like this. So any attempt to conflate my opinion with Intel’s is flawed and just plain wrong.

Happy Easter.
