High Performance Graphics and SIGGRAPH 09 report

Last week was the 3-day High Performance Graphics conference, held just before SIGGRAPH. It’s a small conference, with about 250 attendees, but one of particular interest for GPU algorithm research. It’s true that the focus is on graphics applications, but in practice the problems of graphics tend to be similar to the problems of supercomputing, with lots of emphasis on data representation, memory organization, task scheduling, and I/O.

The 23 papers are mostly available online. As an interesting sample of what platforms people are concentrating their research on, here’s a breakdown of the hardware and environment the papers target.

CUDA: 9 papers
OpenGL/DirectX: 5 papers
CPU: 4 papers
Larrabee: 2 papers
Custom hardware: 2 papers

Analyze the list above any way you like… though it may be noted that the only people who even mentioned AMD GPUs were from AMD. (They presented a paper on 2D edge antialiasing in DirectX).

By far the most interesting paper was by Timo Aila and Samuli Laine of Nvidia. It basically asked: we all know that memory bandwidth is the limiting bottleneck of spatial traversal on the GPU, but how much of a bottleneck is it really? Their clever approach was to take a working CUDA GPU raytracer and record step-by-step statistics of its behavior, so a simulator could “replay” the computation with varying memory behavior, thread work strategies, and so on. And the surprising conclusion was that memory is not the worst bottleneck; instead, thread work divergence was the main inefficiency. This allowed them to experiment with per-thread work pools that grab new jobs as necessary. That adds overhead, but it combats the asymmetric workloads so well that there’s a net 2X speed increase. The analysis method could likely be applied to many types of GPU problems, from particle simulation to finite element computation.
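To see why work pools fight divergence, here’s a toy scheduling model (not their simulator; the warp count and batch costs are made up). Jobs are modeled at batch granularity, a batch’s cost being set by its slowest ray, and we compare static up-front assignment against warps that grab the next batch from a shared pool whenever they go idle:

```python
import heapq

def static_makespan(costs, num_warps):
    """Each warp is pre-assigned a contiguous chunk of batches up front."""
    chunk = len(costs) // num_warps
    return max(sum(costs[i * chunk:(i + 1) * chunk]) for i in range(num_warps))

def persistent_makespan(costs, num_warps):
    """Warps atomically grab the next batch from a global pool when free."""
    free_at = [0] * num_warps
    heapq.heapify(free_at)
    for c in costs:
        t = heapq.heappop(free_at)       # the first warp to go idle...
        heapq.heappush(free_at, t + c)   # ...takes the next batch
    return max(free_at)

# Batch costs: one warp's static share happens to be full of slow rays.
costs = [9] * 4 + [1] * 12
print(static_makespan(costs, 4))      # 36: one warp does all the slow work
print(persistent_makespan(costs, 4))  # 12: slow batches spread across warps
```

The pool doesn’t make any individual batch faster; it just stops warps that drew short work from sitting idle while one unlucky warp grinds through a long tail.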

The other impressive paper was from Stanford and Intel, about using Larrabee for REYES-like rendering. (This is a method of tessellating objects in screen space to render them in batches; it’s used by Pixar’s Renderman renderer.) The crazy idea of doing this in real time has significant (and surprising) promise. However, after I saw this paper, there were multiple people all agog over another truly impressive REYES rendering system paper (RenderAnts, built on top of CUDA) coming up at SIGGRAPH Asia. (More on RenderAnts later…)

The third most interesting paper (IMHO!) was one doing photon mapping in real time on the GPU… for video games. Again, this is a classic algorithm you never expect to hear the words “real time” used with. And even if you do, you don’t expect it for a game! But yet again, this paper was somewhat eclipsed later in the week.

There were two keynotes, both of which really accentuated the fact that this is a huge transition period in graphics. Larry Gritz of Sony Imageworks (formerly of Pixar, where he was the chief architect of Renderman for many years, and also formerly of NVIDIA, where he was architect of Gelato) discussed why (and how) he’s moving his focus for high-quality (movie production) rendering from REYES-style rendering to raytracing. (Answer: raytracing is fast and flexible, and the advantages of REYES are diminishing as models grow more and more complex.)

The second keynote was Tim Sweeney (architect of the Unreal game engine), discussing how powerful GPUs have become and how the (infinite) demands of games mean that all parts of computation need to move to massively parallel engines. He was enamored of Larrabee and its flexibility.

SIGGRAPH followed immediately afterwards. And even though that’s a huge conference, it wasn’t as interesting as HPG.

Again, GPU computation is now ubiquitous… it’s certainly a tool that’s used in research all the time. You no longer see papers introducing GPUs as some exotic new tool, saying “look, I took a classic algorithm and made it work on the GPU!” but instead “here’s a new algorithm; we implemented it on CPU and GPU.”

One rather spectacular exception was “An Efficient GPU-based Approach for Interactive Global Illumination” which was a great, great systems paper. This took four or five big, complex, difficult algorithms (any one of which is a HUGE problem for the GPU and worthy of research on its own) and implemented ALL of them on the GPU… simultaneously! From building a spatial hierarchy (on the GPU) to raytracing (on the GPU) to photon mapping (on the GPU) to adaptive clustering (on the GPU) to final gathering irradiance estimation (on the GPU)… it was by far (like by 3 or 4 times) the most complex problem I’ve ever seen attempted on the GPU… and they succeeded. Even the questions at the end of the talk were amusing… one questioner was confused and kept asking “so you’re doing the photon mapping on the GPU, but the raytracing, that was CPU, wasn’t it? But the object hierarchical tree, that was precomputed, right?” It was hard to accept that all the steps were on the GPU and the net result was still realtime.

Kun Zhou is a coauthor of this paper, as well as the RenderAnts paper I alluded to before. RenderAnts won’t be presented until December, but from the paper, it’s a complete and complex system, clearly more complete and polished than Intel’s REYES presentation at HPG. Even Renderman shading is translated into GPU shading! The fact that RenderAnts’s output is a direct comparison to real Renderman scenes has a powerful “WOW!” impact. And Kun Zhou’s publications page is just filled with other eye-opening GPU gems… he’s clearly doing amazing GPU work with a really talented group. (Take a look at his GPU stream debugging paper… just fantastic stuff: creative and useful!)

Some other random snippets:

Johan Andersson, rendering architect for DICE (the Battlefield games), gave a talk about game computational bottlenecks and parallel tasking. He had a fascinating slide showing the data (and task) dependencies for rendering a single frame of a game, covering everything from collision detection to “decal” projection to particle simulation to AI to all the (many!) rendering layers and steps. His talk nicely covered just how important scheduling is, and how GPUs have such promise but need more adaptability and more diverse task parallelism. He repeated the term “braided parallelism” to capture the idea that tasks branch out into parallel pieces, but those pieces need to remerge and synchronize their results before fanning out again for the next parallel stage; you really have a fractal hierarchy of these branch-outs and merges, so it’s like a tangle. He didn’t actually say “Larrabee!” but he was certainly talking about how such complex task hierarchies would be handled better by GPUs like Larrabee.
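The basic fork/join shape of braided parallelism can be sketched in miniature (the stage and task names below are hypothetical stand-ins, not DICE’s actual frame graph): each stage fans its tasks out in parallel, then joins before the merged results feed the next stage.

```python
from concurrent.futures import ThreadPoolExecutor

def run_stage(pool, tasks, inputs):
    """Fan out one 'braid': run tasks in parallel, then join (synchronize)
    before the next stage may start on the merged results."""
    futures = [pool.submit(task, inputs) for task in tasks]
    return [f.result() for f in futures]   # the join/barrier

# Hypothetical frame stages; each task sees the merged output of the last stage.
def physics(prev):    return sum(prev) + 1
def particles(prev):  return sum(prev) + 2
def visibility(prev): return sum(prev) * 2
def shading(prev):    return sum(prev) * 3

with ThreadPoolExecutor(max_workers=4) as pool:
    frame = [0]
    for stage in [[physics, particles], [visibility, shading]]:
        frame = run_stage(pool, stage, frame)  # barrier between stages
print(frame)  # [6, 9]
```

In a real engine each of those tasks can itself fork and join internally, which is exactly the nested, fractal structure that’s awkward to express on today’s GPUs.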

So this theme of scheduling, task management, and data dependencies came up several times during the week. Even Timo’s HPG raytracer analysis paper effectively came down to per-thread job scheduling. It’s clear that future GPU programming will start to really wrestle with efficient task handling and efficient job switching both on the data and code level. Kun Zhou’s big GPU successes are more evidence that work and task scheduling is important for any large GPU system.

Another interesting short talk was about depth peeling on the GPU. This is a method used mostly for order-independent transparency, which always requires some care with OpenGL-style rasterizers. There was one paper at HPG using OpenGL and bucket sorting, but this second talk by another author (at SIGGRAPH) used CUDA. This is initially a rather amusing premise: doing polygon rasterization in CUDA!? Without using the fixed-function hardware that you already have, and that is designed for it!? But for the case of depth peeling, the brute-force rasterization is awkward since it needs so many passes. The second incredible premise of the talk was that his CUDA rasterizer used global atomics for all of its per-pixel accumulations, which is a crazy idea since atomics aren’t meant for such massive abuse. But I guess I’m the crazy one, since he got excellent performance, easily beating the multi-pass rasterization method. It’s quite fun to see your first skeptical impressions of an idea disproved by successful results!
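The contrast between the two strategies can be sketched abstractly (a toy model, not either paper’s implementation: fragments are just pixel/depth pairs, and where the CUDA version would use global atomics this sketch appends to per-pixel lists):

```python
def peel_layers(frags, num_pixels, max_layers):
    """Classic multi-pass depth peeling: each pass re-rasterizes everything
    and extracts, per pixel, the nearest fragment not yet peeled."""
    peeled = [[] for _ in range(num_pixels)]
    floor = [float("-inf")] * num_pixels   # depth of last peeled layer
    for _ in range(max_layers):
        nearest = [None] * num_pixels
        for pix, depth in frags:           # one full rasterization pass
            if depth > floor[pix] and (nearest[pix] is None or depth < nearest[pix]):
                nearest[pix] = depth
        for pix, d in enumerate(nearest):
            if d is not None:
                peeled[pix].append(d)
                floor[pix] = d
    return peeled

def single_pass(frags, num_pixels, max_layers):
    """CUDA-style alternative: one pass that accumulates every fragment into
    a per-pixel bucket (atomics in the real thing), then sorts each bucket."""
    buckets = [[] for _ in range(num_pixels)]
    for pix, depth in frags:
        buckets[pix].append(depth)         # atomic append on real hardware
    return [sorted(b)[:max_layers] for b in buckets]

frags = [(0, 0.5), (0, 0.2), (0, 0.8), (1, 0.3), (1, 0.1)]
print(peel_layers(frags, 2, 2))   # [[0.2, 0.5], [0.1, 0.3]]
print(single_pass(frags, 2, 2))   # same layers, one geometry pass
```

The payoff is visible in the structure: peeling walks the whole fragment stream once per layer, while the atomic version touches each fragment exactly once.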

Finally, before SIGGRAPH, I was most interested in seeing just what Caustic Graphics was up to. They’ve made a lot of PR noise the past few months, but their tech wasn’t so clear; it’s not like they had a product available. They gave a short talk at HPG, and also had a small booth on the SIGGRAPH floor. While their product is still deliberately vague, it seems to be a combination of an FPGA hardware accelerator for raytracing with some clever middleware software to feed the FPGA with rays and keep scheduling new rays (often requested by surface shaders). This is one of the two basic approaches to dealing with dynamic ray scheduling, similar to Matt Pharr’s TORO renderer from 1996. Here Caustic uses ray work and result lists to fight divergence losses. This means that all rays need to be sent to the hardware, then back to the CPU, which can either run a shader on them itself or pass them to the GPU for shading; any new dynamic rays are then accumulated and fired back into the FPGA work queue. This is a standard work strategy… the alternative is computing shading on the fly at intersection time and using a stack of pending rays. That alternative is much, much more I/O-efficient, but can have painful divergent warp inefficiencies. What Caustic has produced is a nice transparent library for their trace/schedule/shade system, so you as the app programmer don’t need to worry about the communication so much. The other nice feature (and about the only one that has a good demo) is JIT compilation of shaders, so you can write a shader and immediately see its effect on your scene in a second or two. From Caustic’s sales talk, it sounded like they were very proud of the scheduling system, but if you think about their FPGA hardware, they had no choice but to use this scheduling strategy, since an FPGA can’t run generic shading code anyway.
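The batched trace/schedule/shade round-trip can be sketched like this (everything here is a made-up stand-in; Caustic’s actual API isn’t public, and a real shader would compute radiance rather than these toy weights):

```python
from collections import deque

def trace(ray):
    """Stand-in for the hardware tracer: returns a hit record per ray."""
    return ray  # toy model: the 'hit' is just the ray record itself

def shade(hit, bounce, max_bounces):
    """Stand-in shader: contributes some radiance and may request secondary
    rays, which must go back into the next batch's work queue."""
    secondary = []
    if bounce < max_bounces:
        secondary.append({"weight": hit["weight"] * 0.5})  # e.g. a reflection ray
    return hit["weight"], secondary

def render(primary, max_bounces=2):
    queue, total, bounce = deque(primary), 0.0, 0
    while queue:                            # one batched round-trip per bounce
        hits = [trace(r) for r in queue]    # ship the whole batch to the tracer
        queue = deque()
        for h in hits:
            contribution, new_rays = shade(h, bounce, max_bounces)
            total += contribution
            queue.extend(new_rays)          # dynamically spawned rays, next batch
        bounce += 1
    return total

print(render([{"weight": 1.0}]))  # 1.0 + 0.5 + 0.25 = 1.75
```

Note how every ray makes a full trip out to the tracer and back before its children are even enqueued; that’s the I/O cost of this strategy, traded against immunity to per-warp divergence.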

So Caustic was interesting but not revolutionary, roughly what I expected. The other new contender is OptiX, the renamed NVIRT project from NVIDIA. Last year, Dave Luebke and Steve Parker showed their CUDA raytracer at SIGGRAPH 08 and at NVISION. I had gotten the impression that they’d release their tracer as a kind of plugin for the NVIDIA scene graph, and when it never materialized, it just seemed like a minor project that had been delayed. But it was revealed last week when Steve Parker gave a (packed, standing-room-only) talk on the new system. The reason it took a year is that it’s not just some simple wrapper around a ray tracer like I expected. Instead it’s a quite clean and simple API… and that API is fully exposed to shaders. The major feature is that shaders themselves are first-class objects with full control over rays; while the raytracing core is a major part of OptiX, it’s even more a compiler and library for those shaders. Shaders are written in CUDA (!) and have full control over lighting, shading, spawned rays, even custom intersection methods. This was really quite unexpected, and it was very smoothly presented by Steve Parker with simple code examples (and a nice variety of running examples on the show floor). While Caustic and OptiX are not the same thing, the flexibility of OptiX, combined with cheap hardware and support from NVIDIA, basically means that Caustic isn’t first to market, and they are already trailing in appeal, cost, generality, and risk.

After seeing OptiX, it’s clear that several of the HPG papers were directly supporting its ray tracing system… NVIDIA presented two papers on ray tracing efficiency in CUDA (including that great analysis paper from Timo and Samuli I mentioned before) and one ray tracing applications paper from Austin Robison and Pete Shirley. So OptiX is already quite useful for research applications!

HPG and SIGGRAPH made for a full week, and I saw far more than I’ve mentioned in this post, but these highlights should give some observations and summaries that might be useful.


I encourage everyone to read Timo and Samuli’s paper. You’ll learn some things that you probably don’t already know. (well, maybe not Sylvain. :) )

Very interesting read indeed. This could have pretty big consequences for radiative transport via Monte Carlo simulation, where >10^6 particles have to be simulated, each with a different history (in both divergence and length).

Thanks for pointing it out!

Here are the slides for Johan Andersson’s talk. Slide 7 shows that data and task dependency graph I was talking about… it’s an awesome visual, and it makes you quickly realize that for complex simulation we need to get scheduling onto the GPU, so such interdependencies can be handled without having to wait for Mr. Pokey CPU to wake up, think, and send its decisions back to the GPU about what to do next.

It’s also strong evidence for having a GPU that can run multiple kernels at once. While current GPUs can sort of do this now with superkernels, that’s not elegant, and it also means that every subkernel is limited by the register and shared memory requirements of the worst subkernel inside your superkernel.
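For anyone who hasn’t seen the trick, the superkernel pattern is just one big launch that branches per block on a task type (a sketch with hypothetical subkernels; on real hardware the merged kernel’s register/shared-memory footprint is that of the worst branch, hurting occupancy for every other branch too):

```python
def subkernel_a(data):
    return [x + 1 for x in data]   # hypothetical light-weight task

def subkernel_b(data):
    return [x * 2 for x in data]   # hypothetical heavier task

def superkernel(blocks):
    """One giant 'kernel launch'; each block dispatches on its task type.
    All subkernels must live inside the single merged kernel body."""
    results = []
    for task_type, data in blocks:      # each block = one queued task
        if task_type == "a":
            results.append(subkernel_a(data))
        else:
            results.append(subkernel_b(data))
    return results

print(superkernel([("a", [1, 2]), ("b", [1, 2])]))  # [[2, 3], [2, 4]]
```

With true concurrent kernels, each task type could launch separately with its own resource footprint, which is exactly what this workaround can’t give you.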

Well, I’ve given Aila’s paper a more in-depth read, and I have come to the conclusion that I don’t know where the acceleration comes from!

From the pseudocode at the end:

__global__ void kernel()
{
    // variables shared by entire warp, placed in shared memory
    __shared__ volatile int nextRayArray[BLOCKDIM_Y];
    __shared__ volatile int rayCountArray[BLOCKDIM_Y] = {0};
    volatile int& localPoolNextRay = nextRayArray[threadIdx.y];
    volatile int& localPoolRayCount = rayCountArray[threadIdx.y];

    while (true) {
        // get rays from global to local pool
        if (localPoolRayCount == 0 && threadIdx.x == 0) {
            localPoolNextRay = atomicAdd(globalPoolNextRay, B);
            localPoolRayCount = B;
        }
        // get rays from local pool
        int myRayIndex = localPoolNextRay + threadIdx.x;
        if (myRayIndex >= globalPoolRayCount)
            return;
        if (threadIdx.x == 0) {
            localPoolNextRay += 32;
            localPoolRayCount -= 32;
        }
        // init and execute, these must not exit the kernel
        fetchAndInitRay(myRayIndex);
        trace();
    }
}

If I understood the paper correctly, they state that exceptionally long rays hold the rest of the execution units hostage. That sounds about right. But how is the above “persistent thread” implementation going to reduce the cost of long-running rays?

If the trace() function does take exceptionally long to run, it will still hold the rest of the warp hostage until it completes.

From their paper:

Isn’t the scheduler doing the same thing? If we see the multiple threads as the “pool,” the scheduler picks a warp to execute. If that warp takes a long time to finish, other warps can still be launched on other units.

It’s not like an individual SP can fetch work from the global pool when it is ready. There is still the need for the whole warp to go at the speed of the slowest.

If anyone can shed some light onto this…

When a CTA (block) completes, what are the requirements that must be met before another CTA can take its place?

That’s the important part, not the behavior of individual warps.

The general idea is, if you didn’t do this - ALL warps would have to wait for the longest warp to complete - before a new CTA is launched, and ANY more work is done.

With this method, other warps in the CTA can pick up a new work load well before the ‘longest running’ warp finishes, instead of idling doing nothing (wasted cycles).

You’re right though, that it doesn’t at all help speed things up inside each individual warp, but there’s nothing you can do with SIMT-style parallel programming in this case - the core of the idea is to stop shorter-running warps from idling (wasting cycles).

Ah yes, makes sense. I didn’t think the whole CTA had to be done before warps from another one could be launched, but it should have been obvious once you take shared memory into consideration. Interesting stuff.

Thanks, you two!