Sorting bottleneck in rendering fractals a call for ideas

Hi everybody,

just a little call for ideas. Just for fun I am trying to accelerate fractal rendering with the GPU. The fractals are generated using a random descent into an iterated function system (playing “the chaos game”).

This rendering consists of three phases:

  1. Iteration phase:
    Based on an iterative random process I build a list of screen coordinates and resulting color contributions. Each x/y coordinate is assigned a “bin” identifier (which simply represents a pixel). These screen coordinates are badly scattered due to the random method of their generation

  2. Sorting phase:
    Sort the color contributions by bin (pixel) identifier. I am using the fast radix sort from the SmokeParticles demo.

  3. Color accumulation phase:
    For each screen pixel, go through all bin entries that correspond to this pixel, and add the color contributions. To find the first entry for a pixel, I use a fast binary search. Then I write the result into the frame buffer.

Phase 1) is fast and writes the pixel log using fully coalesced memory access.

Phase 2) seems to be the major bottleneck. Consumes 90% of the entire time.

For Phase 3) to be fast, I need the color contributions to be sorted by screen pixel, so I can accumulate them up in shared memory and use coalesced writes for the results. Without doing any sorting, I would have major performance issues because the writes would be scattered all over the frame buffer, I would have to use atomic writes to prevent collisions, etc…

Can you think of any other way, to perform some kind of pre-binning of the pixels in phase 1) already, without breaking memory coalescing during the writes? I would really like to shorten the huge sorting time of phase 2)

Christian

I don’t know if the SmokeParticles demo has been updated or not, but I believe that the sorting methods in the latest version of CUDPP are (currently) the fastest available. Perhaps you could try using them to see if it speeds up “Phase 2” at all.

Well it shouldn’t matter much if I used thrust::sort, CUDPP V1.1 sort or the radix sort from SmokeParticles. They’re all good radix sort implementations I believe. Effectively I am sorting a list of unsigned 32 bit integers. I squeezed my bin number / color entries into single 32 bit integers to save bandwidth. The bandwidth is the limiting factor in sorting performance and can’t be increased other than by upgrading to a faster CUDA device.

I have added some strategic cudaThreadSynchronize() calls to my timing. Before that I was judging performance solely by relying on fprintf(stderr, “…”) output, which was misleading.

On my nVidia GT 220 card

Phase 1 typically takes 70-100 ms

Phase 2 typically takes 350 ms (sorting about 44 million elements per second)

Phase 3 typically takes 75-200 ms, depending on the “local concentration” of the fractal shape.

So it is not as bad as I originally made it sound. Sorting accounts for maybe one half to two thirds of the entire algorithm run time.

The whole thing is repeated a couple of times before the fractal achieves good enough quality (i.e. low noise) to be finally rendered to screen. The whole fractal takes about 5 seconds to render completely at a 1280x800 screen resolution (2x spatial oversampling applied on each axis). I was hoping for a little more speed.

Christian

you guys seen fractron 9000?

Yes. Seen it. Excellent. Want to program my own stuff though. I am a programmer, not a user. (why does that remind me of the Tron movie?)

Also this tool is wasting time by updating the screen while rendering - and it also takes a decent amount of time to render a noise-less fractal at high resolution.

The editing features are top notch though.

Why wouldn’t you use the graphics API for rendering the points? It’s hard to beat the hardware rasterizer for scattering into a 2D buffer!

Umm until now I thought the rasterizer is only good at gather operations (texture reads, dependent texture reads, etc…), but it always has to write to the output pixel given by the rasterizer (admittedly my knowledge stops at the DirectX9 level). I thought CUDA intruduced the brave new world of scatter.

UPDATE:

I did a little research and found an AMD paper describing a histogram algorithm running on the GPU that is based on a scattered write of point primitives using a vertex shader. The vertex shader determines the target pixel for the write, which can for example be based on a texture read from another source image (the source image from which to take the histogram).

http://developer.amd.com/media/gpu_assets/…on_preprint.pdf

This could be a way to go, but I am sceptical whether this would beat CUDA performance. For my destination buffer, I would have to use additive blending, for example by means of a pixel shader.

Christian

Okay, I’ve tried all three radix sorts I was able to get my hands on. I am sorting keys only (no values), 32 bit uints.

For tests with CUDPP I’ve replaced the old CUDPP headers and .a library files in the CUDA SDK 2.3 with the CUDPP V1.1 release version.

Numbers taken on nVidia 9600M GT (32 shaders, 4 SMs)

Note how CUDPP radix sort is faster when built in Debug mode (make dbg=1). Peculiar, huuuh?

cudpp sort (release):		   665.596008 ms

cudpp sort (debug):			 564.648010 ms (numbers fluctuating between this and 620 ms)

thrust::sort:				  564.656006 ms

SmokeParticles radix sort:	 666.138000 ms

in comparison the other kernels

iterate_kernel: 97.428001 ms

merge_kernel: 204.858994 ms

I actually believe the SmokeParticles radix sort made it into CUDPP V1.1 with little or no changes because I had various symbol name collisions linking against both ;)

The winner is: Thrust V1.1!

It also has the simplest API to use. A one-liner, as opposed to plan generation or class instantiation for the other two APIs:

thrust::sort(thrust::device_pointer_cast(d_elements), thrust::device_pointer_cast(d_elements) + num_elements);

For flam4, I just used unsorted random writes without using atomics. Since there generally millions of pixels in an image, the frequency of collisions is low enough that there is no visible error due to collisions - using atomics to ensure that there are no collisions is serious overkill, particularly since the worst possible effect of a collision is that a given point would simply end up not added to the accumulation buffer.

In order to get any appreciable coherency in writing the pixels in a sorted fashion, you’d have to store tens or hundreds of thousands of points at a time, which will certainly end up off-chip. Even then, you’d have to do some magic to make sure that you wrote to a set of consecutive bins, which would end up losing coherency any time the bins did not have identical numbers of points in each one. This ends up causing you to use up to 3x more bandwidth, since you have to write out coordinates, read them back, and then write out points. In addition, the computation and the memory access are now serialized - the computation won’t overlap with the memory transaction. On top of this, the sort itself adds another log(numPoints) memory accesses for each point. Quite frankly, just accepting the incoherence will probably give better peformance than trying to do any sorting.

One thing that you can do to increase coherence with random points is to write the point’s contribution to shared memory, then have four threads (one per channel) write the point to memory. This would give you 4x better coherence than having each thread write its own points independently.

Ha! and that simplicity just bit me. The thrust API does not easily allow me to specify the number of significant bits in that are supposed to be used in the radix sort.

By dropping to 24 significant bits for sorting, I was able to get the other two sorting APIs down to ~450 ms of time because there is no need to also include the color map index (8 bits) in the sort.

To put it simply, let’s look at a half warp of points (16) and see how many transactions it would take to store them with either method:

1: Writing directly to memory, no sorting: 16 read/writes
2: Staging to shared memory, and writing with different threads per color channel, still no sorting: 4 read/writes
3: Storing points: 1 read/write to point list, several read/writes during sorting, at least 1 read/write during accumulation

It should be pretty clear that approach 2 wins over all. In fact, if the kernal is computation bound (as is flam4) then choice 3 would always come in dead last since the memory access and the computation can’t be done in parallel.

Thanks keldor for your postings, I will think about them tomorrow as it is pretty late here.

I am thinking about something like a hybrid approach where I only sort very coarsely into different screen tiles (possibly in the iterate kernel already). And then within each screen tile I would use shared mem atomics (on compute 1.2 devices) to avoid any collisions during the accumulation.

Here’s another person’s view on the topic:
http://strobe.cc/articles/do_androids_render

bye, good night.

UPDATE: by the way the timings allow me to compute a points throughput per second for the three kernels combined, which currently puts me around between 8 and 13 millions/s on this device (32 shaders, 4 SMs) - depending a lot on the fractal that is rendered.
UPDATE2: an nVidia 8800GT currently reaches ~60 Mio points/sec.

Christian

What I was suggesting was using CUDA to generate an array of positions (float4s), and then using OpenGL interop to render these as points (similar to the simpleGL sample in the SDK). Doing additive blending to a floating point frame buffer is fast these days (probably faster than atomics).

Keldor - I like the idea of just using regular global writes and ignoring the collisions - only the strong survive!

I should probably note that I have yet to get my second method to work well. In fact, my tests show it performing at half the speed of the first method. This might be because of the compiler not being able to optimize the casts and index operations needed to have four threads each take a component of a float4.

In any case, definitely try the brute force method of each thread writing its own points directly to memory, with no sorting, staging, or otherwise before attempting anything else.

I’ve gotten greater than 1Bio iterations/sec with flam4 on my GTX295+8800GTX system. 750Mio iterations/sec is a more typical number, though. Note that flam4 uses plain ol’ method 1.

I am attaching my first CUDA port of Scott Draves’ Fractal Flames which he published in 1994. The original “reference” C implementation is archived here, but it is a bit tricky to compile because it’s old: http://flam3.com/old-releases/original/FLAME_TAR.GZ .My version suffers from the sorting bottleneck still, but achieves a 3x speed-up over CPU processing on my 9600M GT GPU equipped laptop.

It should build on the Linux and Mac CUDA SDKs just fine inside the regular project folder hierarchy with the regular SDK’s makefile.

Usage:

…/…/bin/linux/release/hqi <all.place

and watch it generate 206 beautiful ppm files at 1920x1200 resolution within a few minutes to half an hour (depending on your GPU’s speed)

The graphics settings are read from a file called “nice.template”. Edit this file to change the image resolution and quality setting.

Use tools like e.g. gwenview on Linux to watch these ppm images or ToyViewer on Mac.

Christian

I tried hacking with the simpleGL sample, as you suggested. I tried a VBO containing 3 floats for coordinates per vertex followed by 4 unsigned chars for the color. My CUDA kernel generated all these with a make_float4() call, where the last element provided color through an __int_as_float() call. Still nicely coalesced during writes.

But sorry to be a party pooper, but with a mesh size of 2048 x 2048 CUDA spends so much time in the cudaGLMapBufferObject() and cudaGLUnmapBufferObject() API calls that my frame rate drops to just 6fps. This about means a throughput of 25 millions points per second only.

When deactivating these two API calls except in one frame out of 100, and by letting the CUDA kernel dump its data to a normally allocated region of device memory instead, performance climbs to 35 FPS, which means a throughput of ~150 Mio points (and I am still rendering all these points in every frame!). But of course I need interoperability in every frame - so it’s a dead end.

So I believe you when you say the rasterizer is fast, but the current CUDA OpenGL interoperability APIs in the CUDA 2.3 drivers are pretty uninteresting from a performance standpoint. I don’t want to replace my sorting bottleneck with an OpenGL VBO interoperability bottleneck.

All of this was tested with OpenSuse 11.1 on 32 bit, 9600M GT

Christian

That seems slow. Are you sure you are calling cudaGLSetGLDevice()? If you don’t do this interop will fall back to a slow path.

Also, if you CUDA kernel is always writing to the whole buffer (and not reading), you should set the flags like this for best performance:

cudaGLSetBufferObjectMapFlags(bufObj, cudaGLMapFlagsWriteDiscard);

The above flag helped a little. 6 FPS to 9 FPS.

You know what really helps?

http://forums.nvidia.com/index.php?showtopic=60756

Disabling dual monitor output.

40 FPS with interop

55 FPS without interop.

According to this I should be able to achieve a points throughput of up to 167 million per second on this card. Now we’re talking.

I still need to test performance with blending into a floating point buffer instead of simple rendering to screen with a depth buffer.

This is also what I ended up doing for Fractron 9000. I tried using atomics, but noticed no quality improvement in the output so I just went with simple writes. I’ve toyed with the idea of somehow storing and sorting the points for coalesced writes, but it doesn’t seem worth all of the overhead. But who knows, maybe there are some tricks I haven’t thought of.

Ooooh, I didn’t think of that. I’ll have to try this with Fractron 9000 and see if it speeds things up at all.

If you can get it to work, let me know. I haven’t had any success, though that might be due to the compiler/optimizer not liking the pointer arithmetic.