Sorting bottleneck in rendering fractals a call for ideas

cbuchner1 · November 11, 2009, 2:11pm

Hi everybody,

just a little call for ideas. Just for fun I am trying to accelerate fractal rendering with the GPU. The fractals are generated using a random descent into an iterated function system (playing “the chaos game”).

This rendering consists of three phases:

Iteration phase:
Based on an iterative random process I build a list of screen coordinates and resulting color contributions. Each x/y coordinate is assigned a “bin” identifier (which simply represents a pixel). These screen coordinates are badly scattered due to the random method of their generation
Sorting phase:
Sort the color contributions by bin (pixel) identifier. I am using the fast radix sort from the SmokeParticles demo.
Color accumulation phase:
For each screen pixel, go through all bin entries that correspond to this pixel, and add the color contributions. To find the first entry for a pixel, I use a fast binary search. Then I write the result into the frame buffer.

Phase 1) is fast and writes the pixel log using fully coalesced memory access.

Phase 2) seems to be the major bottleneck. Consumes 90% of the entire time.

For Phase 3) to be fast, I need the color contributions to be sorted by screen pixel, so I can accumulate them up in shared memory and use coalesced writes for the results. Without doing any sorting, I would have major performance issues because the writes would be scattered all over the frame buffer, I would have to use atomic writes to prevent collisions, etc…

Can you think of any other way, to perform some kind of pre-binning of the pixels in phase 1) already, without breaking memory coalescing during the writes? I would really like to shorten the huge sorting time of phase 2)

Christian

jack · November 11, 2009, 2:14pm

I don’t know if the SmokeParticles demo has been updated or not, but I believe that the sorting methods in the latest version of CUDPP are (currently) the fastest available. Perhaps you could try using them to see if it speeds up “Phase 2” at all.

cbuchner1 · November 11, 2009, 3:16pm

Well it shouldn’t matter much if I used thrust::sort, CUDPP V1.1 sort or the radix sort from SmokeParticles. They’re all good radix sort implementations I believe. Effectively I am sorting a list of unsigned 32 bit integers. I squeezed my bin number / color entries into single 32 bit integers to save bandwidth. The bandwidth is the limiting factor in sorting performance and can’t be increased other than by upgrading to a faster CUDA device.

I have added some strategic cudaThreadSynchronize() calls to my timing. Before that I was judging performance solely by relying on fprintf(stderr, “…”) output, which was misleading.

On my nVidia GT 220 card

Phase 1 typically takes 70-100 ms

Phase 2 typically takes 350 ms (sorting about 44 million elements per second)

Phase 3 typically takes 75-200 ms, depending on the “local concentration” of the fractal shape.

So it is not as bad as I originally made it sound. Sorting accounts for maybe one half to two thirds of the entire algorithm run time.

The whole thing is repeated a couple of times before the fractal achieves good enough quality (i.e. low noise) to be finally rendered to screen. The whole fractal takes about 5 seconds to render completely at a 1280x800 screen resolution (2x spatial oversampling applied on each axis). I was hoping for a little more speed.

Christian

Qazax · November 11, 2009, 3:34pm

you guys seen fractron 9000?

cbuchner1 · November 11, 2009, 3:43pm

Yes. Seen it. Excellent. Want to program my own stuff though. I am a programmer, not a user. (why does that remind me of the Tron movie?)

Also this tool is wasting time by updating the screen while rendering - and it also takes a decent amount of time to render a noise-less fractal at high resolution.

The editing features are top notch though.

Simon_Green · November 11, 2009, 6:30pm

Why wouldn’t you use the graphics API for rendering the points? It’s hard to beat the hardware rasterizer for scattering into a 2D buffer!

cbuchner1 · November 11, 2009, 7:42pm

Umm until now I thought the rasterizer is only good at gather operations (texture reads, dependent texture reads, etc…), but it always has to write to the output pixel given by the rasterizer (admittedly my knowledge stops at the DirectX9 level). I thought CUDA intruduced the brave new world of scatter.

UPDATE:

I did a little research and found an AMD paper describing a histogram algorithm running on the GPU that is based on a scattered write of point primitives using a vertex shader. The vertex shader determines the target pixel for the write, which can for example be based on a texture read from another source image (the source image from which to take the histogram).

http://developer.amd.com/media/gpu_assets/…on_preprint.pdf

This could be a way to go, but I am sceptical whether this would beat CUDA performance. For my destination buffer, I would have to use additive blending, for example by means of a pixel shader.

Christian

cbuchner1 · November 11, 2009, 10:35pm

Okay, I’ve tried all three radix sorts I was able to get my hands on. I am sorting keys only (no values), 32 bit uints.

For tests with CUDPP I’ve replaced the old CUDPP headers and .a library files in the CUDA SDK 2.3 with the CUDPP V1.1 release version.

Numbers taken on nVidia 9600M GT (32 shaders, 4 SMs)

Note how CUDPP radix sort is faster when built in Debug mode (make dbg=1). Peculiar, huuuh?

cudpp sort (release):		   665.596008 ms

cudpp sort (debug):			 564.648010 ms (numbers fluctuating between this and 620 ms)

thrust::sort:				  564.656006 ms

SmokeParticles radix sort:	 666.138000 ms

in comparison the other kernels

iterate_kernel: 97.428001 ms

merge_kernel: 204.858994 ms

I actually believe the SmokeParticles radix sort made it into CUDPP V1.1 with little or no changes because I had various symbol name collisions linking against both ;)

The winner is: Thrust V1.1!

It also has the simplest API to use. A one-liner, as opposed to plan generation or class instantiation for the other two APIs:

thrust::sort(thrust::device_pointer_cast(d_elements), thrust::device_pointer_cast(d_elements) + num_elements);

Keldor314 · November 12, 2009, 12:11am

For flam4, I just used unsorted random writes without using atomics. Since there generally millions of pixels in an image, the frequency of collisions is low enough that there is no visible error due to collisions - using atomics to ensure that there are no collisions is serious overkill, particularly since the worst possible effect of a collision is that a given point would simply end up not added to the accumulation buffer.

In order to get any appreciable coherency in writing the pixels in a sorted fashion, you’d have to store tens or hundreds of thousands of points at a time, which will certainly end up off-chip. Even then, you’d have to do some magic to make sure that you wrote to a set of consecutive bins, which would end up losing coherency any time the bins did not have identical numbers of points in each one. This ends up causing you to use up to 3x more bandwidth, since you have to write out coordinates, read them back, and then write out points. In addition, the computation and the memory access are now serialized - the computation won’t overlap with the memory transaction. On top of this, the sort itself adds another log(numPoints) memory accesses for each point. Quite frankly, just accepting the incoherence will probably give better peformance than trying to do any sorting.

One thing that you can do to increase coherence with random points is to write the point’s contribution to shared memory, then have four threads (one per channel) write the point to memory. This would give you 4x better coherence than having each thread write its own points independently.

cbuchner1 · November 12, 2009, 12:12am

Ha! and that simplicity just bit me. The thrust API does not easily allow me to specify the number of significant bits in that are supposed to be used in the radix sort.

By dropping to 24 significant bits for sorting, I was able to get the other two sorting APIs down to ~450 ms of time because there is no need to also include the color map index (8 bits) in the sort.

Keldor314 · November 12, 2009, 12:21am

To put it simply, let’s look at a half warp of points (16) and see how many transactions it would take to store them with either method:

1: Writing directly to memory, no sorting: 16 read/writes
2: Staging to shared memory, and writing with different threads per color channel, still no sorting: 4 read/writes
3: Storing points: 1 read/write to point list, several read/writes during sorting, at least 1 read/write during accumulation

It should be pretty clear that approach 2 wins over all. In fact, if the kernal is computation bound (as is flam4) then choice 3 would always come in dead last since the memory access and the computation can’t be done in parallel.

cbuchner1 · November 12, 2009, 12:37am

Thanks keldor for your postings, I will think about them tomorrow as it is pretty late here.

I am thinking about something like a hybrid approach where I only sort very coarsely into different screen tiles (possibly in the iterate kernel already). And then within each screen tile I would use shared mem atomics (on compute 1.2 devices) to avoid any collisions during the accumulation.

Here’s another person’s view on the topic:
[url=“http://strobe.cc/articles/do_androids_render”]http://strobe.cc/articles/do_androids_render[/url]

bye, good night.

UPDATE: by the way the timings allow me to compute a points throughput per second for the three kernels combined, which currently puts me around between 8 and 13 millions/s on this device (32 shaders, 4 SMs) - depending a lot on the fractal that is rendered.
UPDATE2: an nVidia 8800GT currently reaches ~60 Mio points/sec.

Christian

Simon_Green · November 12, 2009, 10:38am

What I was suggesting was using CUDA to generate an array of positions (float4s), and then using OpenGL interop to render these as points (similar to the simpleGL sample in the SDK). Doing additive blending to a floating point frame buffer is fast these days (probably faster than atomics).

Keldor - I like the idea of just using regular global writes and ignoring the collisions - only the strong survive!

Keldor314 · November 12, 2009, 2:01pm

I should probably note that I have yet to get my second method to work well. In fact, my tests show it performing at half the speed of the first method. This might be because of the compiler not being able to optimize the casts and index operations needed to have four threads each take a component of a float4.

In any case, definitely try the brute force method of each thread writing its own points directly to memory, with no sorting, staging, or otherwise before attempting anything else.

I’ve gotten greater than 1Bio iterations/sec with flam4 on my GTX295+8800GTX system. 750Mio iterations/sec is a more typical number, though. Note that flam4 uses plain ol’ method 1.

cbuchner1 · November 13, 2009, 7:54pm

I am attaching my first CUDA port of Scott Draves’ Fractal Flames which he published in 1994. The original “reference” C implementation is archived here, but it is a bit tricky to compile because it’s old: http://flam3.com/old-releases/original/FLAME_TAR.GZ .My version suffers from the sorting bottleneck still, but achieves a 3x speed-up over CPU processing on my 9600M GT GPU equipped laptop.

It should build on the Linux and Mac CUDA SDKs just fine inside the regular project folder hierarchy with the regular SDK’s makefile.

Usage:

…/…/bin/linux/release/hqi <all.place

and watch it generate 206 beautiful ppm files at 1920x1200 resolution within a few minutes to half an hour (depending on your GPU’s speed)

The graphics settings are read from a file called “nice.template”. Edit this file to change the image resolution and quality setting.

Use tools like e.g. gwenview on Linux to watch these ppm images or ToyViewer on Mac.

Christian

cbuchner1 · November 14, 2009, 11:25pm

I tried hacking with the simpleGL sample, as you suggested. I tried a VBO containing 3 floats for coordinates per vertex followed by 4 unsigned chars for the color. My CUDA kernel generated all these with a make_float4() call, where the last element provided color through an __int_as_float() call. Still nicely coalesced during writes.

But sorry to be a party pooper, but with a mesh size of 2048 x 2048 CUDA spends so much time in the cudaGLMapBufferObject() and cudaGLUnmapBufferObject() API calls that my frame rate drops to just 6fps. This about means a throughput of 25 millions points per second only.

When deactivating these two API calls except in one frame out of 100, and by letting the CUDA kernel dump its data to a normally allocated region of device memory instead, performance climbs to 35 FPS, which means a throughput of ~150 Mio points (and I am still rendering all these points in every frame!). But of course I need interoperability in every frame - so it’s a dead end.

So I believe you when you say the rasterizer is fast, but the current CUDA OpenGL interoperability APIs in the CUDA 2.3 drivers are pretty uninteresting from a performance standpoint. I don’t want to replace my sorting bottleneck with an OpenGL VBO interoperability bottleneck.

All of this was tested with OpenSuse 11.1 on 32 bit, 9600M GT

Christian

Simon_Green · November 15, 2009, 1:53pm

That seems slow. Are you sure you are calling cudaGLSetGLDevice()? If you don’t do this interop will fall back to a slow path.

Also, if you CUDA kernel is always writing to the whole buffer (and not reading), you should set the flags like this for best performance:

cudaGLSetBufferObjectMapFlags(bufObj, cudaGLMapFlagsWriteDiscard);

cbuchner1 · November 15, 2009, 4:57pm

The above flag helped a little. 6 FPS to 9 FPS.

You know what really helps?

http://forums.nvidia.com/index.php?showtopic=60756

Disabling dual monitor output.

40 FPS with interop

55 FPS without interop.

According to this I should be able to achieve a points throughput of up to 167 million per second on this card. Now we’re talking.

I still need to test performance with blending into a floating point buffer instead of simple rendering to screen with a depth buffer.

mthiesen · November 17, 2009, 12:07pm

This is also what I ended up doing for Fractron 9000. I tried using atomics, but noticed no quality improvement in the output so I just went with simple writes. I’ve toyed with the idea of somehow storing and sorting the points for coalesced writes, but it doesn’t seem worth all of the overhead. But who knows, maybe there are some tricks I haven’t thought of.

Ooooh, I didn’t think of that. I’ll have to try this with Fractron 9000 and see if it speeds things up at all.

Keldor314 · November 18, 2009, 7:33am

If you can get it to work, let me know. I haven’t had any success, though that might be due to the compiler/optimizer not liking the pointer arithmetic.

Topic		Replies	Views
Point Sorting into Cells CUDA Programming and Performance	9	3475	June 23, 2009
Code: efficient sorting with CUDA CUDA Programming and Performance	40	61946	December 20, 2011
bucket sort on GPU Any ideas? CUDA Programming and Performance	25	19642	November 15, 2020
Rasterization strategy CUDA Programming and Performance	14	9728	January 11, 2011
SRTS Radix Sort: High Performance and Scalable GPU Radix Sorting CUDA Programming and Performance	58	19262	May 27, 2013
Median Filter (time dimension) for images (bad performance, no memory coalescence CUDA Programming and Performance	19	6329	November 5, 2024
My CUDA program fails for larger grids CUDA Programming and Performance	11	1590	October 11, 2010
Sorting in CUDA Is sorting in CUDA worth the trouble? CUDA Programming and Performance	15	8325	September 30, 2009
Sort on GPU Need some help to use sorts... CUDA Programming and Performance	15	42075	June 19, 2008
Fastest sorting algorithm on GPU currently CUDA Programming and Performance	20	45969	May 2, 2021

Sorting bottleneck in rendering fractals a call for ideas

Related topics