Point Sorting into Cells

I am trying to create a rasterizer in CUDA: now, i am using a straigth algorithm where each triangle has a gpu thread where it draws its pixels.
Obviously this kills any coalescing, leads to overdraw, and has many write conflicts where a pixel belongs to more than one polygon…

So i was trying to flip the algorithm, so that each pixel finds its containing triangle:
one would need to sort the triangle into an array of cells that subdivides the screen, and then to depth-sort the triangles in each cell.
This way each pixel can retrieve a small ordered set of triangles (hopefully) fast and not memory-intensive to traverse.

Do you know some good examples of how to have a (fast as possible) spatial sort like this?
Each tile could contain from one to all the triangles, and from what i know GPUs don’t like linked memory…
I’ll look in the n-body and particles demos now, but I wanted to have some ideas from more experienced peoples :thumbup:

Thanks for any inputs! (and sorry for the bad english :rolleyes: )

The particles sample in the SDK shows how to to implement a simple uniform grid in CUDA. The basic idea is to sort the items based on which grid cell they are in (using radix sort), and then find the start of each cell by comparing the cell ids of each item with its neighbour in the sorted list.

This new paper from the High Performance Graphics conference extends this idea to build 3D grids where triangles can cover more than one grid cell for ray tracing:

Hope this helps!

Yes i saw the radixsort.cu file yesterday, but i think it’s really too complex for me, i wouldn’t know where to start to use it as i want… also it’s huge, it’s longer than 200 lines of pure CUDA…

I think i need something much simpler: it is easier than the Particle demo because the grid has fixed bounds and resolution, as it is 2D and built in Screen Space.

My problem here is dynamic memory and concurrent access, because it’s hard to determine where a triangle has to be written without knowing how many they will be in the same tile.

Anyway thanks for the papers, they look just like the explanation that i need :thumbup:

There is an alternate method to the sorting. In my applications (3D particles in fixed dimension with a relatively uniform density), it is faster than the sorting method. You can find working testbench code from my open source project here: http://trac2.assembla.com/hoomd/browser/tr…/gpu_binning.cu

  1. rebin_simple_kernel simply does an atomicAdd for each particle, placing it in the cell. In my tests, it is no faster than copying the particles to the CPU, putting them in cells and then copying the cells back. YMMV. It is extremely simple to code, but the colliding atomicAdd’s are what limit the performance of this method. Randomly placed particles actually perform better with this method than particles with some amount of data locality. This is because particles that end up int he same cell are virtually guaranteed to have a colliding atomicAdd if those particles are processed by the same block.

  2. rebin_simple_sort_kernel uses the sort idea from the global sort method, but only does it on a per block level. Then with a complicated series of scans and stream compaction type operations, only one atomicAdd is done for each set of particles belonging to the same cell in that block.

Because method 2) only has an advantage over method 1) when particles being processed by the same block end up in the same cell, it degenerates into method 1) for randomly placed particles. In my application, I periodically sort the particles on an Hilbert curve so that particles near each other in space are near each other in memory. In this situation: rebin_simple_sort_kernel is about 4-5x faster than doing the binning on the CPU (even including the memory transfers).

Unfortunately, I have yet to come up with any better methods yet :( I’m OK with a 4-5x speedup, but I’d really like to see at least 30.

Edit: 2D vs 3D doesn’t really change anything, just the way the data structures are addressed and the way the cell index is calculated.

Very interesting… anyway i did some benchmarks on Particles.exe and came out that the radixsort actually takes 40-50% of a frame, so it would be really too slow for me.

The rasterizing currently takes 60% of the frame, if we count also the other passes (zsorting/selecting/interpolating) it could as well be slower.

The paper that Simon Green linked looks good but the least binning time is 11 milliseconds for them on a GTX280, so obviously it’s too slow for any less-than-optimal GPU.

The rebin_simple_kernel is really interesting, it’s strange that it becomes so slow…

does the atomic operation serialize the warp?

Intuitively it should be really fast as it’s coalesced reads + some int operations + scattered write…

EDIT: maybe you could split the sort in 3 passes to avoid the atomic operation:

-first you write in each particle its cell (coalesced write + read, no more than 80ms in my case)

-then you “tell” each particle its position into the cell

-then you write each particle into its final position (coalesced read + write (on > 1.1), also fast)

The hard thing is the second pass, as each result depends on a previous result… it could be like a multiple reduction?

How many particles are you looking at binning? My default benchmark runs 64000 particles at a number density of 0.4 with cell widths of 3.8.

Benchmarks from the code I linked on Tesla S1070 (just one GPU):

with the data-locality sort:

Host				 : 1.149482 ms

Host w/device memcpy: 2.440125 ms

GPU/simple		  : 1.938408 ms

GPU/simple/sort/ 64	  : 0.441432 ms

with completely randomly placed particles

Host			  : 1.200956 ms

Host w/device memcpy: 2.491269 ms

GPU/simple		  : 0.489611 ms

GPU/simple/sort/ 64	  : 0.679762 ms

So the rebin_simple_kernel is actually not that bad with randomly placed particles. It bins 64,000 particles in just ~0.5 ms. The problem for me comes when the particles are sorted for data-locality when that time jumps up to ~2 ms (almost as long as it would take to do on the host, including memcpy times). This increase has to be due to the increase in collisions of the number of threads trying to atomicAdd the same location. I don’t know exactly how the hardware handles it, but whatever happens has to serialize the execution to some degree.

I think the slow overall performance comes from the lack of coalesced writes more so than the atomicAdd performance. I work it out to be only about an effective 3.4 GiB/s.

Indeed. I’ve thought along these lines, but never fully worked out the details or came up with a simple way to program it.

Uhm, because i’m making a rasterizer, a maximum polycount would be around 150.000 triangles (now my stress-test contains 345000), that is not much compared to newer games.

Now i’m using as a test a 69000 polys model, that is binned with your modified code in 19000 microseconds, according to the cudaprofiler…
that leads to a framerate of 21 fps against the old framerate of 32 fps… and i’m not counting the real rasterizing.

I set it up with 4096 cells with a side of 16 pixels, that can contain 64 indexes (exactly 1 mb VRAM)…
maybe as you say the simple sort is not much suited because a polygon mesh has highly localized data.

I made all those tests on a crappy 8600 GT (and win7)… maybe is this card that can’t manage well newer CUDA functions?


I think I have similiar problem. I am doing particle simulation of the polutant cloud and need to handle with different meteorological parameters in every point in time and 3D space for every particle :(
This leads to many uncoalesced reads and writes because plenty of particles randomly access the global memory where meteo is stored or access the same place.

Then I used the sorting trick from SDK particles - sorting by grid hash (based on my meteo grid) and doing some scans so I am able to group together particles with the same meteo params (or at least the same tile in the meteo grid). Then I am loading the meteo params to the shared memory and doing the interpolation there. It increases the speed a litlle bit comparing the firs approach - 2x - 2.5x. But the botlleneck I see is the same - still a lot of uncoalesced reads and writes beacause the particles position, velocities and other properties are now unordered from the point of the group of the same meteo tile (I do not sort whole particle structure as this will be quite expensive I think, only particle indexes are sorted together with the meteo tile hashes).

Do you have some idea how to avoid the huge uncoalecsed RWs?

Just for info - I use 1M of particles for now but probably ther will be more (10M?).


Hmm. The 8600 GT only has 22 GiB/s of memory bandwidth. Just assuming a memory limited computation, that would make your test 7.27 times faster on the top of the line GTX 285, or it would probably run in under 2.6ms. That is pretty close to in line with the default 3D benchmarks of a similar particle number.

You are almost there. The particle sort is the really important thing. But to get the most benefit from it, you should read your particles by binding the data to a 1D texture and reading with tex1Dfetch.

Also, it is not critical, but a Hilbert curve sort of the particles leads to a few percent more performance than sorting by grid index alone. If you want more details, see the paper we published on MD simulations: http://www.ameslab.gov/hoomd/about.html#paper. Although, the hilbert curve sort is only a small section in the large paper.

In the end i think that i will just scatter as little information possible in each pixel ( triangle index, barycentric coords in triangle) to then delegate interpolation and the bulk of writes (normal, position, depth, velocity, etc etc) to a “pixel shader”.

In the end the whole thing wasn’t slow, i forget to align buffers… who knew that you have to specify it manually also if the size is 8 or 16 >.<
Now the brute-force rasterizer runs up to 70 FPS… i should retry also your rebin, maybe it has the same speedup.