Rasterization strategy

I have a project which needs to visualize several billion particles. The simulation and generation of them all can be done in CUDA.
But I want to RASTERIZE those particles… they’ll end up being disks (not points) with 3D position and depth as viewed from a camera.
What’s the best strategy for this kind of visualization?
I assume that using the GPU's own rasterizer would be most efficient, so I could generate lists of particles in CUDA, export those to an OpenGL context, rasterize them in OpenGL, then go back to CUDA for the next load of particles.
This is roughly the strategy used by the Fluids and n-body SDK examples, although they deal with mere millions and can do them all in one batch.

But would it be effective to do billions of particles this way? I'd probably have to flip-flop back and forth between OpenGL and CUDA and do it in batches; would that cause any problems?

Is there any problem doing this on a card NOT being used for display? Especially a Tesla, or maybe a “spare” GPU in a machine.

I don't expect any of this to be real time, since we're really talking about brute-forcing 50B+ 3D spheres, but that's fine, it's not meant to be interactive. (Yes, I also realize that with that many particles there's going to be intense occlusion etc.; that's OK, I can work on optimization and culling later, but I want a brute-force solution first as a baseline.)

Are there other approaches to rasterization, especially any other methods except OpenGL/Direct3D? The other obvious idea is in CUDA itself but that is likely not nearly as efficient as using the rasterization hardware.

Thanks for advice and pointers. I’m more of a science guy than an OpenGL hacker so the rasterization is all new to me.

With that number of particles? Ray-trace them. Once you get above a few hundred million primitives, ray-tracing tends to become faster than straight rasterization. Also, spheres have very simple collision tests. Besides, the CUDA-OpenGL interop is very much lacking in performance. You’d probably get best results rendering in CUDA and then just binding the result into a texture. This is especially true if you’re thinking about streaming the particles over to a different GPU! There just isn’t enough bandwidth to do that and utilize anywhere near the second card’s peak performance.

OpenGL? Ray-tracing??

Nah, nah. If all you want to do is draw circles, do it from CUDA itself. Use smem as the framebuffer, and you’ll get incredible performance. (Honestly, way better than going through a 3D API. I mean, how are you gonna do it from there? Full-body 3D spheres with a mesh? Fuggedaboutit. Simple sprites? Ok, but then you’ll still be fetching textures and writing out to the real global-mem buffer. I think a single-minded circle-drawing kernel could beat the usual methods handily.)

P.S. Whether Tesla or a non-active GPU allows OpenGL/DirectX is a good question. I think the answer is no.

The reason I mentioned ray tracing was because of the line where he mentioned rendering 50 billion + primitives. With a number like that…

If I understand the problem, raytracing would be pointless because the spheres are being created as a stream so it’s not like you’d ever store them all in memory. And also you’re only projecting them onto the screen once, so building some ray acceleration structure would probably cost way, way, way more than just rasterizing them.

Rasterizing in CUDA would work, but boy it’d probably suck. You’d have to use slow global memory writes, but even THOSE would be wrong because of thread collisions (what if two threads are both writing to the same pixel?).

Ouch, a tough problem. I'd expect using the GPU's rasterization hardware would be best; that's EXACTLY what it's designed for, streaming primitives onto a projection plane and compositing them. It'd be easy enough to try with an OpenGL shared buffer, but I guess that's what the original question was about, and it's interesting whether that's really the best way to do it.

Not sure of the exact case here, but if you can implement a crude depth buffer in CUDA (maybe use blocks as screen tiles), you can cull a lot of your input (quite a lot if the spheres have any non-trivial screen size) before you send it to OpenGL. I think such shortcuts are not mere optimizations; they are important decisions for your application to work at all. As a very simple example, you can trivially reject any sphere that is completely hidden behind another (assuming nothing is transparent). A successful combination of such strategies should prune down your input to the rasterizer enormously. Once that is done, you can take your pick of rasterizing in CUDA or OpenGL.
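
To make that kind of trivial rejection concrete, here is a minimal sketch of the "completely hidden behind another sphere" test, assuming opaque spheres that have already been projected to screen space; the struct and function names are hypothetical.

[codebox]
// Hypothetical sketch: is sphere 'back' entirely covered by the closer sphere 'front'?
// Assumes opaque spheres already projected to screen space (center, visual radius, depth).
struct ScreenSphere {
    float x, y;   // projected center, in pixels
    float r;      // projected (visual) radius, in pixels
    float z;      // depth from the camera
};

__host__ __device__ inline bool fullyOccluded(const ScreenSphere& front, const ScreenSphere& back)
{
    if (front.z >= back.z) return false;          // 'front' must be closer to the camera
    float dx = back.x - front.x;
    float dy = back.y - front.y;
    float dist = sqrtf(dx * dx + dy * dy);
    return dist + back.r <= front.r;              // back's disk lies entirely inside front's disk
}
[/codebox]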

As I understand it, CUDA-OpenGL interop can only get faster, and you can reasonably assume its overhead will approach negligible with hardware and driver refreshes. But in my experience of the current situation, if you use large VBOs you run the risk of throttling performance.

Good luck!

But doing a depth-only pass is a virtually identical procedure to doing a full rasterization. I can't think of an elegant way to do it using simple math alone. Sure, it's easy to test whether one circle completely overlaps another, but checking each new circle against the ten thousand others you already know are onscreen wouldn't be too hot (at least on the GPU; a serial algorithm might do OK with an appropriate acceleration structure).

Well, here is one way to do simple rasterization: use shared memory as the framebuffer, encode your depth and color value into an integer (upper 16-24 bits are depth, lower bits are color), and use atomicMax() (G200 only) to do both the depth test and the write into the buffer. Each block will have to re-read all the vertices, which is unfortunate (maybe cmem could be used?), but overall this operation will be very fast.
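
As a rough illustration of that packed depth+color trick, here is a minimal sketch with hypothetical names; the depth is inverted before packing so that atomicMax keeps the nearest fragment, and the shared-memory atomics need the GT200-class hardware mentioned above.

[codebox]
// Minimal sketch of the packed depth+color atomicMax idea (all names hypothetical).
#define TILE_PIXELS (32 * 32)

// Pack a normalized depth d in [0,1] and a 16-bit color into one 32-bit word.
// The depth is inverted so that a LARGER packed value means a CLOSER fragment.
__device__ inline unsigned int packDepthColor(float d, unsigned int color16)
{
    unsigned int invDepth = 0xFFFFu - (unsigned int)(d * 65535.0f);
    return (invDepth << 16) | (color16 & 0xFFFFu);
}

__global__ void rasterizeTile(const float4* particles, int count, unsigned int* frame)
{
    __shared__ unsigned int tile[TILE_PIXELS];        // this block's packed RGBZ "framebuffer"

    int p = threadIdx.y * blockDim.x + threadIdx.x;   // this thread's pixel within the 32x32 tile
    tile[p] = 0;                                      // background: farthest depth, color 0
    __syncthreads();

    // Every block re-reads the whole particle list and splats into its shared tile.
    for (int i = p; i < count; i += blockDim.x * blockDim.y) {
        float4 part = particles[i];                   // x, y in tile space; z = depth; w = color index
        int px = (int)part.x, py = (int)part.y;       // 1-pixel particles, for brevity
        if (px >= 0 && px < 32 && py >= 0 && py < 32)
            atomicMax(&tile[py * 32 + px], packDepthColor(part.z, (unsigned int)part.w));
    }

    __syncthreads();
    frame[blockIdx.x * TILE_PIXELS + p] = tile[p];    // copy the finished tile to global memory
}
[/codebox]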

But here's what I thought of when thinking about this: what do you expect to actually SEE? It will be a complete mess to have a billion spheres onscreen; are you sure you're going to get something out of it?

Alex is right.

To get the best performance on CUDA hardware you must look at this as a problem of sorting in two ways.
Suppose your final rendered picture is divided into squares (or rectangles, it doesn't matter), where each rectangle is rendered by a single "unit". I say unit because it can be a single thread, warp, or block, depending on the size of your output image as well as the number of GPUs you are using.

Imagine those squares in 3D: all of them are actually scaled boxes in 3D, each having its own depth.
Now you can provide a sorting algorithm which sorts your particle position array into many (one per rectangle) smaller arrays, where each one represents one particular 3D box. After that, every array contains only the particles from a single box. That is the first sort.
When testing whether a particle is inside a box, you should take care of the following:

  1. If your particle is just pixel-sized, use a standard test against the box bounds.
  2. If your particle is a sphere or sprite (meaning it has width and/or height), then its radius must be taken into account, because a sphere whose center belongs to the first box can overlap into a second box by its radius. In that situation both boxes (arrays) MUST contain it. When I say radius, I mean the visible radius, not the real one. Suppose all your spheres have radius 5; after projection some appear larger and some smaller on screen depending on their distance from the camera. That radius (let's call it VisualRadius) must be calculated and added to the sphere position before TestInsideBox is performed (see the sketch after this list).
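
A minimal sketch of the TestInsideBox idea from point 2, with hypothetical names; it assumes the particle center has already been projected to screen space and its VisualRadius computed.

[codebox]
// Hypothetical sketch: does a projected sphere overlap a given screen rectangle?
struct ScreenTile { float xmin, ymin, xmax, ymax; };

__host__ __device__ inline bool TestInsideBox(float cx, float cy, float visualRadius,
                                              const ScreenTile& box)
{
    // Grow the rectangle by the visual radius: a sphere whose center lies in one
    // rectangle can still spill into a neighboring one, so BOTH must receive it.
    return cx >= box.xmin - visualRadius && cx <= box.xmax + visualRadius &&
           cy >= box.ymin - visualRadius && cy <= box.ymax + visualRadius;
}
[/codebox]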

After that you should do a second sort within each box, where particles are sorted by depth (distance from the camera). To avoid expensive space-transformation calculations you can provide your original particle array already defined in camera space; in that case your second sort is a simple sort by pos.z value.
Now rendering is simple even if your particles are transparent, and you can do it in two ways. Which is better depends on your particles.
Newbies may not see the difference between them on CUDA, but the difference exists, so to make it more obvious I will explain it with a serial algorithm.
The first would be a loop over particles starting with the farthest, and inside that loop a drawing loop (it doesn't matter whether it is hardware or software; from the point of view of the algorithm the second loop exists), i.e. for a sprite, passing through each pixel of the sprite and drawing it.
This is called rasterization.

The second algorithm would be one loop over the pixels on screen, with an inner loop over the particles (for each pixel). Call it "ray tracing" (in quotes because real ray tracing involves many more calculations and tests, ray bouncing, etc.; this simple version only traces a single primary ray from the camera per pixel).

A software implementation of the two algorithms produces the same image in both cases, BUT if execution were interrupted at each step of the outer loop you would see the difference. After a single iteration of the outer loop, the first algorithm would show the farthest sphere on the image, while the second would show only the top-left pixel shaded.

To achieve the best performance on a specific architecture you must take into account architecture specifics like shared memory size, level of parallelism, coalesced memory access, etc.
Implementations of these algorithms in CUDA would be:

The first one would be to draw every particle starting from the farthest. Drawing should be performed in shared memory in such a way that all threads are actually doing the drawing! For example, suppose you have spheres, which will be drawn as circles. After calculating the 2D position of the sphere's center inside the rectangular region (in which every pixel is handled by a single thread) and calculating its VisualRadius, every thread tests just its single pixel to "see" whether it is inside that circle or not, and shades that pixel. Simple. Also, if transparency is involved, then CurrentShadedPixelColor = OldColorInSharedMem * (1 - Blending) + Blending * ActualPixelColor, and everything is performed in shared memory and can be done without memory conflicts. Blending is your transparency factor. Fast.

Second algorithm: the difference is described above, and now you'll see why it is not obvious to a newbie. The outer loop is over pixels (actually there is no loop through pixels, because every pixel is handled by a single thread). In the inner loop, every thread iterates through the spheres starting from the farthest, calculates each sphere's VisualRadius (if it hasn't been calculated yet) and tests whether its pixel is inside that circle. If yes, it shades that pixel. This algorithm can also be performed without memory collisions, because every thread accesses the same sphere data in each iteration, so there are no conflicts.

When rendering is complete, just move your data from shared memory to texture data and display it with OpenGL.

Let me know if this helps.

Rasterization in CUDA is indeed possible, but as the good discussion above shows, it’s complex.

This problem is actually made a little harder because you have billions of spheres, and you just want a one-time projection.
Ignoring multi-GPU, this is going to be painfully memory bandwidth limited.
Let's say we have the particles in global memory. One rasterization strategy is that every block handles a small rectangle of the screen. Each block reads the whole particle list and uses shared memory as a small RGBZ buffer… something like a 32x32 tile. So for a 1Kx1K image, you'll need about 1000 blocks. This also means that each particle needs to be read by each block, so each particle is read 1000 times. If a particle is 16 bytes, then
1B * 16 * 1000 = 16 terabytes of bandwidth. Ugh. It'd work, but boy it's ugly!

So perhaps you could presort them… but now you have the classic histogram/binning/gridding problems since it’s pretty hard to partition a list into bins, mostly because it involves dynamic memory management in GPU runtime, unless you know some solid facts about the object distribution and limits. The main idea is that when a particle is generated, it’s sorted right then into the proper rasterization bin for the second pass. If you have 500 blocks generating particles, they’d each need 1000 bins so they don’t collide with other blocks writing… ugh! So memory handling is ugly. Global atomics could work to let blocks share bins, but those are painfully slow, and a billion atomic calls is likely an unpleasant overhead. That’s worth trying though, I just don’t expect it to be efficient. (It’s also simplest to code.)

So, make 1000 empty memory lists, one for each small tile of the screen. Each block generates its particles, decides which list to write the particle into. It uses global memory atomic increment to get a unique slot to write the particle into. Repeat.
Then start kernel #2 which rasterizes. Rasterize that block’s list into its private shared memory zbuffer. This will likely be painfully inefficient, since you can’t easily do more than one particle at a time (unless you tried shared memory atomics, per pixel? ouch!)
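
As a rough sketch of what that binning pass could look like, here is a hypothetical kernel in which each generated particle claims a unique slot in its screen tile's bin with one global atomicAdd; the bin capacity and layout are made-up placeholders.

[codebox]
// Hypothetical sketch of the binning pass: one global atomicAdd per particle claims a
// unique slot in the bin for the screen tile the particle falls in.
#define BIN_CAPACITY (1 << 16)   // arbitrary fixed capacity per bin; overflow is dropped here

__global__ void binParticles(const float4* particles, int count,
                             float4* bins, unsigned int* binCounts,
                             int tilesX, float tileSize)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;

    float4 p = particles[i];                              // x, y already projected to screen space
    int tx = (int)(p.x / tileSize);
    int ty = (int)(p.y / tileSize);
    int bin = ty * tilesX + tx;                           // which screen tile this particle hits

    unsigned int slot = atomicAdd(&binCounts[bin], 1u);   // claim a unique slot in that bin
    if (slot < BIN_CAPACITY)
        bins[bin * BIN_CAPACITY + slot] = p;
}
[/codebox]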

Wow, I see why it’s so appealing to just dump it all into the hardware rasterizer. That’s exactly what it’s designed for. The OpenGL method may not be efficient but I think it’s going to beat a CUDA rasterization core.

No, you do not need atomics!

Because each value is accessed and updated by only one and the same thread!

Each particle is read by one thread, that’s true.

But each thread is writing to the shared memory RGBZ tile, and those writes can collide. Ugh!

Shared atomics can help fix that. There may be other tricks too but it’s certainly not automatic or trivial.

In fact the subproblem of rasterization within a block into that block’s shared memory is fraught with problems all by itself!

The classic algorithm would be:

thread reads a particle.

For each pixel the particle covers, compute the particle's Z depth at that pixel.

If the particle’s Z is closer than the pixel’s Z, update that pixel’s RGBZ.

That last step is the nontrivial problem, because you need to atomically update FOUR values, so even an atomic shared exchange won't be sufficient.

Instead you'd likely assign mutexes (implemented either as atomic swaps of an "ownership" variable, or as a side-channel write followed by syncthreads, with the winner of the write owning the mutex). Wow, it's just ugly and painful!

Another method is crazy but would also work, and that'd be to double-render: the first pass is Z only, then render AGAIN and update the RGB if your (recomputed) Z matches the buffer. That avoids atomics and mutexes but requires two passes.

An amusing reference to the technique: http://jgt.akpeters.com/papers/HainesWorley96/.

Hmm, thinking about it, this second method isn't so crazy. You Z-buffer particles into the tile, writing Z only. If your write "wins", you remember the successful RGBZXY values in the thread's registers. Repeat for a dozen pixels or so (depending on your register count), then syncthreads and switch over to RGBZ writes from those stored registers, only where your stored Z matches the buffer's Z. This uses the Z value itself as the mutex voting variable, so the votes are all resolved efficiently and in parallel. Syncthreads again, and you can scan the next batch of incoming particles.
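
Here is a minimal sketch of one such batch, with hypothetical names, using the Z value itself as the vote: a plain-write depth pass that converges to the nearest Z per pixel, then a color pass where only the winning thread writes. Exact-Z ties could still collide, which is usually acceptable for visualization.

[codebox]
// Hypothetical sketch: two-pass splat of one particle-pixel per thread into a block's
// shared-memory tile, with no atomics or explicit mutexes. Every thread of the block
// must call this (it contains __syncthreads); zbuf/cbuf/changed live in shared memory.
__device__ void splatBatch(float myZ, unsigned int myColor, int myPixel,
                           float* zbuf, unsigned int* cbuf, int* changed)
{
    // Pass 1: depth only. Each round, threads whose Z is still closer than the buffer
    // rewrite it with a plain store; the buffered value can only decrease, so this converges.
    for (;;) {
        __syncthreads();
        if (threadIdx.x == 0 && threadIdx.y == 0) *changed = 0;
        __syncthreads();
        if (myZ < zbuf[myPixel]) { zbuf[myPixel] = myZ; *changed = 1; }
        __syncthreads();
        if (!*changed) break;              // all threads read the same flag, so no divergence
    }

    // Pass 2: color. Only the thread whose Z "won" its pixel writes the RGB.
    if (zbuf[myPixel] == myZ)
        cbuf[myPixel] = myColor;
    __syncthreads();                       // ready for the next batch of particles
}
[/codebox]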

Wow, such a simple problem has such implementation complexities!

You could probably just ignore the write conflicts entirely. Sure, you'd get some aliasing, but in general screen space is large enough to make collisions fairly rare. I know I've used this approach before with Monte Carlo-esque flame fractal rendering with no visible problems, though that just uses straightforward additive blending. Still, I doubt many people could look at a field of particles and notice that there is indeed some RGBZ-fighting going on.

haha, writes can collide only when my son learns to write letters or when I write English words… if you write bad code and force them to collide, then yes… but I am pretty sure you know you can evaluate a color even in a register!

The particle position array is in global memory (already sorted by region and by depth as I described in my post above, and "tuned" during the sort).

Each thread is assumed to represent one pixel (ray) and does the following:

[codebox]
// Conceptual per-pixel rasterization kernel: one thread shades one pixel of its block's
// screen rectangle. Assumes the particle arrays are already binned per rectangle and
// sorted back to front, positions are pre-projected into the rectangle's local 2D space,
// and VisualRadius is precomputed (see the notes below the code).
__global__ void ShadeRectangle(const float* gmemposx, const float* gmemposy,
                               const float* gmemvisrad, float3 particlecolor,
                               float blending, unsigned int LocalParticleCount,
                               unsigned int blockOffset, unsigned int* GlobalMemImage)
{
    int BlockIndexInGrid    = gridDim.x * blockIdx.y + blockIdx.x;
    int ThreadIndexPerBlock = blockDim.x * threadIdx.y + threadIdx.x;
    int tid = blockDim.x * blockDim.y * BlockIndexInGrid + ThreadIndexPerBlock;

    float invblending = 1.0f - blending;
    float3 color = make_float3(0.0f, 0.0f, 0.0f);    // initialize background color in registers

    for (unsigned int i = 0; i < LocalParticleCount; i++) { // loop over this rectangle's particles, farthest first
        // since all threads in the block use the same particle data, staging it in shared
        // memory would give an enormous speed gain, but this is not intended to be final
        // code, just the concept of the rasterization
        float ParticlePosx = gmemposx[i + blockOffset];
        float ParticlePosy = gmemposy[i + blockOffset];
        float visualradius = gmemvisrad[i + blockOffset];
        visualradius *= visualradius;                // square it to avoid a sqrt later

        float dx = (float)threadIdx.x - ParticlePosx;
        float dy = (float)threadIdx.y - ParticlePosy;
        float distsquared = dx * dx + dy * dy;

        if (distsquared < visualradius) {
            // on each iteration one sphere is composited, blending its premultiplied
            // color value over the background
            color.x = color.x * invblending + particlecolor.x * blending;
            color.y = color.y * invblending + particlecolor.y * blending;
            color.z = color.z * invblending + particlecolor.z * blending;
        }
    }

    unsigned int stdcolor = ((unsigned int)(color.x * 255.0f) << 16) |
                            ((unsigned int)(color.y * 255.0f) <<  8) |
                             (unsigned int)(color.z * 255.0f);
    GlobalMemImage[tid] = stdcolor;
}
[/codebox]

This example is just a concept and assumes the pos array has been "tuned" inside the sort algorithm, so positions are already precalculated in each rectangle's local 2D space during the depth sort (depth is just the ratio of scaling between the near and far clipping planes, an easy and inexpensive calculation), and VisualRadius is also already calculated because it was necessary for proper sorting by region.

The only limit is the amount of device memory (not an issue for the new Tesla).

Hi,
I'm reviving this old thread because I actually wrote an emulated rasterizer with CUDA :thumbup:

Here's a screenshot:
Untextured Render

The stats:
-50 FPS
-Nvidia 8600 GT
-345000 polygons in 5 batches
-2 deferred lights with specular (not so visible)
-rim lighting
-1024*1024 resolution
-rasterization takes up to 75% of frame time (6600 ms per batch)

So, after these tests I can say that the only viable method to achieve rasterization in CUDA is brute-forcing a thread for each primitive;
this is because even if you get locks, even if the threads completely diverge, etc., you will always have far fewer visible primitives than pixels (even if that is not the case for the OP).

So, it is true that one can set up a really simple kernel for each pixel… but the performance is in fact worse, because of the binning overhead (one of the heaviest algorithms) and because of the sheer number of pixels, more than one million at 1280*1024 resolution.
So one heavy kernel that runs on 1/3 of the threads is actually faster, and by a lot: with a binning per-pixel rasterizer I was getting 15 fps and it wasn't even complete, and the binning was done offline on the CPU.

Not to mention the fact that with a per-pixel rasterizer you would have to recompute and re-read the interpolation values (3 colors + 1 3D coordinate) for every single pixel, while in the first case you read the vertex color once per thread and recompute the coordinates once per scanline.

So, imho, CUDA rasterization isn't viable in real-world applications for now; just use DX or OGL.
Anyway, things might change with GT300 and MIMD, or with Larrabee, or just with NVIDIA exposing the rasterizer units to CUDA (why don't they, anyway?).

The topic itself is ludicrous. Why would you want to render 50 billion spheres? What kind of simulation is this? A galaxy? A blizzard?

Well, if it is a galaxy or a blizzard, I would try to render as actual spheres only those that are close enough to the camera (or big enough to affect more than one pixel). Everything else should be a point with an averaged color. Even a billion points is still a lot to handle, so accumulate a small alpha value (opacity) for each pixel containing those tiny "spheres", in a ray-tracing-like way, and then render it to a plane for each of the six perpendicular directions (making a skybox that you can update however often you want). Since you don't want to update it every frame, warp fields of quads (one for each plane) to emulate fast changes in the scene. Also, 50 billion is still crazy. Zooming in or moving through those 50 billion closely enough that you would actually want every sphere to count in the render would be absolutely pointless; the detail would change so rapidly that it would just be a bunch of flashing colors. I would only dare to do 10 million (much less than 50 billion!) as an absolute maximum. I could imagine you needing only about 3-5 million.

But anyway, there it is. Have fun.