My own graphics pipeline with CUDA: is it possible?

Hi, guys.

I know a little about CUDA. :book: I wonder whether it is possible to implement a custom graphics pipeline with CUDA, i.e. send vertices and indices and rasterize triangles (in real time, of course)?

Thanks, Daniel

I’m sure it would be possible, but not necessarily efficiently. If you need to rasterize triangles, why not use the old and tried pipeline?

I think it may be a future prospect… The old and tried pipeline has a lot of limitations. We have to complicate our algorithms to fit them into vertex processing and pixel processing. I would like to see the graphics pipeline as a set of abstract processing stages, for example vertex transformation processing, vertex N-patch processing, …, pixel processing, … :rolleyes: Certainly, I can treat pixel processing as a generic processing stage and build a custom pipeline on top of it (I am researching this approach now). But I guess this approach has overhead. So I was thinking about a custom pipeline based on CUDA…

I am in the same situation as you. I need a custom pipeline. Have you made any progress? Found any other information online?


While I admire your vision - until you can make MPs more efficient than ROPs in terms of speed vs. die space, you’re always going to be less efficient (and thus generally slower) in terms of raw rasterization speed.

Sounds cool. I know there has been some work using wavelets for rendering; I honestly don’t know the first thing about it though (but I assume it’s different from the typical rasterization scheme).

You might find this paper interesting:

“Real-Time Reyes-Style Adaptive Surface Subdivision,” Patney, A., and Owens, J.D. ACM Transactions on Graphics, 27(5), December 2008.

It combines CUDA processing with OpenGL - not precisely what you are suggesting, but it seems to be a step in that direction. There is a nice description, a link to the paper, and some other relevant links on this page:…implementation/

Jeremy Furtek

also related:

Wow, at last someone who had the same idea as me :rolleyes:

I am working on a full-CUDA graphics pipeline right now, and I think that it goes beyond most of the traditional limits (early z-fail, batch count, scattering), so for now it is interesting.

For example, a natively deferred rendering pipeline should manage to extract good performance from the hardware even in CUDA, because apart from rasterising (which is inherently non-parallel) every other screen-space pass maps naturally to the CUDA architecture… so it should not create bottlenecks.

For now I have wireframe meshes (the ever-famous bunny), and it should be easy to get depth and normals into the pipeline.

Anyway, I don’t expect it to be “productive” anytime soon!

NVIDIA spent a lot of die space on ROPs and probably a few other thingies to implement the fixed-function graphics pipeline. ROPs aren’t available through CUDA, so you’re underutilising the hardware already.

That’s unless you come up with a pipeline that somehow works without whatever the ROPs are doing, relying (almost) exclusively on the SPs. I wasn’t sure what they actually do, so I googled and found: “ROPs handle anti-aliasing, Z and color compression, and the actual writing of the pixel to the output buffer.” I’m not a graphics whiz; perhaps raytracing, radiosity etc. can do without this?

Otherwise, I’d look into DirectX 11 and its compute shaders. If I’m not misinformed, they should bring CUDA-like compute kernels and integrate them with the normal pipeline (with other shaders and fixed function ops).

This is true, but anyway I’m doing this for “research” purposes and not to have an engine useful for making games today.

If I were aiming for that, I could just as well skip DX11 and use DX9 like everyone else :rolleyes:

Anyway, I think that ROPs will be the next thing to be swallowed by the general-purpose SPs:

In the same link you provided, it’s written that “The move towards fewer ROPs than fragment pipelines is a way gpu designers eliminate unneeded complexity from their chips without sacrificing performance”…

And there was a recent article that demonstrated how you could write a rasterizer for Larrabee nearly as fast as the hardware ones (too bad that I can’t find the link).

So I think that ignoring ROPs is obviously bad on current hardware, but things could change in the future.

How is rasterization inherently non-parallel? It’s perfectly parallel.

Hmm, I think it looks parallel but it isn’t really… for each triangle, you have to determine which pixels it contains.

The problem with this is that a triangle can contain ANY number of pixels: none, one, or even the whole screen.

So, the worst case of a simple implementation would have one or two nested for loops that always diverge, and their memory writes can’t easily be coalesced because they are scattered across the whole output buffer… making it a nightmare for current architectures.
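
To make that concrete, here is a minimal CPU-side C++ sketch of the simple per-triangle approach, assuming a bounding-box scan with edge functions (chosen here purely for illustration; nobody in this thread has confirmed using it). `Vec2`, `edgeFunction` and `rasterizeTriangle` are hypothetical names, and no fill rule for shared edges is applied. In a CUDA version with one thread per triangle, the function body would be what each thread runs, so the two loops would have a different trip count in every thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Vec2 { float x, y; };  // hypothetical 2D screen-space point

// Twice the signed area of triangle (a, b, p). Its sign tells on which side
// of the edge a->b the point p lies.
float edgeFunction(Vec2 a, Vec2 b, Vec2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Rasterizes ONE triangle into a width*height id buffer and returns how many
// pixels it covered. With one thread per triangle, this whole function would
// be the work of a single thread.
int rasterizeTriangle(Vec2 v0, Vec2 v1, Vec2 v2, int id,
                      int width, int height, std::vector<int>& buffer) {
    float area = edgeFunction(v0, v1, v2);
    if (area == 0.0f) return 0;  // degenerate triangle covers nothing

    // Screen-clamped bounding box: its size, and therefore the loop trip
    // count, is different for every triangle.
    int minX = std::max(0, (int)std::floor(std::min({v0.x, v1.x, v2.x})));
    int maxX = std::min(width - 1, (int)std::ceil(std::max({v0.x, v1.x, v2.x})));
    int minY = std::max(0, (int)std::floor(std::min({v0.y, v1.y, v2.y})));
    int maxY = std::min(height - 1, (int)std::ceil(std::max({v0.y, v1.y, v2.y})));

    int covered = 0;
    for (int y = minY; y <= maxY; ++y) {
        for (int x = minX; x <= maxX; ++x) {
            Vec2 p{x + 0.5f, y + 0.5f};  // sample at the pixel center
            float w0 = edgeFunction(v1, v2, p);
            float w1 = edgeFunction(v2, v0, p);
            float w2 = edgeFunction(v0, v1, p);
            // Inside if all three edge functions agree with the winding.
            bool inside = (area > 0.0f)
                ? (w0 >= 0.0f && w1 >= 0.0f && w2 >= 0.0f)
                : (w0 <= 0.0f && w1 <= 0.0f && w2 <= 0.0f);
            if (inside) {
                buffer[y * width + x] = id;  // scattered write
                ++covered;
            }
        }
    }
    return covered;
}
```

On an 8x8 buffer, the triangle (0,0), (4,0), (0,4) covers 10 pixel centers, a triangle spanning the whole screen covers all 64, and a sub-pixel triangle can cover none, which is exactly the imbalance described above.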

In fact Big_Mac is right, and the viability of a CUDA pipeline probably depends heavily on the impact of the rasterizer.

Forgive a stupid question, but can’t it somehow be done the other way round? I.e. for each pixel, find the triangle that contains it (probably the one that’s closest to the viewer, doing some z-sort along the way)?

This sort of reminds me of how I did Voronoi tessellation on CUDA the other day. Instead of:

    for each voronoi point
        loop over pixels
            if pixel is closer to this voronoi point than to any other voronoi point
                assign this point's index to the pixel

I went with:

    for each pixel
        loop over voronoi points
            find the voronoi point that's closest
        assign its index to self

This was quite effective in CUDA (compute bound). It still took 15ms to tessellate a 1024 by 1024 image with 200 Voronoi points (on 8800 GTS 512), but it could be done smarter.
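
For reference, a minimal CPU-side C++ sketch of the second (per-pixel) variant. This is not the kernel I actually ran; `Site`, `nearestSite` and `voronoi` are hypothetical names, and distances are measured from pixel centers as an assumption. In CUDA, `nearestSite` would be the body of one thread per pixel (indexed from `blockIdx` and `threadIdx`), so every thread runs the same loop length and writes exactly one output element.

```cpp
#include <cassert>
#include <limits>
#include <vector>

struct Site { float x, y; };  // hypothetical Voronoi point in pixel coordinates

// Work done for ONE pixel: scan every site and return the index of the
// nearest one. Ties go to the site with the lower index.
int nearestSite(int px, int py, const std::vector<Site>& sites) {
    int best = -1;
    float bestDist = std::numeric_limits<float>::max();
    for (int i = 0; i < (int)sites.size(); ++i) {
        float dx = sites[i].x - (px + 0.5f);  // measured from the pixel center
        float dy = sites[i].y - (py + 0.5f);
        float d = dx * dx + dy * dy;          // squared distance is enough
        if (d < bestDist) {
            bestDist = d;
            best = i;
        }
    }
    return best;
}

// CPU stand-in for the kernel launch: the two loops play the role of the
// thread grid, and each iteration is independent of all the others.
std::vector<int> voronoi(int width, int height, const std::vector<Site>& sites) {
    std::vector<int> labels(width * height, -1);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            labels[y * width + x] = nearestSite(x, y, sites);
    return labels;
}
```

The cost is width * height * sites.size() distance evaluations, which is why this is compute bound rather than memory bound.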

I’m sure it can be done in some other way, but it certainly won’t be a “simple” method…
For example, examining every triangle to see whether it contains the pixel is not an optimal solution…
Most scenes today easily have around 300,000 to 500,000 polygons, and a for loop with that many iterations (although non-divergent) would take far too long to execute per pixel, even if its body were just an assignment.

Z-sort could even make things worse, because the sort would heavily impact memory, and then you have a really common worst case: the sky box.
The sky box often occupies large parts of the screen, and it is by definition the farthest thing… so you still need to traverse all the triangles before you can assign a pixel to it.

So, it could be done in another way, but I guess it has to be really smart :shifty:
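
To show where the cost comes from, here is a minimal CPU-side C++ sketch of the brute-force per-pixel version, assuming screen-space triangles with a depth per vertex. `Vertex`, `Triangle`, `edge` and `nearestTriangle` are hypothetical names, and there is no acceleration structure or binning, which a serious implementation would need. The loop keeps the nearest hit instead of sorting, but it still has to visit every triangle.

```cpp
#include <cassert>
#include <limits>
#include <vector>

struct Vertex { float x, y, z; };      // screen-space position plus depth
struct Triangle { Vertex a, b, c; };

// Twice the signed area of triangle (a, b, (px, py)).
float edge(const Vertex& a, const Vertex& b, float px, float py) {
    return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
}

// Work done for ONE pixel: test every triangle and return the index of the
// nearest one covering the sample point (px, py), or -1 if none does.
int nearestTriangle(float px, float py, const std::vector<Triangle>& tris) {
    int best = -1;
    float bestZ = std::numeric_limits<float>::max();
    for (int i = 0; i < (int)tris.size(); ++i) {
        const Triangle& t = tris[i];
        float area = edge(t.a, t.b, t.c.x, t.c.y);
        if (area == 0.0f) continue;  // skip degenerate triangles
        // Barycentric weights; dividing by the signed area makes the test
        // independent of the winding order.
        float w0 = edge(t.b, t.c, px, py) / area;
        float w1 = edge(t.c, t.a, px, py) / area;
        float w2 = edge(t.a, t.b, px, py) / area;
        if (w0 < 0.0f || w1 < 0.0f || w2 < 0.0f) continue;  // sample is outside
        float z = w0 * t.a.z + w1 * t.b.z + w2 * t.c.z;     // interpolated depth
        if (z < bestZ) {  // a running minimum replaces the z-sort
            bestZ = z;
            best = i;
        }
    }
    return best;
}
```

A pixel that only sees a far "sky" triangle still runs through the whole list before that triangle wins, so the work per pixel is always proportional to the triangle count.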


I am about to start implementing triangle rasterization in memory with CUDA.
As far as parallel processing is concerned, there is still a way if we use the scanline method instead of the flood-fill method to rasterize the triangle. Many examples describe rasterization in two steps: the triangle is split into two triangles at its middle vertex, then the top part is rasterized first and the bottom part second. So there is room to rasterize the two parts in parallel, which could cut the time by up to half.

Of course, a Z-buffer has to be implemented in order to avoid sorting the triangles.
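
What I have in mind looks roughly like the following CPU-side C++ sketch. It is untuned and rests on several assumptions: vertices are already in screen space, pixels are sampled at their centers, and `Vertex`, `atHeight`, `fillScanline` and `rasterizeScanline` are hypothetical names. Note that each scanline is computed independently here, so in principle the rows (not just the top and bottom halves) could be handed to separate threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

struct Vertex { float x, y, z; };  // screen-space position plus depth

// Point on edge a->b at height y, with x and z linearly interpolated.
// The caller guarantees a.y != b.y.
Vertex atHeight(const Vertex& a, const Vertex& b, float y) {
    float t = (y - a.y) / (b.y - a.y);
    return {a.x + t * (b.x - a.x), y, a.z + t * (b.z - a.z)};
}

// Fills ONE scanline of a triangle whose vertices are sorted by y
// (v0.y <= v1.y <= v2.y) and returns the number of pixels written.
int fillScanline(const Vertex& v0, const Vertex& v1, const Vertex& v2,
                 int row, int id, int width,
                 std::vector<float>& depth, std::vector<int>& color) {
    float y = row + 0.5f;  // sample at the pixel center
    if (v0.y == v2.y || y < v0.y || y > v2.y) return 0;

    Vertex l = atHeight(v0, v2, y);  // the long edge spans both halves
    Vertex r = v1;                   // fallback when y sits on a flat bottom edge
    if (y < v1.y)          r = atHeight(v0, v1, y);  // top half
    else if (v1.y != v2.y) r = atHeight(v1, v2, y);  // bottom half
    if (l.x > r.x) std::swap(l, r);

    int xStart = std::max(0, (int)std::ceil(l.x - 0.5f));
    int xEnd = std::min(width - 1, (int)std::floor(r.x - 0.5f));
    int written = 0;
    for (int x = xStart; x <= xEnd; ++x) {
        float t = (r.x == l.x) ? 0.0f : (x + 0.5f - l.x) / (r.x - l.x);
        float z = l.z + t * (r.z - l.z);
        int idx = row * width + x;
        if (z < depth[idx]) {  // Z-buffer test, so triangles need no sorting
            depth[idx] = z;
            color[idx] = id;
            ++written;
        }
    }
    return written;
}

// Sorts the vertices by y and fills every scanline the triangle touches.
int rasterizeScanline(Vertex v0, Vertex v1, Vertex v2, int id,
                      int width, int height,
                      std::vector<float>& depth, std::vector<int>& color) {
    if (v0.y > v1.y) std::swap(v0, v1);
    if (v1.y > v2.y) std::swap(v1, v2);
    if (v0.y > v1.y) std::swap(v0, v1);
    int rowStart = std::max(0, (int)std::ceil(v0.y - 0.5f));
    int rowEnd = std::min(height - 1, (int)std::floor(v2.y - 0.5f));
    int written = 0;
    for (int row = rowStart; row <= rowEnd; ++row)
        written += fillScanline(v0, v1, v2, row, id, width, depth, color);
    return written;
}
```

The depth buffer is expected to start at a large value. On a GPU, two triangles could hit the same pixel at the same time, so the depth test and write would also need to be made atomic somehow, which this sketch does not attempt.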

Has anyone worked on this? If someone has used CUDA for polygon rasterization, would it be possible to share an example?

Thanks in advance.

Keep smiling,

Has anyone succeeded in building a graphics pipeline with at least a rasterization stage?