Here is some CUDA code (work in progress).
Unlike Roger Alsing’s approach, I am not using alpha blending but strict linear superposition of the polygons. This means each polygon adds to the resulting image without affecting the contributions of previously rendered polygons. This linearity should allow each individual polygon to be optimized on its own (while leaving the other polygons fixed).
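In other words, each pixel is simply the sum of the signed colors of all polygons covering it, so polygon order does not matter. A minimal sketch of what such a kernel looks like (the Poly struct, coversPixel, and all other names here are illustrative, not my actual code):

    // Minimal sketch of the linear superposition idea. All names are
    // illustrative, not taken from the actual code.
    struct Poly
    {
        float2 v[3];   // a convex polygon, here a triangle
        float3 rgb;    // signed color: negative channels subtract
    };

    // Point-in-convex-polygon test via edge sign checks
    // (assumes counter-clockwise winding).
    __device__ bool coversPixel(const Poly &p, float x, float y)
    {
        for (int i = 0; i < 3; ++i)
        {
            float2 a = p.v[i];
            float2 b = p.v[(i + 1) % 3];
            if ((b.x - a.x) * (y - a.y) - (b.y - a.y) * (x - a.x) < 0.0f)
                return false;
        }
        return true;
    }

    __global__ void renderLinear(const Poly *polys, int numPolys,
                                 float3 *accum, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float3 sum = make_float3(0.0f, 0.0f, 0.0f);
        for (int i = 0; i < numPolys; ++i)
        {
            if (coversPixel(polys[i], x + 0.5f, y + 0.5f))
            {
                // Plain addition: order-independent, unlike alpha blending.
                sum.x += polys[i].rgb.x;
                sum.y += polys[i].rgb.y;
                sum.z += polys[i].rgb.z;
            }
        }
        accum[y * width + x] = sum;  // clamped to 8 bit in a later stage
    }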
Rendering is done in a CUDA kernel, currently into a 512x512 RGB matrix. Rendering works on convex polygons, each having a signed 8-bit RGB color, meaning a polygon can either add to the color output or subtract from it. Superposition of the polygon RGB colors is currently done in the floating-point domain, before an 8-bit RGB clamp is applied for the final output. I found working in floating point to be a little faster than working with ints. I intend to compute the error function as part of the rendering process, before the RGB clamping, so that I can abort rendering early when the error goes out of bounds (i.e. the mutation was bad).
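The clamp-plus-error stage could look roughly like this (again only a sketch with made-up names; note that atomicAdd on floats requires compute capability 2.0, so on older cards the error would have to be reduced per block and summed afterwards):

    // Sketch of the error + clamp stage, with made-up names. errorSum is a
    // single float in global memory, zeroed before the kernel launch.
    __global__ void clampAndError(const float3 *accum, const uchar4 *target,
                                  uchar4 *out, float *errorSum,
                                  int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;
        int idx = y * width + x;

        float3 c = accum[idx];

        // Squared error against the target image, computed before the
        // clamping, as described above.
        // NOTE: atomicAdd on float needs compute capability >= 2.0;
        // on older GPUs, reduce per block and sum the partial results.
        float dr = c.x - (float)target[idx].x;
        float dg = c.y - (float)target[idx].y;
        float db = c.z - (float)target[idx].z;
        atomicAdd(errorSum, dr * dr + dg * dg + db * db);

        // An early abort would periodically compare the running error
        // against a bound here and skip the rest of a bad mutation.

        // 8-bit clamp for the final output.
        float r = fminf(fmaxf(c.x, 0.0f), 255.0f);
        float g = fminf(fmaxf(c.y, 0.0f), 255.0f);
        float b = fminf(fmaxf(c.z, 0.0f), 255.0f);
        out[idx] = make_uchar4((unsigned char)r, (unsigned char)g,
                               (unsigned char)b, 255);
    }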
The current code sample renders 3 triangles (the middle one subtracting from the final output by using negative colors). The sample is based on the “Box Filter” SDK sample and uses OpenGL. It should also compile on Linux if you define TARGET_LINUX instead of TARGET_WIN32 and apply a modified “BoxFilter” Makefile from the CUDA SDK 2.0.
Now my problem is that this kernel does “only” ~900 FPS on my nVidia 9600 GSO when rendering just three polygons. I need some advice on how to make this rasterization faster.
You can still enable the box filter with the +, - and [, ] keys. It brings the FPS down even further but softens the edges.
UPDATE: The most recent addition to my renderer is that it computes an error metric during rendering (which can be easily converted into a PSNR). The FPS values reported are now much more accurate.
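For reference, converting the accumulated squared error into a PSNR is a one-liner on the host (assuming errorSum holds the summed squared per-channel differences over all pixels, as in the sketch above):

    // Host side: convert the accumulated squared error into a PSNR.
    // errorSum = sum of squared per-channel differences over all pixels.
    #include <math.h>

    float psnrFromError(float errorSum, int width, int height)
    {
        float mse = errorSum / (3.0f * width * height);   // 3 channels
        if (mse <= 0.0f)
            return INFINITY;                              // identical images
        return 10.0f * log10f(255.0f * 255.0f / mse);     // 8-bit peak value
    }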