How might one implement an efficient Z Buffer in CUDA?
Does the standard rasterization path rely on special hardware features that are not yet exposed through CUDA?
I would like to implement some rendering in CUDA and use a Z buffer both for occlusion queries on bounding primitives and for raster-style depth testing of pixel fragments.
I have not gotten very far with test implementations, but scattered reads and writes to global memory are quite slow. Maintaining a Z pyramid or something similar, and perhaps making use of cached memory, might help.
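For concreteness, here is a minimal sketch of the naive per-fragment depth test I have in mind. The `Fragment` struct and the `depthTestKernel` and `resolveDepth` names are made up for illustration. It assumes depths lie in [0, 1], so that the bit pattern of a float can be compared as an unsigned integer with `atomicMin`. It only resolves the nearest depth per pixel; it carries no color and has no Z pyramid, and CUDA error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Hypothetical fragment record: pixel coordinates plus a depth in [0, 1].
struct Fragment { int x; int y; float depth; };

// One thread per fragment; each thread scatters its depth into the Z buffer.
__global__ void depthTestKernel(const Fragment* frags, int numFrags,
                                unsigned int* zbuf, int width, int height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numFrags) return;

    Fragment f = frags[i];
    if (f.x < 0 || f.x >= width || f.y < 0 || f.y >= height) return;

    // Assumption: depth is non-negative, so the IEEE-754 bit pattern orders
    // the same way as the float value and an integer min keeps the nearest.
    unsigned int d = __float_as_uint(f.depth);
    atomicMin(&zbuf[f.y * width + f.x], d);
}

// Host helper (illustrative): returns the resolved depth for every pixel.
std::vector<float> resolveDepth(const std::vector<Fragment>& frags,
                                int width, int height)
{
    // Clear the buffer to the far plane (1.0f), stored as raw bits.
    const float farDepth = 1.0f;
    unsigned int farBits;
    std::memcpy(&farBits, &farDepth, sizeof(farBits));
    std::vector<unsigned int> hostZ(static_cast<size_t>(width) * height, farBits);
    const size_t zBytes = hostZ.size() * sizeof(unsigned int);

    unsigned int* dZ = nullptr;
    cudaMalloc(&dZ, zBytes);
    cudaMemcpy(dZ, hostZ.data(), zBytes, cudaMemcpyHostToDevice);

    if (!frags.empty()) {
        const size_t fBytes = frags.size() * sizeof(Fragment);
        Fragment* dFrags = nullptr;
        cudaMalloc(&dFrags, fBytes);
        cudaMemcpy(dFrags, frags.data(), fBytes, cudaMemcpyHostToDevice);

        int n = static_cast<int>(frags.size());
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        depthTestKernel<<<blocks, threads>>>(dFrags, n, dZ, width, height);
        cudaDeviceSynchronize();
        cudaFree(dFrags);
    }

    cudaMemcpy(hostZ.data(), dZ, zBytes, cudaMemcpyDeviceToHost);
    cudaFree(dZ);

    // Reinterpret the stored bits as floats again.
    std::vector<float> out(hostZ.size());
    std::memcpy(out.data(), hostZ.data(), zBytes);
    return out;
}
```

The atomics land wherever the fragments happen to fall, which is exactly the scattered global memory access pattern that seems slow to me.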