Efficient Z Buffer in CUDA?

How might one implement an efficient Z Buffer in CUDA?

Does the standard raster pipeline rely on special hardware features that are not yet exposed via CUDA?

I would like to implement some rendering in CUDA and use a Z Buffer both for occlusion queries on bound primitives and for raster-style depth testing of pixel fragments.

I have not gotten very far with test implementations, but scattered reads and writes to global memory are quite slow. Maintaining a Z pyramid or similar, and perhaps making use of cached memory, might help.

The rasterizer hardware is not exposed in CUDA, but it is possible to implement rasterization by bucketing triangles into screen tiles and then rendering each tile into shared memory. This talk at SIGGRAPH did something similar:

“Single-Pass Depth Peeling Via CUDA Rasterizer”
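To make the tile idea concrete, here is a minimal sketch of how it could look. It is not the method from the talk, just one way to do it under stated assumptions: the Tri layout, the 16x16 tile size, and the names rasterTiles and renderDepth are all made up for illustration, depth is assumed to lie in [0, 1], binning is done by brute force on the host to keep things short, and error checking is omitted. Each block owns one tile, keeps that tile's depth values in shared memory, and resolves the depth test with atomicMin on the float bit pattern (non-negative floats order the same way as their unsigned bit patterns).

```cuda
#include <algorithm>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile edge in pixels; 16x16 depths use 1 KB of shared memory

struct Tri { float x[3], y[3], z[3]; };  // hypothetical layout: screen-space x, y and depth in [0, 1]

// 2D cross product of (b - a) and (p - a); its sign tells which side of edge ab the point p lies on.
__host__ __device__ inline float edgeFn(float ax, float ay, float bx, float by, float px, float py) {
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

// One block per tile. binTris[binStart[t] .. binStart[t+1]) lists the triangles binned to tile t.
__global__ void rasterTiles(const Tri* tris, const int* binStart, const int* binTris,
                            float* depthOut, int width, int height, int tilesX) {
    __shared__ unsigned int zTile[TILE * TILE];  // per-tile depth buffer, stored as float bits
    const int tile = blockIdx.x;
    const int tx0 = (tile % tilesX) * TILE, ty0 = (tile / tilesX) * TILE;

    for (int i = threadIdx.x; i < TILE * TILE; i += blockDim.x)
        zTile[i] = __float_as_uint(1.0f);  // clear to the far plane
    __syncthreads();

    // Each thread takes triangles from this tile's bin and rasterizes them into shared memory.
    for (int b = binStart[tile] + threadIdx.x; b < binStart[tile + 1]; b += blockDim.x) {
        const Tri t = tris[binTris[b]];
        const float area = edgeFn(t.x[0], t.y[0], t.x[1], t.y[1], t.x[2], t.y[2]);
        if (area == 0.0f) continue;  // degenerate triangle
        // Triangle bounding box clipped to this tile and to the screen.
        const int x0 = max(tx0, (int)floorf(fminf(t.x[0], fminf(t.x[1], t.x[2]))));
        const int y0 = max(ty0, (int)floorf(fminf(t.y[0], fminf(t.y[1], t.y[2]))));
        const int x1 = min(min(tx0 + TILE, width) - 1, (int)floorf(fmaxf(t.x[0], fmaxf(t.x[1], t.x[2]))));
        const int y1 = min(min(ty0 + TILE, height) - 1, (int)floorf(fmaxf(t.y[0], fmaxf(t.y[1], t.y[2]))));
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x) {
                const float px = x + 0.5f, py = y + 0.5f;  // sample at the pixel center
                // Barycentric weights; dividing by area makes the test winding-independent.
                const float w0 = edgeFn(t.x[1], t.y[1], t.x[2], t.y[2], px, py) / area;
                const float w1 = edgeFn(t.x[2], t.y[2], t.x[0], t.y[0], px, py) / area;
                const float w2 = edgeFn(t.x[0], t.y[0], t.x[1], t.y[1], px, py) / area;
                if (w0 < 0.0f || w1 < 0.0f || w2 < 0.0f) continue;  // pixel center outside
                const float z = w0 * t.z[0] + w1 * t.z[1] + w2 * t.z[2];
                // Depth test: keep the nearest fragment (valid because z >= 0).
                atomicMin(&zTile[(y - ty0) * TILE + (x - tx0)], __float_as_uint(z));
            }
    }
    __syncthreads();

    // Write the finished tile back to global memory in one pass.
    for (int i = threadIdx.x; i < TILE * TILE; i += blockDim.x) {
        const int x = tx0 + i % TILE, y = ty0 + i / TILE;
        if (x < width && y < height) depthOut[y * width + x] = __uint_as_float(zTile[i]);
    }
}

// Host driver: brute-force bounding-box binning, then one kernel launch over all tiles.
std::vector<float> renderDepth(const std::vector<Tri>& tris, int width, int height) {
    const int tilesX = (width + TILE - 1) / TILE, tilesY = (height + TILE - 1) / TILE;
    const int nTiles = tilesX * tilesY;
    std::vector<int> binStart(nTiles + 1, 0), binTris;
    for (int t = 0; t < nTiles; ++t) {
        const float tx0 = (t % tilesX) * TILE, ty0 = (t / tilesX) * TILE;
        for (int i = 0; i < (int)tris.size(); ++i) {
            const Tri& tr = tris[i];
            const bool overlaps =
                std::max({tr.x[0], tr.x[1], tr.x[2]}) >= tx0 && std::min({tr.x[0], tr.x[1], tr.x[2]}) < tx0 + TILE &&
                std::max({tr.y[0], tr.y[1], tr.y[2]}) >= ty0 && std::min({tr.y[0], tr.y[1], tr.y[2]}) < ty0 + TILE;
            if (overlaps) binTris.push_back(i);
        }
        binStart[t + 1] = (int)binTris.size();
    }

    Tri* dTris; int *dStart, *dBin; float* dDepth;
    cudaMalloc(&dTris, tris.size() * sizeof(Tri));
    cudaMalloc(&dStart, binStart.size() * sizeof(int));
    cudaMalloc(&dBin, binTris.size() * sizeof(int));
    cudaMalloc(&dDepth, (size_t)width * height * sizeof(float));
    cudaMemcpy(dTris, tris.data(), tris.size() * sizeof(Tri), cudaMemcpyHostToDevice);
    cudaMemcpy(dStart, binStart.data(), binStart.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dBin, binTris.data(), binTris.size() * sizeof(int), cudaMemcpyHostToDevice);

    rasterTiles<<<nTiles, 128>>>(dTris, dStart, dBin, dDepth, width, height, tilesX);

    std::vector<float> depth((size_t)width * height);
    cudaMemcpy(depth.data(), dDepth, depth.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dTris); cudaFree(dStart); cudaFree(dBin); cudaFree(dDepth);
    return depth;
}
```

The point of the design is that all the scattered, contended depth updates hit shared memory, and global memory only sees one coalesced write per tile at the end. For real workloads the binning would presumably move to the GPU as well, and large triangles would need to be split across threads rather than handled by a single one.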

This would be the jackpot for everyone who does 3D computer vision!