CUDA Raytracer

Hey guys,

I’m developing a raytracer in CUDA. For supersampling, I use shared memory to store the sample values of one pixel, so each pixel is computed by its own thread block, with one thread per sample. At the end of the kernel, thread 0 waits for the other threads in the block to finish and then sums up the sample values. I don’t think this is very efficient.
Normally, on the CPU, each core gets a bucket (a small region of the image) to render. But that doesn’t seem possible on the GPU, since the watchdog kicks in once a kernel has been running for a few seconds. Does anyone have a better idea for doing supersampling with CUDA? I would appreciate any suggestions.
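
For reference, here is a simplified sketch of the per-pixel setup described above, not the actual renderer. The names `SAMPLES`, `shade_sample`, `render` and `render_image` are placeholders made up for this sketch, the ray tracing itself is replaced by a dummy function, and I assume one thread per sample with 64 samples per pixel.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumption: samples per pixel == threads per block.
constexpr int SAMPLES = 64;

// Placeholder for the real ray tracing code: returns a dummy value per sample.
__device__ float shade_sample(int px, int py, int s)
{
    return static_cast<float>((px + py + s) % 2);
}

// One thread block per pixel, one thread per sample.
__global__ void render(float *image, int width)
{
    __shared__ float samples[SAMPLES];

    int px = blockIdx.x;   // pixel column
    int py = blockIdx.y;   // pixel row
    int s  = threadIdx.x;  // sample index within the pixel

    samples[s] = shade_sample(px, py, s);
    __syncthreads();       // wait until every sample of this pixel is stored

    // Thread 0 alone sums the samples; the other threads sit idle here.
    if (s == 0) {
        float sum = 0.0f;
        for (int i = 0; i < SAMPLES; ++i)
            sum += samples[i];
        image[py * width + px] = sum / SAMPLES;
    }
}

// Host helper: launches one block per pixel and copies the image back.
// Error checking is omitted to keep the sketch short.
std::vector<float> render_image(int width, int height)
{
    std::vector<float> host(static_cast<size_t>(width) * height);
    float *dev = nullptr;
    cudaMalloc(&dev, host.size() * sizeof(float));

    dim3 grid(width, height);
    render<<<grid, SAMPLES>>>(dev, width);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(host.data(), dev, host.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

The part I’m unsure about is the final loop, where a single thread does all the summing for the block.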

Thanks in advance.