I am currently implementing a ray tracer in CUDA.
Since every pixel is handled by one thread, I have a question.
Does it make sense to store the pixel results in shared memory per block first and copy them to the pixel buffer afterwards,
or is writing directly to the pixel buffer the better choice?
You could do it either way. If your threads take widely varying amounts of time, it may be more efficient to have them write to shared memory first and then have the whole block write to device memory at once at the end. The saving with that method comes from fewer device memory write transactions, since it avoids incompletely coalesced writes. That may or may not be your bottleneck, though; it likely depends on how much the run time of each thread’s work varies.
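A rough sketch of the shared-memory variant, in case it helps. The names `tracePixelStub`, `renderStaged`, `renderStagedToHost` and the 16x16 tile size are made up for illustration, and the stub just stands in for the real ray tracing work. Error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_W 16  // assumed block width (illustrative choice)
#define TILE_H 16  // assumed block height (illustrative choice)

// Hypothetical stand-in for the real per-pixel ray tracing work.
__device__ uchar4 tracePixelStub(int x, int y, int width, int height)
{
    unsigned char r = static_cast<unsigned char>(255 * x / width);
    unsigned char g = static_cast<unsigned char>(255 * y / height);
    return make_uchar4(r, g, 0, 255);
}

// Each thread stores its result in shared memory first; after a block-wide
// barrier the block writes its tile to global memory in one go.
// Must be launched with blockDim == (TILE_W, TILE_H).
__global__ void renderStaged(uchar4* pixels, int width, int height)
{
    __shared__ uchar4 tile[TILE_H][TILE_W];

    int x = blockIdx.x * TILE_W + threadIdx.x;
    int y = blockIdx.y * TILE_H + threadIdx.y;
    bool inside = (x < width) && (y < height);

    if (inside) {
        tile[threadIdx.y][threadIdx.x] = tracePixelStub(x, y, width, height);
    }

    // Every thread of the block has to reach the barrier, so there is
    // no early return for out-of-range threads above.
    __syncthreads();

    if (inside) {
        // Threads with consecutive threadIdx.x write consecutive addresses.
        pixels[y * width + x] = tile[threadIdx.y][threadIdx.x];
    }
}

// Host helper: renders a width x height image and copies it back to the host.
std::vector<uchar4> renderStagedToHost(int width, int height)
{
    std::vector<uchar4> host(static_cast<size_t>(width) * height);
    uchar4* dev = nullptr;
    cudaMalloc(&dev, host.size() * sizeof(uchar4));

    dim3 block(TILE_W, TILE_H);
    dim3 grid((width + TILE_W - 1) / TILE_W, (height + TILE_H - 1) / TILE_H);
    renderStaged<<<grid, block>>>(dev, width, height);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(host.data(), dev, host.size() * sizeof(uchar4), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Note that each thread here still writes back its own pixel, so the staging mainly changes when the writes are issued, not which addresses are touched. Whether that buys anything is something to check with a profiler on your own kernel.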
The easiest option, however, is to just have every thread write to the global memory buffer. That’s simplest, and you can always change it later if you find it’s a bottleneck.
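For reference, a minimal sketch of the direct-write version, assuming one thread per pixel and a row-major `uchar4` pixel buffer. `shadePixel`, `renderDirect` and `renderDirectToHost` are illustrative names, and the shading function is only a placeholder for the actual tracing. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical placeholder for the real per-pixel ray tracing work.
__device__ uchar4 shadePixel(int x, int y, int width, int height)
{
    unsigned char r = static_cast<unsigned char>(255 * x / width);
    unsigned char g = static_cast<unsigned char>(255 * y / height);
    return make_uchar4(r, g, 0, 255);
}

// One thread per pixel; each thread writes its result straight to global memory.
__global__ void renderDirect(uchar4* pixels, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // skip threads outside the image

    // Neighbouring threads in x hit neighbouring addresses in the row.
    pixels[y * width + x] = shadePixel(x, y, width, height);
}

// Host helper: renders a width x height image and copies it back to the host.
std::vector<uchar4> renderDirectToHost(int width, int height)
{
    std::vector<uchar4> host(static_cast<size_t>(width) * height);
    uchar4* dev = nullptr;
    cudaMalloc(&dev, host.size() * sizeof(uchar4));

    dim3 block(16, 16);  // assumed block size, tune for your GPU
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    renderDirect<<<grid, block>>>(dev, width, height);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(host.data(), dev, host.size() * sizeof(uchar4), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

With this mapping, threads that are adjacent in `threadIdx.x` write to adjacent pixels of the same row, which is the access pattern that usually coalesces well.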
ah great! thanks.
seems like i have to find out more about coalescing! but everything you said sounds plausible.