Hi. I am developing a CUDA-based ray tracer and need some help. The implementation is complete, but I do not know what to do next. First, it runs a kernel with x threads per block and y blocks, where x equals the width of the screen and y equals the height of the screen. The kernel calculates the color of each pixel and stores it via a 3-dimensional float pointer. Now I am unsure what to do next. I want to use OpenGL to draw the pixels on the screen, but I don’t know whether the method that executes the kernel should return a float pointer with the pixel data, or whether I should use the OpenGL interop and map the pixels to a buffer. Either way, I also don’t know how the pixel pointer should be formatted. Any advice?
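Roughly, the launch I described looks like this (a simplified sketch, not my actual code; tracePixel() stands in for the ray-tracing work):

```cuda
// One thread per pixel: gridDim.x = height (one block per row),
// blockDim.x = width (one thread per column).
__global__ void renderKernel(float* pixels, int width)
{
    int x = threadIdx.x;  // column
    int y = blockIdx.x;   // row

    // Here I flattened the pixel storage: 3 floats (R, G, B)
    // per pixel, row-major.
    int idx = (y * width + x) * 3;

    float3 c = tracePixel(x, y);  // placeholder for the actual tracing
    pixels[idx + 0] = c.x;
    pixels[idx + 1] = c.y;
    pixels[idx + 2] = c.z;
}

// Host side:
// renderKernel<<<height, width>>>(d_pixels, width);
```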
A bit OT, but that might not be a good way to distribute the work. Too many threads in a block can leave very few blocks actually running in parallel; it has to do with the warp size, IIRC. A better approach would be fewer threads per block but more blocks.
I think there’s a “sweet spot” in the number of threads per block that saturates the GPU without running afoul of the warp size. IIRC it was 192 threads per block, but I may be off here. That also depends on the register usage per thread and probably some other things I don’t remember right now.
If you could use a smaller, constant number of threads per block, you’d probably get better performance. How about tiling the screen? You could use 192-pixel tiles: at 1600x1200 resolution, that’s 1,920,000 pixels = 10,000 blocks. You should get much better performance.
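A sketch of what that tiling could look like (hypothetical kernel name; a 1-D grid with a fixed 192-thread block size, as suggested above):

```cuda
// Fixed block size; one thread per pixel across a 1-D grid of tiles.
const int THREADS_PER_BLOCK = 192;

__global__ void renderKernel(float* pixels, int width, int height)
{
    int pixel = blockIdx.x * blockDim.x + threadIdx.x;
    if (pixel >= width * height) return;  // guard the last partial tile
    int x = pixel % width;
    int y = pixel / width;
    // ... trace the ray for (x, y) and write its color ...
}

// Host side: 1600x1200 = 1,920,000 pixels = 10,000 blocks of 192 threads.
// Rounding up keeps it correct for resolutions not divisible by 192.
int numPixels = width * height;
int numBlocks = (numPixels + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
renderKernel<<<numBlocks, THREADS_PER_BLOCK>>>(d_pixels, width, height);
```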
I’m afraid I can’t help you with displaying, I’ve never done that :)
There are no absolute rules for the best number of threads per block. But, barring any algorithmic constraints, some magic numbers are:
32 threads: This is the size of a warp, and if you are using fewer than 32 threads per block, the stream processors will be idle some fraction of the time. (This is true even though there are only 8 stream processors per multiprocessor; things are pipelined.)
64 threads: At this point, you avoid register bank conflicts. (This suggests that register memory behaves a lot like shared memory.)
192 threads: With this many threads, you ensure that immediate read-after-write of a register does not introduce a pipeline stall.
256 threads: This is the largest number of threads you can have per block and still achieve 100% occupancy, i.e. 768 active threads per multiprocessor (assuming resources permit 3 blocks to run per multiprocessor).
512 threads: The max number of threads you can have per block.
That said, unless you are really fine-tuning an algorithm, you should ignore everything but the first point: you want a multiple of 32 threads per block. If your algorithm permits, make the threads-per-block count something you can easily change via a parameter to your host function, and then benchmark many block sizes in steps of 32 on some real data.
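The benchmarking loop could look something like this (a sketch, assuming renderKernel takes the launch configuration as described; error checking omitted):

```cuda
// Benchmark block sizes in steps of 32 using CUDA events for timing.
for (int threads = 32; threads <= 512; threads += 32) {
    int blocks = (numPixels + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    renderKernel<<<blocks, threads>>>(d_pixels, width, height);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%3d threads/block: %.2f ms\n", threads, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```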
What is a register bank conflict? And how do you prevent register bank conflicts with a 32-threads-per-block configuration?
I have no idea beyond what is mentioned in the manual; all it says is that you need 64 threads to avoid it. I wouldn’t worry about it, since similar conflicts in shared memory have only a small effect on performance.
To the OP’s question: I used the OpenGL interop and had CUDA write to a shared buffer object. Look at the Mandelbrot example in the SDK; it’s pretty straightforward.
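A rough sketch of that interop path, as the SDK samples do it (using the cudaGL* buffer-object API; error checking and GL context setup omitted):

```cuda
// 1. Create a GL pixel buffer object and register it with CUDA (once at startup).
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, width * height * 4, NULL, GL_DYNAMIC_DRAW);
cudaGLRegisterBufferObject(pbo);

// 2. Each frame: map the PBO, let the kernel write into it, unmap.
//    RGBA8 (uchar4 per pixel) is the easiest format to display directly,
//    so the kernel would convert its float colors to bytes when writing.
uchar4* d_pixels;
cudaGLMapBufferObject((void**)&d_pixels, pbo);
renderKernel<<<blocks, threads>>>(d_pixels, width, height);
cudaGLUnmapBufferObject(pbo);

// 3. Draw the PBO contents to the screen.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glDrawPixels(width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
```

This avoids copying the frame back to host memory: the pixels stay on the GPU from the kernel all the way to the display.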