My first question is rather simple: what does optixLaunch actually create when you render at a 1024 x 768 size?
Reading about CUDA indexing I see that there is:
- a grid with x, y, z dimensions containing blocks
- a block with x, y, z dimensions containing threads
- and threads
You do not need to be concerned about that, because OptiX provides a single-ray programming model and all scheduling onto the available GPU hardware is handled internally.
You only need to care about the optixLaunch arguments and use optixGetLaunchDimensions and optixGetLaunchIndex to do work per launch index.
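For illustration, here is a minimal ray generation program sketch using these two device functions. The Params struct, its members, and the program name __raygen__sketch are assumptions made up for this example, not fixed OptiX names:

```cpp
#include <optix.h>

// Hypothetical launch parameters for this sketch; only the pattern of an
// extern "C" __constant__ struct is prescribed by OptiX.
struct Params
{
    float4*      outputBuffer; // one element per launch index
    unsigned int width;        // launch width in launch indices
};

extern "C" __constant__ Params params;

extern "C" __global__ void __raygen__sketch()
{
    const uint3 idx = optixGetLaunchIndex();      // this thread's launch index
    const uint3 dim = optixGetLaunchDimensions(); // the optixLaunch width/height/depth

    // Map the 2D launch index to a linear output buffer index.
    const unsigned int linear = idx.y * params.width + idx.x;

    // ... generate a primary ray and call optixTrace() here ...

    // Write one result per launch index (a gradient as a placeholder).
    params.outputBuffer[linear] = make_float4(float(idx.x) / float(dim.x),
                                              float(idx.y) / float(dim.y),
                                              0.0f, 1.0f);
}
```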
Note that the OptiX launch dimension (the product width × height × depth) is limited to 2^30, which is smaller than in native CUDA. See the Limits chapter inside the OptiX Programming Guide.
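For a 1024 x 768 render, the host-side call is a sketch like the following; pipeline, stream, d_params, and sbt are assumed to have been created beforehand, and OPTIX_CHECK stands for the usual error-checking macro pattern from the SDK examples:

```cpp
// One launch index per pixel; width * height * depth must stay <= 2^30.
OPTIX_CHECK(optixLaunch(pipeline,       // OptixPipeline
                        stream,         // CUstream
                        d_params,       // CUdeviceptr to the launch parameters
                        sizeof(Params), // size of the launch parameter struct
                        &sbt,           // OptixShaderBindingTable
                        1024u,          // width
                        768u,           // height
                        1u));           // depth
```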
These launch indices are effectively CUDA threads running in warps of 32 threads. How many blocks are used internally depends on the amount of resources being used.
You should be mindful of divergence within warps when programming your kernels: the more divergent the code executed by the threads in a warp, the lower the warp utilization and the worse the efficiency of your kernel. So write device code in a way that most threads do the same thing whenever possible.
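As a sketch of what that means, assume a hypothetical per-hit shading decision; Hit, shadeDiffuse, and shadeSpecular are made up for this example:

```cpp
#include <cuda_runtime.h>

// Hypothetical hit record for this sketch.
struct Hit
{
    float specularWeight; // 0 = fully diffuse, 1 = fully specular
};

__device__ float3 shadeDiffuse (const Hit& h) { return make_float3(0.8f, 0.8f, 0.8f); }
__device__ float3 shadeSpecular(const Hit& h) { return make_float3(1.0f, 1.0f, 1.0f); }

// Divergent: if rays within a warp hit different material types, the warp
// executes both branches serially with the non-matching lanes masked off.
__device__ float3 shadeDivergent(const Hit& h)
{
    if (h.specularWeight > 0.5f) // varies per thread within a warp
        return shadeSpecular(h);
    return shadeDiffuse(h);
}

// More warp-friendly: every thread executes the same instructions and the
// material difference lives in the data (the blend weight), not in control flow.
__device__ float3 shadeUniform(const Hit& h)
{
    const float3 d = shadeDiffuse(h);
    const float3 s = shadeSpecular(h);
    const float  w = h.specularWeight;
    return make_float3(d.x + (s.x - d.x) * w,
                       d.y + (s.y - d.y) * w,
                       d.z + (s.z - d.z) * w);
}
```

Whether evaluating both shading models is actually cheaper than diverging depends on the workload, so profile before restructuring.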
The other thing which affects the scheduling is the number of registers a kernel is allowed to use. The default in OptiX is 128 because there is usually a performance cliff when going higher. In a few cases, depending on the complexity of the device code and the underlying GPU, higher values can make sense, for example when allowing too few registers causes too much register spilling, so there is a setting in OptiX to experiment with that register count. (See the link below.)
Mind that this is per GPU; you should not change the default blindly for all GPUs when you’re not able to verify the effect. I recommend not touching the default before you’ve optimized everything else.
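The setting in question is the maxRegisterCount field of the OptixModuleCompileOptions; a minimal sketch:

```cpp
OptixModuleCompileOptions moduleCompileOptions = {};

// 0 (OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT) keeps the OptiX default.
moduleCompileOptions.maxRegisterCount = OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT;

// Only after profiling, and per GPU: experiment with an explicit limit,
// e.g. moduleCompileOptions.maxRegisterCount = 168;
moduleCompileOptions.optLevel   = OPTIX_COMPILE_OPTIMIZATION_DEFAULT;
moduleCompileOptions.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_DEFAULT;
```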
You can see the occupancy and the number of blocks an OptiX kernel launch used inside an Nsight Compute profile summary.
Read this post for more details: https://forums.developer.nvidia.com/t/high-stall-mio-throttle/274590/4
I am not sure, but when you render a 1024 x 768 image, do you simply create a single 1024 x 768 block with 786,432 threads?
Nope, that’s not how the grouping of threads into blocks works. If you read the CUDA Programming Model chapter inside the CUDA Programming Guide again, you’ll find this sentence: “On current GPUs, a thread block may contain up to 1024 threads.”
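As a sketch in native CUDA (not OptiX) of how such a launch is typically decomposed: 786,432 threads do not fit into one block, so the work is split into a grid of many smaller blocks. The 16 x 16 block size here is just a common choice, not anything OptiX mandates:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(int width, int height)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return; // guard against partial blocks
    // ... per-pixel work ...
}

int main()
{
    const int width = 1024, height = 768;
    const dim3 block(16, 16);                          // 256 threads per block
    const dim3 grid((width  + block.x - 1) / block.x,  // 64 blocks in x
                    (height + block.y - 1) / block.y); // 48 blocks in y
    kernel<<<grid, block>>>(width, height);            // 3072 blocks = 786,432 threads
    cudaDeviceSynchronize();
    printf("launched %u x %u blocks of %u x %u threads\n",
           grid.x, grid.y, block.x, block.y);
    return 0;
}
```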
My other question has to do with handling the hit event. Let’s say a ray hits a triangle and I want to store the x coordinates of all the rays that had a hit. How can I transfer this collection of x coordinates back to the main function?
I’ll answer that inside the other thread with the same question.