I was wondering: if I launched 100 million rays, I would assume all 100 million rays wouldn’t be traversed in parallel. There has to be a limit which would fill up the SMs and RT Cores, such that the remaining rays are scheduled to run after earlier rays finish. Now how would I go about finding that number for my hardware setup?
Let’s differentiate the launch dimensions from the number of rays.
The optixLaunch dimension arguments width, height, and depth define how many threads are started.
In OptiX, the product of these launch dimensions is limited to 2^30, see the Limits chapter inside the OptiX Programming Guide: https://raytracing-docs.nvidia.com/optix8/guide/index.html#limits#limits
Depending on the ray tracing algorithm implemented inside your device programs, each launch index (thread) can call optixTrace multiple times, so the number of rays is usually not equal to the launch dimension. They are only equal if you have exactly one optixTrace call inside the ray generation program, without a loop, and no optixTrace calls anywhere else (e.g. no recursive rays inside closest-hit programs).
Now, how that number of threads is mapped onto the available hardware resources of the underlying GPU depends on the resources used inside your ray tracing kernels and on the scheduler implementation inside OptiX.
Since OptiX is implemented in CUDA, you might want to have a look into the CUDA Programming Guide to see how threads are grouped into warps, blocks, and grids.
The number of streaming multiprocessors (SMs) and individual cores available on your GPU can be found in the GPU specifications.
There are some Wikipedia pages which summarize that, e.g. here: https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture)
When analyzing your OptiX device code with Nsight Compute, you will be able to see the resource usage of your kernels, the number of blocks, and the warp occupancy of your own device functions. How the individual rays are scheduled onto the RT cores isn’t exposed, though.
If you’re profiling your application for performance, start with Nsight Systems to find bottlenecks due to synchronizations or memory transfers first.