Hi all,
I’m looking into ways to speed up some analysis. I have a set of viewpoints (let’s say 1000) from which I trace an image and count unique material hits. My current implementation loops through the viewpoints, launches the analysis for each point (all launches share the same settings except Position/Direction/Up) and then writes to a buffer. However, this does not seem to utilise my RTX GPU fully at low resolutions, so I’m thinking of ways to do larger launches.
The best implementation I can come up with would be to put all my viewpoints in a buffer in the launch parameters and then use the depth parameter of optixLaunch to tell my raygen program which viewpoint to use and where to write the result in the output buffer.
Just wanted to check if this is a feasible approach, if there might be other options or if there’s anything else I should consider.
Thank you.
The best implementation I can come up with would be to put all my viewpoints in a buffer in the launch parameters and then use the depth parameter of optixLaunch to tell my raygen program which viewpoint to use and where to write the result in the output buffer.
Yes, you effectively do a 3D launch with width * height * number_of_viewpoints and render all of them at once.
That is perfectly fine as long as the launch size does not exceed the maximum supported size, which is roughly 2^30 launch indices overall.
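To sketch what the device side could look like (this is not your code, just an assumed Params layout with a viewpoint buffer, a flat output buffer and a simplified camera): the raygen program picks the viewpoint from the z component of the launch index and writes into its own slice of the output buffer.

```
#include <optix.h>

// Hypothetical launch parameter layout: one viewpoint per z-slice of the launch.
struct Viewpoint
{
    float3 position;
    float3 direction;
    float3 up;
};

struct Params
{
    OptixTraversableHandle handle;
    const Viewpoint*       viewpoints;  // device buffer with all viewpoints
    unsigned int*          output;      // width * height * numViewpoints results
    unsigned int           width;
    unsigned int           height;
};

extern "C" __constant__ Params params;

extern "C" __global__ void __raygen__multi_viewpoint()
{
    const uint3 idx = optixGetLaunchIndex();      // idx.z selects the viewpoint

    const Viewpoint vp = params.viewpoints[idx.z];

    // Build a primary ray for pixel (idx.x, idx.y) from this viewpoint.
    // The per-pixel direction depends on your camera model and is omitted here.
    const float3 origin    = vp.position;
    const float3 direction = vp.direction;        // replace with per-pixel direction

    // The closest-hit / miss programs are assumed to write a material id into the payload.
    unsigned int payload = 0;
    optixTrace(params.handle, origin, direction,
               0.0f, 1e16f, 0.0f,
               OptixVisibilityMask(255), OPTIX_RAY_FLAG_NONE,
               0, 1, 0, payload);

    // Each viewpoint writes into its own slice of the output buffer.
    const unsigned int pixel = idx.z * params.width * params.height
                             + idx.y * params.width + idx.x;
    params.output[pixel] = payload;
}
```

On the host you would then call optixLaunch with width, height and the number of viewpoints as the depth.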
Native CUDA launches can be bigger; their limits can be queried from the CUDA device properties (e.g. the maximum grid dimensions).
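Just in case it’s useful, those limits can be read from cudaDeviceProp; a minimal query could look like this (OptiX itself still caps a single optixLaunch at the roughly 2^30 total indices mentioned above):

```
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Maximum grid dimensions for native CUDA kernel launches.
    std::printf("maxGridSize: %d x %d x %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    // Maximum threads per block, for reference.
    std::printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```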
Depending on the target system you’d also need to be careful not to run kernels for too long. Windows versions since Vista have a 2-second limit for the Timeout Detection and Recovery (TDR) mechanism in WDDM(2). Make sure you’re staying under that threshold per launch on boards running under WDDM.
This does not affect the NVIDIA TCC driver mode, i.e. boards dedicated to compute tasks.
How small are your “low resolutions”?
There is a minimum number of threads per launch required to saturate a GPU (it depends on which GPU you’re using).
Anything below 256x256 is most likely too small on high-end boards.
OptiX 7 launches are asynchronous, so if you’re above that thread count you should be able to submit multiple asynchronous launches and keep the GPU busy enough.
That requires some careful asynchronous updating of the launch parameters though, as sketched below. I hinted at that in my example programs.
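A rough host-side sketch of what I mean by that, using the Params layout from above and a hypothetical makeParamsForViewpoint() helper (pipeline, sbt, width, height and numViewpoints are assumed to exist already): each in-flight launch gets its own stream and its own device copy of the launch parameters, so an upload never overwrites parameters a still-running launch is reading.

```
// Sketch only: error checking and cleanup omitted.
const int numSlots = 4;                         // assumption: up to 4 launches in flight
std::vector<cudaStream_t> streams(numSlots);
std::vector<CUdeviceptr>  d_params(numSlots);
Params* h_params = nullptr;                     // pinned host staging, one Params per slot

cudaMallocHost(reinterpret_cast<void**>(&h_params), numSlots * sizeof(Params));
for (int i = 0; i < numSlots; ++i)
{
    cudaStreamCreate(&streams[i]);
    cudaMalloc(reinterpret_cast<void**>(&d_params[i]), sizeof(Params));
}

for (int vp = 0; vp < numViewpoints; ++vp)
{
    const int s = vp % numSlots;

    // Wait until the previous copy and launch using this slot have finished
    // before overwriting its staging buffer and device parameters.
    cudaStreamSynchronize(streams[s]);

    h_params[s] = makeParamsForViewpoint(vp);   // hypothetical helper filling Params

    // Upload the parameters on the same stream as the launch, so the copy is
    // ordered before the kernel but does not block the other streams.
    cudaMemcpyAsync(reinterpret_cast<void*>(d_params[s]), &h_params[s],
                    sizeof(Params), cudaMemcpyHostToDevice, streams[s]);

    optixLaunch(pipeline, streams[s], d_params[s], sizeof(Params),
                &sbt, width, height, /*depth=*/1);
}

for (int i = 0; i < numSlots; ++i)
    cudaStreamSynchronize(streams[i]);
```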
If you’re getting the resulting data back, maybe there is too much device-to-host transfer happening between launches?
To see how the overall application performs, including what is happening between optixLaunch calls, you could profile that application behaviour with Nsight Systems.
For the kernel launches themselves you can profile the kernel with Nsight Compute and see what the main bottlenecks are; that’s normally memory accesses. (Compile your input PTX code with line info, e.g. nvcc’s -lineinfo option.)
More information here: https://developer.nvidia.com/tools-overview
Thanks for the pointers!
How small are your “low resolutions”?
Something like 512x256. I run on a couple of different GPUs but I want to max out an RTX 8000. The 2^30 index limit (which would translate to a maximum launch of 1024x1024x1024?) and the 2-second limit might force me to split my larger launches. Maybe pushing everything async is the way to go.
There’s some transfer there, but only a buffer which holds about 5-10 integers. I could probably have just one buffer, pass the same pointer to each launch, and only copy from device to host after all launches are finished. Will look into Nsight for further profiling.
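Roughly what I have in mind, as a sketch using the Params layout sketched above (baseParams, d_viewpoints, d_params, stream, pipeline and sbt are placeholders for my actual setup): the viewpoints are split into chunks so a launch never exceeds the 2^30 indices, every chunk writes into its own slice of one device buffer, and the download happens once at the end.

```
// Sketch only: error checking omitted.
const size_t maxIndices    = 1ull << 30;
const size_t maxViewpoints = maxIndices / (size_t(width) * height);

std::vector<unsigned int> h_output(size_t(width) * height * numViewpoints);
unsigned int* d_output = nullptr;
cudaMalloc(reinterpret_cast<void**>(&d_output), h_output.size() * sizeof(unsigned int));

for (size_t first = 0; first < numViewpoints; first += maxViewpoints)
{
    const size_t count = std::min(maxViewpoints, numViewpoints - first);

    Params h_params     = baseParams;                        // shared settings
    h_params.viewpoints = d_viewpoints + first;              // this chunk's first viewpoint
    h_params.output     = d_output + first * size_t(width) * height; // this chunk's slice

    // Stream-ordered after the previous launch, so reusing d_params on one stream is safe.
    cudaMemcpyAsync(reinterpret_cast<void*>(d_params), &h_params,
                    sizeof(Params), cudaMemcpyHostToDevice, stream);

    optixLaunch(pipeline, stream, d_params, sizeof(Params),
                &sbt, width, height, static_cast<unsigned int>(count));
}

// Single device-to-host transfer after all chunks are done.
cudaStreamSynchronize(stream);
cudaMemcpy(h_output.data(), d_output,
           h_output.size() * sizeof(unsigned int), cudaMemcpyDeviceToHost);
```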
Just thought I should give an update here.
Launching multiple viewpoints at once while making sure that no launch exceeds 2^30 indices has now cut my analysis time by more than half (from 400ms to 180ms with 2000 viewpoints of 500x250px on an RTX 6000). I would say most time is saved in the device-to-host/host-to-device transfers (fewer downloads of the output buffer and fewer uploads of the launch params).