I’m using OptiX Prime in a soft real-time system where the CPUs and GPUs are shared by multiple processes. I am trying to optimize resource usage to free up as much CPU as possible while raycasting is happening on the GPU.
I tried using cudaDeviceScheduleBlockingSync and cudaDeviceScheduleYield, but that didn’t seem to help. To be clear, I called cudaSetDeviceFlags(cudaDeviceScheduleYield) in my main just before creating the OptiX context with rtpContextCreate(). Using ftrace, I measured that the CPU was not yielding while the GPU was running (CPU time == wall time for my raycasting function).
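For anyone who wants to reproduce the CPU time versus wall time comparison without ftrace, here is a minimal sketch, assuming Linux and the POSIX clock_gettime clocks. It contains no CUDA or OptiX Prime calls. The names Timing, measure, sleeping_wait and spinning_wait are made up for illustration; sleeping_wait and spinning_wait only stand in for a wait that blocks and a wait that spins.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <time.h>

// CPU time and wall time consumed by one call, both in seconds.
struct Timing {
    double cpu_s;
    double wall_s;
};

// Reads a POSIX clock and returns its value in seconds.
static double read_clock(clockid_t id) {
    timespec ts{};
    const int rc = clock_gettime(id, &ts);
    assert(rc == 0);
    (void)rc;  // silence the unused warning when NDEBUG is defined
    return static_cast<double>(ts.tv_sec) + ts.tv_nsec * 1e-9;
}

// Runs fn() on the calling thread and reports how much CPU time that thread
// consumed versus how much wall-clock time elapsed.
template <typename Fn>
Timing measure(Fn&& fn) {
    const double cpu0 = read_clock(CLOCK_THREAD_CPUTIME_ID);
    const double wall0 = read_clock(CLOCK_MONOTONIC);
    fn();
    return Timing{read_clock(CLOCK_THREAD_CPUTIME_ID) - cpu0,
                  read_clock(CLOCK_MONOTONIC) - wall0};
}

// Hypothetical stand-in for a wait that blocks: the thread is descheduled.
inline void sleeping_wait(std::chrono::milliseconds d) {
    std::this_thread::sleep_for(d);
}

// Hypothetical stand-in for a wait that spins: the thread stays on the CPU.
inline void spinning_wait(std::chrono::milliseconds d) {
    const auto end = std::chrono::steady_clock::now() + d;
    while (std::chrono::steady_clock::now() < end) {
    }
}
```

Wrapping the raycasting function in measure() should give the same kind of signal as the ftrace numbers: cpu_s close to wall_s suggests the thread is spinning, while cpu_s much smaller than wall_s suggests it is actually sleeping. CLOCK_THREAD_CPUTIME_ID only covers the calling thread, so CPU time spent in any helper threads would not show up here.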
So I undid that and instead used RTP_QUERY_HINT_ASYNC and a poll loop on rtpQueryGetFinished with 500us sleeps in between. This does help: the CPU time is now much smaller than the wall time for my raycasting function.
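The shape of that poll loop is roughly the following. This is only a sketch, not my actual code: is_finished is a hypothetical callback that would wrap rtpQueryGetFinished, and the timeout is an extra safety net I'm adding here for illustration. No OptiX Prime call is made in the snippet itself.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Sleeps between checks until is_finished() returns true or the timeout
// expires. Returns true if completion was observed, false on timeout.
// is_finished is a hypothetical stand-in for a wrapper around
// rtpQueryGetFinished.
bool poll_until_finished(const std::function<bool()>& is_finished,
                         std::chrono::microseconds interval,
                         std::chrono::microseconds timeout) {
    assert(interval.count() > 0);
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!is_finished()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // give up rather than poll forever
        }
        std::this_thread::sleep_for(interval);  // thread is descheduled here
    }
    return true;
}
```

With a 500us interval the thread spends nearly all of its time asleep, which is why the CPU time drops. The cost is that completion can be noticed up to one interval late (plus scheduler wake-up latency).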
I was wondering if there is a better way that doesn’t involve a poll loop and instead relies on some interrupt-based mechanism, so that the CPU would sleep until the GPU job is finished.
When you used cudaSetDeviceFlags(cudaDeviceScheduleYield), were you calling rtpQueryFinish() immediately after rtpQueryExecute()? I believe cudaDeviceScheduleYield won’t automatically yield on launch; it only controls what happens when you wait on a CUDA event. rtpQueryExecute() inserts an event into the CUDA stream, and rtpQueryFinish() waits on that event. If you don’t wait on an event, your CPU will continue executing CPU code after the launch, which could look like a failure to yield. You could also try inserting your own event into the CUDA stream and waiting on it with cudaEventSynchronize().
Thanks David for your help. Unfortunately I’m still struggling with this.
My rtpQueryExecute call is followed by a cudaMemcpy call to copy the hits back to the host. I don’t have any CPU work I can do between the two, so I’d like the CPU thread to yield during that time. cudaMemcpy should block until the query completes, so I feel that rtpQueryFinish is not required.
I tried a bunch of things:
- setting / not setting cudaDeviceScheduleYield / cudaDeviceScheduleBlockingSync
- using / not using RTP_QUERY_HINT_ASYNC
- calling / not calling rtpQueryFinish
and several combinations of the above …
So far my only success was to not set any device flags, do an async query, and add a while loop on rtpQueryGetFinished + usleep.
I’m not very familiar with ftrace. Just to satisfy my own lack of context, do you have examples of code that is yielding where ftrace registers the yield properly?
One thing that might help is to run Nsight Systems (https://developer.nvidia.com/nsight-systems), where you can get a timeline of both CPU and GPU work in tandem. This will help verify that the CPU is not yielding properly and might also reveal what the problem is, if there is a problem. I’ve asked the primary OptiX Prime developer about your issue, and he said there are no known issues between the CUDA sync primitives and OptiX Prime; they should be behaving as expected.
BTW, what are your OptiX version & driver version & GPU model numbers? You are on Linux, yes?