I am trying to run a computation on the GPU concurrently with my OptiX kernel. I would like these kernels to run at the same time, so that they can send messages to each other via an atomically-accessed queue in global memory. They are launched on distinct non-default, non-blocking streams.
However, I am encountering issues when introducing shared memory into my concurrent kernel. Specifically, if I have enough thread blocks to saturate my device and am using more than 512 bytes of shared memory, the kernels synchronize and subsequent OptiX launches wait for the CUDA kernel to terminate.
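For reference, here is a minimal sketch of the setup described above; the stream and kernel names, the queue type, and the launch sizes are illustrative placeholders, not the actual repro code:

```cpp
#include <cuda_runtime.h>
#include <optix.h>

// Hypothetical worker kernel: uses dynamic shared memory and polls an
// atomically-accessed queue in global memory (names are placeholders).
__global__ void workerKernel(int* queue)
{
    extern __shared__ int scratch[];           // dynamic shared memory
    scratch[threadIdx.x] = atomicAdd(queue, 0);
    // ... message-passing loop elided ...
}

void launchBoth(OptixPipeline pipeline, const OptixShaderBindingTable& sbt,
                CUdeviceptr d_params, size_t paramsSize,
                int* d_queue, int numBlocks, unsigned width, unsigned height)
{
    // Two non-blocking streams so neither launch serializes against the default stream.
    cudaStream_t optixStream, workerStream;
    cudaStreamCreateWithFlags(&optixStream,  cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&workerStream, cudaStreamNonBlocking);

    // Enough blocks to saturate the device, plus a shared memory request
    // above 512 bytes, which is what reproduces the serialization.
    size_t dynamicSmem = 1024;
    workerKernel<<<numBlocks, 256, dynamicSmem, workerStream>>>(d_queue);

    // OptiX launch on its own stream; with true concurrency it overlaps the worker.
    optixLaunch(pipeline, optixStream, d_params, paramsSize, &sbt, width, height, 1);
}
```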
My system specs are: Ubuntu 22.04.3 LTS x86_64, GeForce RTX 4080 16 GB, driver 535.129.03, CUDA 12.2, OptiX 8.0.0, GCC 11.3.0.
I’ve forked the OptiX Apps and added a very simple, minimal example of this effect, which is immediately visible in the runtime: when the kernels synchronize, they have to wait for the long-running kernel to end and the FPS tanks; when they can run concurrently, it exits normally at 60 FPS. The change is a very small diff:
I am not sure what could be causing this behaviour. Is there something in the scheduler that discriminates against OptiX kernels? Could there be a parameter that OptiX is setting internally that overrides the L1/shared memory tradeoff and conflicts with the concurrent kernel that requires shared memory?
How can I ensure that my kernels will run concurrently?
I don’t have a definitive answer yet, but I’ll ask around the CUDA team. I don’t think there is anything in the scheduler that’s biased against OptiX kernels; from the hardware’s perspective they are mostly just CUDA kernels. There might be a launch configuration conflict between your CUDA and OptiX kernels; the shared memory size might indeed prevent them from sharing the same SMs. You might get some useful information out of Nsight Compute if you profile these kernels separately and look at the shared memory configuration.
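If you want to rule out a carveout conflict on the CUDA side explicitly, one quick experiment (the kernel name below is just a placeholder for yours) would be to pin the CUDA kernel’s preferred L1/shared split before launching it and see whether the behavior changes:

```cpp
#include <cuda_runtime.h>

// Placeholder for your actual concurrent kernel.
__global__ void workerKernel(int* queue) { }

void configureCarveout()
{
    // Request an explicit shared memory carveout for the CUDA kernel so its
    // L1/shared split is not left to whatever configuration the other work
    // on the same SMs happens to prefer. This is only a hint to the runtime.
    cudaFuncSetAttribute(workerKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);  // or an explicit 0-100 percentage
}
```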
FWIW, whenever I’ve tried to have kernels run concurrently, I have usually observed that they end up mostly serializing anyway, with some overlap near the beginning or end but not a lot of overlap in total. If a kernel saturates the GPU, it tends to hog the GPU until there are larger openings for other work. This might be the right choice for maximizing throughput.
I’m completely speculating here and have not tried this myself, but maybe you can set your launch dimensions such that each kernel gets only half the thread blocks the GPU can execute concurrently, so that neither kernel is able to hog the entire GPU. That might work and give you lower latency for the message passing, but it might also slow overall throughput down, so it might or might not be a good tradeoff for you.
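If you want to try that, the occupancy API gives you the numbers you would need; a sketch, with the kernel name and block size as placeholders:

```cpp
#include <cuda_runtime.h>

// Placeholder for your actual concurrent kernel.
__global__ void workerKernel(int* queue) { }

// Size the worker grid to roughly half the blocks the GPU can keep resident,
// leaving the other half of the SM capacity free for the OptiX launch.
int halfSaturationGridSize(int blockSize, size_t dynamicSmemBytes)
{
    cudaDeviceProp props{};
    cudaGetDeviceProperties(&props, 0);

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, workerKernel, blockSize, dynamicSmemBytes);

    int fullGrid = blocksPerSM * props.multiProcessorCount;
    return fullGrid > 1 ? fullGrid / 2 : 1;
}
```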
I would emphasize again that, even once future drivers support __threadfence() in OptiX device code (they will; I just don’t know the driver version yet), this idea of communicating between an OptiX kernel and a native CUDA kernel will never work as efficiently as the previously recommended implementation of a wavefront renderer, where OptiX does only the ray-primitive intersection and native CUDA kernels do the ray generation and shading with all the native CUDA features you want. Please give that a try.
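To make that structure concrete, here is a rough sketch of one wavefront pass; all kernel and buffer names are placeholders, not taken from an existing sample:

```cpp
#include <cuda_runtime.h>
#include <optix.h>

// Placeholder kernels: in a real renderer these hold your ray generation and
// shading code, free to use any native CUDA feature (shared memory, fences, ...).
__global__ void generateRaysKernel(float4* rays, unsigned numRays) { }
__global__ void shadeKernel(const float4* hits, float4* rays, unsigned numRays) { }

// One wavefront pass: CUDA generates rays, OptiX intersects them, CUDA shades.
void renderWavefront(OptixPipeline pipeline, const OptixShaderBindingTable& sbt,
                     CUdeviceptr d_params, size_t paramsSize, cudaStream_t stream,
                     float4* d_rays, float4* d_hits, unsigned numRays, int maxBounces)
{
    unsigned block = 256;
    unsigned grid  = (numRays + block - 1) / block;

    generateRaysKernel<<<grid, block, 0, stream>>>(d_rays, numRays);

    for (int bounce = 0; bounce < maxBounces; ++bounce)
    {
        // OptiX does only the ray-primitive intersection for the current wave.
        optixLaunch(pipeline, stream, d_params, paramsSize, &sbt, numRays, 1, 1);

        // Shading runs as a native CUDA kernel on the same stream, so the
        // launches are ordered by the stream and no in-kernel messaging is needed.
        shadeKernel<<<grid, block, 0, stream>>>(d_hits, d_rays, numRays);
    }
}
```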