Optix 6.5 - interleaving CUDA kernels


I had a couple of questions regarding interleaving Optix and CUDA kernels.

I am computing my image in a few steps within a single frame, by repeatedly calling two distinct OptiX kernels (ray generation programs) in a loop:

while (some condition)
    if (initial iteration)
        context->launch(ENTRY_POINT_INITIAL, some dimensionality); // run only once per frame
    else
        context->launch(ENTRY_POINT_SUBSEQUENT, some reduced dimensionality); // run every subsequent iteration until the image is computed

Every subsequent OptiX kernel is launched with reduced dimensionality, so I assumed that more and more GPU resources become available between subsequent OptiX kernel launches. That is why I decided to try to interleave a post-processing routine that can easily be applied to partial images. The code above is then slightly modified so that the if-statement is followed by an invocation of a CUDA kernel:

while (some condition)
    if (initial iteration)
        context->launch(ENTRY_POINT_INITIAL, some dimensionality);
    else
        context->launch(ENTRY_POINT_SUBSEQUENT, some reduced dimensionality);
    testCUDAKernel<<<grid, block>>>(...); // post-processing applied to the partial image

According to Nsight, the CUDA kernel is indeed interleaved with the OptiX kernels, but only between “ENTRY_POINT_SUBSEQUENT” launches (see the image below).

The OptiX kernels here are represented by “cuEventSynchronize” (as far as I understand). What I don’t understand is why “testCUDAKernel” is always serialized between “ENTRY_POINT_INITIAL” and “ENTRY_POINT_SUBSEQUENT” (here the “cuMemcpy2D_v2” event separates the initial and subsequent runs), regardless of the dimensionality of the OptiX kernel and the CUDA kernel.

  1. Is there some kind of implicit synchronization taking place, when swapping between ray generation programs? Can it be avoided?

  2. What is this cuMemcpy2D_v2 event reported by Nsight?

  3. Initially I also tried running “testCUDAKernel” in a separate stream, but running it in the default stream interleaves the kernels just the same. Are OptiX 6.5 kernels run in separate streams?

I am using OptiX 6.5 on a Quadro P4000 with the 442.74 driver.

Please have a look at this chapter of the OptiX 6.5.0 Programming Guide, which explains the only way to have asynchronous launches in OptiX 6 versions:

Any other launch mechanism is synchronous.

(Correction) The cuMemcpy2D_v2 call is most likely an update of an internal data structure, which happens when you change anything in the scene or any rtVariables between launches.
Because of that, it’s recommended to put variables you need to change regularly between launches into an input buffer and to update its contents instead.
That doesn’t happen in OptiX 7, because there updating any data is your responsibility and is done with CUDA API calls.
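A minimal sketch of that recommendation using the OptiX 6 host API; the struct, buffer name, and variable name here are hypothetical, chosen only to illustrate replacing per-launch rtVariable updates with an input buffer:

```cpp
// Hypothetical per-launch parameters, for illustration only.
struct FrameParams { int iteration; float blendFactor; };

// One-time setup: a 1-element user-format input buffer instead of rtVariables.
optix::Buffer paramsBuffer = context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_USER, 1);
paramsBuffer->setElementSize(sizeof(FrameParams));
context["frame_params"]->set(paramsBuffer); // read via rtBuffer in the ray gen program

// Per iteration: update the buffer contents rather than setting rtVariables,
// which avoids triggering internal data-structure updates between launches.
FrameParams* p = static_cast<FrameParams*>(paramsBuffer->map());
p->iteration   = iteration;
p->blendFactor = 1.0f / float(iteration + 1);
paramsBuffer->unmap();

context->launch(ENTRY_POINT_SUBSEQUENT, reducedWidth, reducedHeight);
```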

For complete control over GPU parallelism like that, you would need to use OptiX 7, which uses native CUDA for the management of devices, contexts, and streams.
optixLaunch() calls there are asynchronous and take a stream as an argument. You decide what happens asynchronously and when.
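As a sketch of what that looks like in OptiX 7: the launch and the post-processing kernel go on separate CUDA streams, so the driver is free to overlap them. The pipeline, SBT, parameter struct, and kernel here are assumed to be set up elsewhere; names are placeholders:

```cpp
// Assumed already created: pipeline, sbt, d_params (CUdeviceptr to Params),
// params (host-side Params), testCUDAKernel, grid/block dimensions.
cudaStream_t optixStream, postStream;
cudaStreamCreate(&optixStream);
cudaStreamCreate(&postStream);

// Asynchronously upload launch parameters, then launch OptiX on its stream.
cudaMemcpyAsync(reinterpret_cast<void*>(d_params), &params, sizeof(Params),
                cudaMemcpyHostToDevice, optixStream);
optixLaunch(pipeline, optixStream, d_params, sizeof(Params), &sbt,
            width, height, /*depth=*/1);

// The post-processing kernel can overlap with the OptiX launch on its own stream.
testCUDAKernel<<<grid, block, 0, postStream>>>(/* partial image */);

// Synchronize only when the results are actually needed.
cudaStreamSynchronize(optixStream);
cudaStreamSynchronize(postStream);
```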
