Maximum Launch Dimension in OptiX 6

I have been trying for some time to transition my code from OptiX 5 to 6, but I seem to be running into an issue with the maximum launch dimension, which appears to differ between OptiX 5 and 6.

The documentation for the launch functions (rtContextLaunch[*]D) states: “For 3D launches, the product of width and depth must be smaller than 4294967296 (2^32).” I have found this to be true, except in OptiX 6. It appears that for v6+, the product of the launch dimensions cannot exceed 1024^3 (2^30). This seems to be the case for launches of any dimensionality. When the total launch size exceeds 1024^3, I receive the error:

OptiX Error: 'Unknown error (Details: Function “RTresult _rtContextLaunch2D(RTcontext, unsigned int, RTsize, RTsize)” caught exception: Encountered a rtcore error: m_api.launch3D( cmdlist, pipeline, launchBufferVA, scratchBufferVA, raygenSbtRecordVA, exceptionSbtRecordVA, firstMissSbtRecordVA, missSbtRecordSize, missSbtRecordCount, firstInstanceSbtRecordVA, instanceSbtRecordSize, instanceSbtRecordCount, firstCallableSbtRecordVA, callableSbtRecordSize, callableSbtRecordCount, toolsOutputVA, toolsOutputSize, scratchBufferSizeInBytes, width, height, depth ) returned (9): Launch failure)

This can be reproduced using one of the SDK examples; in my case I used “optixSphere”. I made the following changes:

  1. optixSphere.cpp line 42: change “width” to 2048*1024 and “height” to 1024 (or any width*height>1024^3).
  2. optixSphere.cpp line ~150: decrease the buffer size to avoid a potential out-of-memory error: RT_CHECK_ERROR( rtBufferSetSize2D( *output_buffer_obj, 1, 1 ) );
  3. pinholeCamera.cu line 60: add ‘return;’ to the beginning of the launch function to isolate the issue to the launch configuration.

In v6.0+ (I tested several versions), this produces the error above at launch. In earlier versions, it runs without error. As a side note, my much more complicated production code exhibits the same behavior. This is problematic for me, since I have fairly large launches. I could tile them to 1024^3, but that would result in a lot of launches (sidebar question: what is the potential performance hit for this?). I’d like to upgrade to OptiX 7 one day, but I won’t have the time for it anytime soon.

OS: tried on Debian 9 and Ubuntu 18.04.6
CUDA: tried CUDA 9.0, 9.2, and 11.2
GPU: tried Titan Xp, V100

Thank you.

Hi @bnbailey,

Indeed, the OptiX 6 maximum launch size is exactly 2^30, or 1024^3. The documentation stating it’s 2^32 is out of date and wrong. The new launch size limit of 2^30 is intentional, not a bug, and it’s unfortunately not going to change. So, we should perhaps discuss strategies for working around this limit, or whether your launch really needs to be that large.
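As a rough illustration of that limit, here is a minimal host-side sketch that checks launch dimensions against 2^30 before calling rtContextLaunch[*]D. The names kMaxLaunchThreads and fitsLaunchLimit are hypothetical helpers, not part of the OptiX API, and rejecting zero-sized launches is an assumption of this sketch.

```cpp
#include <cstdint>

// Maximum total launch size discussed in this thread for OptiX 6: 2^30 (1024^3).
constexpr std::uint64_t kMaxLaunchThreads = std::uint64_t{1} << 30;

// Hypothetical helper: returns true when width * height * depth <= 2^30.
// It divides instead of multiplying so the check itself cannot overflow.
bool fitsLaunchLimit(std::uint64_t width,
                     std::uint64_t height = 1,
                     std::uint64_t depth = 1)
{
    // Assumption of this sketch: a zero-sized launch is treated as invalid.
    if (width == 0 || height == 0 || depth == 0) return false;
    if (width > kMaxLaunchThreads) return false;

    std::uint64_t remaining = kMaxLaunchThreads / width;  // budget left for height * depth
    if (height > remaining) return false;

    remaining /= height;                                   // budget left for depth
    return depth <= remaining;
}
```

With this check, a 1024 x 1024 x 1024 launch passes, while the 2048*1024 x 1024 launch from the reproduction steps is rejected.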

Your sidebar question is relevant here, because once your launch dimensions exceed 2^30, if the threads are doing any work at all, it’s likely that the overhead of the launch itself is very small compared to the runtime of your kernel. This means that you can safely launch multiple times without incurring noticeable overheads. I just timed this hypothesis on my machine – a launch size of 2^30 with a raygen() function that has a return at the top takes approximately 10 milliseconds to complete. The launch function itself will be on the order of a few dozen microseconds I believe, so the overhead of the launch in this case is maybe in the range of 0.1% - 0.2%, and that’s for an empty kernel. If you do some rendering and store a result then your kernel is going to take longer, which will shrink the launch overhead even further.

My first question is, are you writing a separate value to global memory for every separate thread in your launch? Or are you perhaps using atomics or indexing to write to a buffer that is smaller than your launch? Would your setup potentially allow for putting a loop in raygen, for example, to reduce the launch size? Could your launch size be equal to the number of pixels you’re rendering?
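To make the loop-in-raygen idea concrete, here is a small host-side C++ sketch of one possible index mapping, in which each launch thread strides through the work items. It only simulates the indexing on the CPU. The function itemsForThread is hypothetical, and whether a strided assignment suits your workload is an assumption.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical illustration: the work items one launch thread would process
// if raygen looped with a stride equal to the total launch size.
// Thread `launchIndex` handles launchIndex, launchIndex + launchSize, ...
std::vector<std::uint64_t> itemsForThread(std::uint64_t launchIndex,
                                          std::uint64_t launchSize,
                                          std::uint64_t totalItems)
{
    std::vector<std::uint64_t> items;
    if (launchSize == 0) return items;  // nothing to do for an empty launch

    for (std::uint64_t i = launchIndex; i < totalItems; i += launchSize)
        items.push_back(i);
    return items;
}
```

With a launch size of 4 and 10 work items, thread 1 would process items 1, 5, and 9. Every item is covered exactly once across the four threads, so the launch size can stay below the limit regardless of the total amount of work.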

Tiling by launches of 2^30 is, as you noted, a potential option. The overhead of multiple launches should be pretty low, though depending on whether you have long-running threads, the overhead certainly could be higher than the ~0.2% I quoted above (because of the long-tail effects of multiple separate tiles).
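For the tiling option, here is a minimal sketch of how a large 2D launch could be split into bands of rows that each stay within 2^30 threads. Tile and planRowTiles are hypothetical names, not OptiX API. The sketch assumes the width alone fits within the limit, and that you pass each tile's row offset to your device programs yourself (for example through a context variable) so they can reconstruct the global index.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical description of one tile: a band of full-width rows.
struct Tile {
    std::uint64_t rowOffset;  // first row covered by this tile
    std::uint64_t rows;       // number of rows in this tile
};

// Splits a width x height launch into row bands with width * rows <= maxThreads.
std::vector<Tile> planRowTiles(std::uint64_t width,
                               std::uint64_t height,
                               std::uint64_t maxThreads = std::uint64_t{1} << 30)
{
    // Assumption of this sketch: a single row must fit in one launch.
    if (width == 0 || width > maxThreads)
        throw std::invalid_argument("width must be in [1, maxThreads]");

    const std::uint64_t rowsPerTile = maxThreads / width;  // at least 1 here

    std::vector<Tile> tiles;
    for (std::uint64_t row = 0; row < height; row += rowsPerTile)
        tiles.push_back({row, std::min(rowsPerTile, height - row)});
    return tiles;
}
```

For the 2048*1024 x 1024 launch from the reproduction steps, this plan yields two tiles of 512 rows each, so each tile would be launched as a 2048*1024 x 512 launch.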

OptiX 7, BTW, has the same launch size limit, so while we do encourage upgrading when possible, it wouldn’t actually help with this specific issue. One thing using OptiX 7 could potentially help with here is that it’s much easier in OptiX 7 to use multiple CUDA streams, and if you’re tiling your launches, it might be advantageous for you to further eliminate launch overheads by launching multiple kernels at the same time on different streams. Just something to consider for the future when you have more time.


David.