Launch size for best performance

I’m new to OptiX, and I’m trying to understand some of the underlying mechanisms in order to improve performance.

I ran some tests to measure the impact of the launch size (i.e. width and height). I varied those values (width=height, width=1, or height=1) while keeping the same total size. For small values there seems to be no noticeable difference in computation time.

But for larger total sizes I ran into a problem. For a size of 40,000, I get the same computation time for width=height and for height=1. But when I set width=1, it does not work and I get this error:

OptiX error: Unknown error (Details: Function "RTresult _rtContextLaunch2D(RTcontext, unsigned int, RTsize, RTsize)" caught exception: Encountered a CUDA error: cudaDriver().CuEventSynchronize( m_event ) returned (700): Illegal address)

The documentation says the product of width and depth must be smaller than 2^32. But what about the height?
And in my case, the total size is 40000, which is far smaller than 2^32!!!

Does the maximum size depend on the computations we are doing?
In any case, how can we find out the maximum size, so that we can add safeguards to avoid this error?

For information:
OS: Linux CentOS7
GPU: GeForce RTX 2080 Ti
CUDA: 10.1
OptiX: 6
Drivers: 418.67


Please use CUDA 10.0 to compile OptiX 6 *.cu device code to *.ptx as recommended in the OptiX release notes.
You might also want to update your display driver to a more recent version.

How big you can make the launch size is normally limited first by the amount of VRAM required, then by the time it takes to do a single launch, e.g. under Windows especially due to the Timeout Detection and Recovery (TDR), which kicks in on WDDM drivers after only 2 seconds.
TCC driver mode (not available on GeForce) and Linux work differently. I’m not a Linux user though.

There should be no problem with 2D launch sizes with millions of launch indices, like for 4K resolutions.
I have not used any such extreme 2D extents like your case though.

You say a 40,000 x 1 launch size works but a 1 x 40,000 size doesn’t?
The first one is most likely handled as 1D launch internally and will use the hardware resources optimally.
The second is probably handled as a 2D launch and OptiX will use a different warp shape for that which is not going to be efficient at all, if that really happens. I would need to check.
My recommendation would be to avoid widths below 8 in 2D launches. For example, you could launch with a 200x200 size instead and calculate the actual linear index to write to inside your ray generation program.
On the other hand, depending on the amount of work per launch index, 40,000 are likely too few threads to saturate a high-end GPU.
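To illustrate the 200x200 suggestion, here is a minimal host-side sketch in C++ of how a linear work count could be spread over a 2D launch and mapped back to a linear index. The names `LaunchSize`, `chooseLaunchSize` and `linearIndex` are hypothetical and not part of the OptiX API; in a real ray generation program the x and y values would come from the launch index, and the same arithmetic would apply.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: a 2D launch size (not an OptiX type).
struct LaunchSize {
    std::uint32_t width;
    std::uint32_t height;
};

// Pick a 2D launch size covering `total` work items with a given width.
// Widths below 8 are clamped to 8, following the recommendation above.
LaunchSize chooseLaunchSize(std::uint64_t total, std::uint32_t width) {
    if (width < 8) width = 8;
    // Round up so that width * height >= total.
    std::uint64_t height = (total + width - 1) / width;
    if (height == 0) height = 1;
    return { width, static_cast<std::uint32_t>(height) };
}

// Map a 2D launch index (x, y) to a linear index.
// Returns false for padding threads that lie beyond `total`,
// so the caller can skip the buffer write for them.
bool linearIndex(std::uint32_t x, std::uint32_t y, LaunchSize size,
                 std::uint64_t total, std::uint64_t& out) {
    out = static_cast<std::uint64_t>(y) * size.width + x;
    return out < total;
}
```

With 40,000 work items and a width of 200 this gives exactly 200x200. With a width that does not divide the total evenly, the last row contains padding threads, which is why the sketch reports whether the index is in range.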

Thanks for your answer, and sorry for my late reply.

Can anyone tell me if there is something like the Windows TDR on Linux?

Is there a way to find out how many threads the GPU can handle?

I also have another problem. As I said, a 40,000 x 1 launch worked last week, but it does not today, although I did not change the OptiX part. What’s more, last week it worked for sizes up to 9,000,000 (3,000 x 3,000), and now it does not work for a size of 1,000,000 (1,000 x 1,000).

Is there a tool to debug OptiX on Linux?


Sorry, there is not enough information to say what’s going on.
I’ll need to pass on Linux specific questions.

What changed on your system between last week and this week?

Did you try CUDA 10.0 as recommended?

Do the OptiX 6.0.0 SDK pre-compiled examples (if those exist under Linux) work on your system in that state?
Do the same examples work when you built them yourself?
Most of them allow resizing, so these allow running in different resolutions matching your failing case.

When there are size dependent issues, did you use the rtContextSetMaxTraceDepth and rtContextSetMaxCallableProgramDepth to set the minimum stack sizes necessary in your renderer?

The old rtContextSetStackSize API has no effect anymore in the RTX execution strategy!

How much VRAM is being used at the different sizes?
nvidia-smi, which is installed with the NVIDIA display driver, lets you dump that and other GPU information.

Apart from a few reboots, nothing changed on my system.

optixConsole does not work (it worked a month ago); all other examples run.

Yes, I set both maximums. And in the closest_hit program I do a test to make sure it does not go over this limit.

nvidia-smi says my program uses as much memory (535MB) for a 200 x 200 launch as for a 1,000 x 100 launch.

I installed CUDA 10.0, but my CMake system could only find 10.1. So I removed 10.1, and now CMake cannot find any CUDA library, although my main CMakeLists.txt contains “find_package(CUDA 10.0 REQUIRED)”.

Regarding this:

I installed CUDA 10.0, but my CMake system could only find 10.1. So I removed 10.1, and now CMake cannot find any CUDA library, although my main CMakeLists.txt contains "find_package(CUDA 10.0 REQUIRED)"

I installed CUDA from the runfile for my OS. You should be able to choose the path, which is usually something like /usr/local/cuda-10.0.

If you use the ncurses version of cmake (ccmake), you should be able to edit the entries and set the path of your CUDA libraries. Otherwise, you can also directly edit the CMakeCache.txt in the SDK folder to set the proper path.

Thanks Esteban for your feedback. I can’t find any download for CUDA 10.0 on the NVIDIA website.

Go to [url][/url]
Click on “High Performance Computing”.
Click on “Download Now”, that leads to [url][/url]
Click on the big green “Legacy Releases” button on the right.
Pick the CUDA Toolkit release you want.

Thanks, I’m not fully awake this morning :s

I uninstalled everything I found related to CUDA 10.1, and then installed everything for CUDA 10.0. But nvidia-smi still tells me that I have CUDA 10.1!!

Anyway, in /usr/local I only have a “cuda-10.0” directory, and CMake seems to use CUDA 10.0.

But I still have the same problem. I cannot launch big frames; 500 x 500 crashes.

EDIT: by default my PATH is set to “/usr/local/cuda-10.1/bin/:/usr/local/cuda-10.1/NsightSystems-2018.3/”

nvidia-smi reports which CUDA version the display driver supports.
It doesn’t know which CUDA toolkits you have installed.

You can install any number of CUDA toolkits side by side. They reside in different folders.
Which one you use is normally selected via the environment variable CUDA_PATH (under Windows at least).

Did you try more recent drivers than 418.67?
The OptiX implementation comes with the driver since OptiX 6.0.0. There have been bug fixes in each driver version.

I would not consider a 500x500 launch size as big.
The more interesting questions are

  • What is the expected runtime of a single launch?
  • How many rays are you tracing in a single launch?

Does it already fail with a launch of size zero?
That could indicate an issue with the acceleration structure builder.
(That wouldn’t be the problem if it works with smaller launch sizes.)
In that case, what is the scene complexity?
Number of geometric primitives in total?
Maximum number of primitives in one acceleration structure?
Depth of the scene hierarchy? (Maximum number of acceleration structures along a path from root to geometry node.)

You didn’t describe what your program does. If all other OptiX SDK examples run, there is nothing to analyze at this point without a minimal reproducer in the failing state.

I found the origin of the problem.

I made a minor change in my program, so small that I didn’t think the problem could come from it.

Previously, I sent the coordinates of the interaction back from the GPU. For this I mapped a float3 buffer. Instead of the coordinates, the client now wants the ID of the object hit in an interaction. I implemented it as follows:

When I create an object, I add the variable “obj_index” to it, as an unsigned int.


For the buffer that returns the data, I do:

optix::Buffer objHitIndex;
                                  MAX_DEPTH /* =10 */);

On the OptiX side, to declare both variables, I do:

rtDeclareVariable(unsigned int, obj_index, , );
rtBuffer<unsigned int, 3>   objHitIndex;

I can print the index in the “closest_hit” program without any problem, like this:

rtPrintf("obj_index: %d\n", obj_index);

The problems show up when I try to get it back:


When I comment out this line I can do runs up to 3,000 x 3,000; when it’s uncommented I can only do runs up to 200 x 200!

For info, the index I use to address the buffer is the same as for the other buffers, which work well.
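
As a possible way to narrow this down: CUDA error 700 (illegal address) is commonly caused by an out-of-range buffer access, and a 3D buffer adds a third index that can go wrong. The sketch below is plain C++ with hypothetical helper names (`Extent3D`, `bufferBytes`, `inBounds`), and it assumes the buffer is sized width x height x MAX_DEPTH, which the snippet above suggests but does not show in full. It computes the buffer's memory footprint and checks whether an (x, y, depth) index is valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of a 3D buffer's extents (not an OptiX type).
struct Extent3D {
    std::size_t width;
    std::size_t height;
    std::size_t depth;   // assumed to be MAX_DEPTH in the post above
};

// Memory footprint in bytes of a 3D buffer of unsigned int elements.
std::uint64_t bufferBytes(Extent3D e) {
    return static_cast<std::uint64_t>(e.width) * e.height * e.depth
           * sizeof(std::uint32_t);
}

// True if (x, y, z) addresses a valid element of the buffer.
// The same comparison could guard the write in the closest hit program.
bool inBounds(Extent3D e, std::size_t x, std::size_t y, std::size_t z) {
    return x < e.width && y < e.height && z < e.depth;
}
```

Under these assumptions a 3,000 x 3,000 x 10 buffer needs 360,000,000 bytes, against 1,600,000 bytes for 200 x 200 x 10, and a depth index equal to MAX_DEPTH is already one element past the end.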