Grid/block size and performance changes between OptiX 3.9.1 and 4.1.0

I’m porting some software from OptiX 3.9.1 / CUDA 7.5 / Visual Studio 2013 to OptiX 4.1.0 / CUDA 8.0.61 / Visual Studio 2015, and seeing a substantial performance hit (~50-100% execution time increase). Profiling a test case with Nsight (on a Quadro K1100M with driver version 376.62, Windows 10) provided the numbers below.
This particular example takes about 50% longer with OptiX 4.1.0. I’m wondering if some or all of that is related to having a larger block size and only one active block per SM. I’d like to experiment with that to see how it impacts performance, but as far as I know there’s no direct way to specify block sizes within OptiX. Is there anything I can do to influence OptiX’s block dimension computations? Initially at least this would just be an experiment, so the method doesn’t need to be very robust.
Also, for curiosity’s sake: I’m doing a 1D OptiX launch in this example, so why is OptiX generating 3D block dimensions internally? I was under the impression the 2D/3D blocks were just for convenience, so generating a 3D block for a 1D launch seems weird. Is there some performance advantage to 3D blocks?

OptiX 3.9.1:
Registers per Thread: 63
Threads/Block: 256
Grid Dimensions: {8,1,1}
Block Dimensions: {32,8,1}
Active Blocks (Per SM): 4
Achieved Occupancy: 50.00%

OptiX 4.1.0:
Registers per Thread: 63
Threads/Block: 1024
Grid Dimensions: {2,1,1}
Block Dimensions: {8,4,32}
Active Blocks (Per SM): 1
Achieved Occupancy: 49.99%
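
On the 3D block question: in the CUDA programming model, a multi-dimensional block is linearized as x + y*Dx + z*Dx*Dy, and warps are formed from 32 consecutive linear thread IDs. The sketch below emulates that mapping on the host for the two block shapes reported above. The `Dim3` struct and the helper names are illustrative, not part of CUDA or OptiX, and nothing here says how OptiX itself chooses or uses these dimensions.

```cpp
#include <cassert>

// Illustrative stand-in for CUDA's dim3 / threadIdx (hypothetical type).
struct Dim3 { unsigned x, y, z; };

// CUDA linearizes a thread index inside a block as x + y*Dx + z*Dx*Dy.
unsigned linearThreadId(Dim3 idx, Dim3 dim)
{
    assert(idx.x < dim.x && idx.y < dim.y && idx.z < dim.z);
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}

// Warps are formed from 32 consecutive linear thread IDs.
unsigned warpOf(Dim3 idx, Dim3 dim)
{
    const unsigned warpSize = 32;
    return linearThreadId(idx, dim) / warpSize;
}

// Total number of threads in a block of the given shape.
unsigned threadsPerBlock(Dim3 dim)
{
    return dim.x * dim.y * dim.z;
}
```

With a block of {8,4,32}, each 8x4 slice in z holds exactly 32 threads, so under this mapping every z-slice is one warp. With {32,8,1}, each row in y is one warp. Either way the hardware sees a flat list of warps, so the shape by itself should not change the thread count or warp count per block.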


With respect to the overall runtime performance, do you happen to have any OptiX exceptions enabled?
So far OptiX 4.x is slower than 3.9.1 when code is generated to handle exceptions.
Benchmarks should be run with exceptions and prints disabled for that reason.

OptiX controls all GPU scheduling. That abstracts the hardware-dependent behavior to allow the single-ray programming model and automatic multi-GPU support. Other than the launch sizes, there is no way to influence that.

I had been running with stack overflow and user exceptions enabled, but disabling them does not seem to significantly change the performance. Prints are also disabled.

Thinking that perhaps the older Quadro K1100M was part of the problem, I ran the same benchmark on a GeForce GTX 1070 (driver version 376.53, Windows 10). On that system OptiX 4.1.0 execution time is ~140% longer than OptiX 3.9.1.

The code I’m running is memory intensive, with large ray payloads and material calculations that require lots of support data. Any suggestions on things to try to get better performance with OptiX 4.1.0 for memory intensive applications? I did notice the precompiled whitted/optixWhitted example got slower in OptiX 4.1.0, but the example code changed so maybe that comparison isn’t meaningful.

The release notes for OptiX 3.9.1 don’t list CUDA 8.0 as supported (if I recall correctly, it hadn’t been released yet). Can OptiX 3.9.1 be used with CUDA 8.0, or would that be error prone?


OptiX 3.9.1 does not parse PTX code generated with CUDA 8.0. CUDA 7.5 would be the maximum supported version there.
I’ve been using CUDA 8.0 on OptiX 4.x for quite some time and it generates slightly better code.

We’re going to check the SDK examples for the performance difference.

Shameless marketing ahead. ;-)
Graphics boards with faster memory would help. The Quadro P6000 would be the most versatile (graphics, compute, VR) board with the most memory (24 GB GDDR5X), and the Quadro GP100 (16 GB HBM2) is a compute monster which is even faster for GPU raytracing.

From an algorithmic point of view, I’ve implemented an OptiX path tracer which shows one method to support the NVIDIA Material Definition Language (MDL).
That renderer currently uses an rtPayload of 256 bytes which is considered heavy. I use that per-ray data (“prd”) as main interface for input and output values between programs, plus additional structures for the current MDL state attributes (“State”) and material hierarchy configuration data (“Traversal”). See presentations below.

I tried to reduce the kernel code size as much as possible by using a single closest hit program and implementing all material- and lighting-relevant functions (the “fixed-function” building blocks) as bindless callable programs.

The material behaviour is constructed as a bindless callable program as well. That represents the possibly complex layered material configuration and routes the user defined input values defined in the MDL files to the fixed-function building blocks. The code for that is auto-generated at runtime per material shader.
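
As a rough illustration of that dispatch pattern, here is a plain C++ analogy: one "closest hit" routine that looks up a building block by ID in a table, similar in spirit to indexing a buffer of bindless callable program IDs. This is not OptiX code and not the renderer's actual implementation. All types, field names and the placeholder shading math are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Float3 { float x, y, z; };

// Hypothetical stand-ins for the "State" and material input data.
struct State          { Float3 normal; };
struct MaterialParams { Float3 tint; float roughness; };

// Signature shared by all "fixed-function" building blocks in this sketch.
using BsdfEval = Float3 (*)(const MaterialParams&, const State&);

// Placeholder math only; these are not real BSDF implementations.
Float3 evalDiffuse(const MaterialParams& p, const State& s)
{
    const float k = s.normal.z > 0.0f ? s.normal.z : 0.0f;
    return { p.tint.x * k, p.tint.y * k, p.tint.z * k };
}

Float3 evalSpecular(const MaterialParams& p, const State&)
{
    const float k = 1.0f - p.roughness;
    return { p.tint.x * k, p.tint.y * k, p.tint.z * k };
}

// A material stores only an ID into the callable table plus its inputs.
struct Material { int bsdfId; MaterialParams params; };

// Single entry point: the material decides which building block runs.
Float3 closestHit(const std::vector<BsdfEval>& callables,
                  const Material& m, const State& s)
{
    assert(m.bsdfId >= 0 &&
           static_cast<std::size_t>(m.bsdfId) < callables.size());
    return callables[static_cast<std::size_t>(m.bsdfId)](m.params, s);
}
```

The point of the indirection is that the hit routine stays small and fixed while the set of building blocks in the table can vary per scene.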

These two presentations from GTC 2016 and 2017 show some general overview of that renderer architecture and its capabilities:

Slide 12 in the GTC 2017 presentation contains a picture of the current renderer core programs. The many dark blue BSDF programs on the right exist twice, once for sampling and once for evaluation. I only download the ones which are actually used in a scene.
The code excerpts in the 2016 presentation are a little outdated. The current architecture uses only bindless callable programs, which was always the plan and allows a more flexible design, and the getter function code is more condensed now.

I’m seeing that same (or another) slowdown after upgrading from 3.9.1 to 4.1.0 / 4.1.1 on a GTX 960M. Not exactly the target GPU for OptiX, but still, it’s a bit sad that upgrading OptiX to use the newest CUDA version means taking a performance penalty.

I measured the performance using Nsight and the results are:
OptiX 3.9.1, CUDA 7.5: Grid: [20, 1, 1], Block: [32, 8, 1], Duration: 65μs, Occupancy: 50%, Registers: 64.
OptiX 4.1.1, CUDA 8.0: Grid: [10, 1, 1], Block: [8, 4, 14], Duration: 82μs, Occupancy: 43.75%, Registers: 72.
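
The occupancy figures reported in this thread can be roughly reproduced with a simplified theoretical occupancy model that only considers registers and resident warps. The sketch below is an approximation under stated assumptions, not what Nsight or OptiX computes: it assumes 65536 registers and 64 resident warps per SM (the published limits for compute capability 3.0 and 5.0), a per-warp register allocation granularity of 256, and it ignores shared memory. The struct and function names are made up for this example.

```cpp
#include <cassert>

// Per-SM limits; the values passed in by the caller are assumptions.
struct SmLimits {
    unsigned registers;  // 32-bit registers per SM
    unsigned maxWarps;   // resident warps per SM
    unsigned maxBlocks;  // resident blocks per SM
};

struct Occupancy {
    unsigned activeBlocks;  // blocks resident per SM
    double   fraction;      // resident warps / maxWarps
};

Occupancy estimateOccupancy(unsigned regsPerThread,
                            unsigned threadsPerBlock,
                            SmLimits sm)
{
    const unsigned warpSize     = 32;
    const unsigned regAllocUnit = 256;  // assumed allocation granularity per warp
    assert(regsPerThread > 0 && threadsPerBlock > 0);

    const unsigned warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;

    // Round the per-warp register footprint up to the allocation unit.
    unsigned regsPerWarp = regsPerThread * warpSize;
    regsPerWarp = (regsPerWarp + regAllocUnit - 1) / regAllocUnit * regAllocUnit;

    // Blocks that fit by register budget and by resident-warp budget.
    const unsigned blocksByRegs  = (sm.registers / regsPerWarp) / warpsPerBlock;
    const unsigned blocksByWarps = sm.maxWarps / warpsPerBlock;

    unsigned blocks = blocksByRegs < blocksByWarps ? blocksByRegs : blocksByWarps;
    if (blocks > sm.maxBlocks) blocks = sm.maxBlocks;

    return { blocks, double(blocks * warpsPerBlock) / sm.maxWarps };
}
```

Under these assumptions, 64 registers with 256-thread blocks gives 4 blocks per SM (32 of 64 warps, 50%), while 72 registers with 8*4*14 = 448-thread blocks gives 2 blocks per SM (28 of 64 warps, 43.75%). The same model gives 50% for both 63-register configurations from the first post (4 blocks of 256 threads, or 1 block of 1024). In this model the occupancy drop on the GTX 960M comes from the higher register count rather than from the block shape.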

I will be happy to provide the full reports or traces if they are of any interest.

/Asger Hoedt

Hi Asger, I’d be curious to see which scenes showed the slowdown for you, if you want to send a trace to optix-help or directly to one of the moderators.

Hey Dylan. Fabulous! I’ll try to find time to create one over the weekend. Do you need a trace per OptiX version, or just one? Also, can you tell me how to create a trace? I can’t remember the name of the environment variable or if I had to do anything but set it to some folder.

It was a Cornell Box with a sphere light and one of the cubes being metallic. I see it with both the BVH and TrBVH, but obviously the acceleration structure choice has little effect. ;)

I can also try a few of the sample scenes to see if I see the same slowdown there.


I would not be surprised to see performance regressions on very simple scenes, e.g., Cornell box. Same for the SDK samples. 4.1 should do better on more complex scenes with more geometry.

Arh great. Of course that sucks for all my small development scenes, but it’s great for all the real use cases.

I’ll give San Miguel a spin later and see how that behaves. :)