CUDA/OptiX GPU Utilisation

I’m trying to profile and optimise our ray tracer, built with OptiX 5.0. In particular, we’re seeing almost no performance increase going from my laptop, which has a GeForce GTX 970M, to the production server, which has a Tesla K80. The OptiX documentation has some simple, useful bullet points on optimisation, but it’s really hard to find more detailed information on how the structure of an OptiX application may affect CUDA kernel performance.

After using the Visual Profiler, it seems quite clear that GPU utilisation is being limited by register usage. The megakernel uses 72 registers per thread, with the result that only 2 blocks execute concurrently out of a maximum of 32 (on my 970M). This may be why we’re seeing no speedup on devices with more CUDA cores. While looking into it I thought it might be because we’re performing recursive tracing as opposed to iterative, so I profiled some of the precompiled examples and found the same thing; the lowest I found was the path tracer example, at 64 registers per thread.

So really my questions are: is this a problem with the base megakernel? Is it preventing better GPU utilisation? And is it stopping us from seeing performance improvements on better GPUs? (I realise that one is potentially difficult to answer given the lack of info.) Is better GPU utilisation even possible with the OptiX API, or only with the OptiX Prime API?

A GPU utilisation guide for OptiX would be super useful.

First, you’re comparing two different GPU architectures: the GTX 970M is a Maxwell GM204, while the Tesla K80 carries two Kepler GK210 chips.
While that Kepler GPU is the top of the line of that generation, it’s still a GPU generation older.

Then you’re comparing single GPU vs. multi GPU.

If you used both devices on the K80, how does it perform when only using one of its devices?

If the scaling with two GPUs over that is not almost a factor of two, you’re probably limited by access to the pinned-memory input_output buffer. In that case, you can use RT_BUFFER_GPU_LOCAL buffers for the accumulation and, to reduce the PCI-E bandwidth requirements, write only the final result to an output buffer in pinned memory.
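A minimal host-side sketch of that split, using the optixu C++ wrapper from OptiX 5 (buffer/variable names and the float4 format are my assumptions, not taken from your code):

```cpp
// Per-GPU accumulation buffer: RT_BUFFER_GPU_LOCAL keeps a separate copy
// in device memory on each GPU, with no synchronization over PCI-E.
optix::Buffer accum = context->createBuffer(
    RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL,
    RT_FORMAT_FLOAT4, width, height);
context["accum_buffer"]->set(accum);

// Only the final result goes to a pinned-memory output buffer,
// written once per launch by the ray generation program.
optix::Buffer output = context->createBuffer(
    RT_BUFFER_OUTPUT, RT_FORMAT_FLOAT4, width, height);
context["output_buffer"]->set(output);
```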

Avoid float3 output and input_output buffers, especially in multi-GPU configurations!

I’m assuming the display of intermediate results is not factored into these measurements?
Because on the GTX 970M you could use OpenGL interop for that, but on boards running in TCC driver mode, or in any multi-GPU configuration, you cannot.

Iterative algorithms are normally better for the amount of stack space needed. The smaller the OptiX stack size value, the better. You need to determine the smallest working stack size value with each new OptiX version.

OptiX contains the same BVH traversal code as OptiX Prime, is more flexible to program, and offers a lot more built-in features (custom primitive intersections, any-hit continuations, more flexible scene layouts, more automatic multi-GPU support, motion blur, etc.), so it is suitable for a lot more use cases than OptiX Prime.
There is the optixRaycasting SDK example which shows how to use OptiX just for intersection testing.

I have written many iterative progressive path tracers and all of them got faster when switching to newer GPU generations. Try benchmarking on a Pascal or Volta GPU. Even there, the high end compute chips like on the Quadro GP100 will show superior ray tracing performance with OptiX.

OK, we’re going to do some testing on our Amazon P2 instance later and try to gather some information with one of the GPUs disabled, if possible.

Are there any examples of RT_BUFFER_GPU_LOCAL usage? And yeah, we don’t use any float3 buffers.

The display of results isn’t factored in; we just perform tracing and encode the resulting buffer to a file after all samples have been taken.

Is there a way to explicitly set the stack size?

We’re going to do some testing on an Amazon P3 instance, which boasts 4x Tesla V100 GPUs. Unfortunately, I think that’s the only other GPU generation available through Amazon EC2.

Thanks for all that info. We do have plans to switch to iterative tracing at some point, but what about the fact that we can only execute two blocks in parallel due to per-thread register usage? It seems that even if we make those optimisations, we’ll still be missing out on a lot of parallelism, if I understand correctly (which I might not). Or will the stack size directly affect that?

Yes, the optixVox example in the OptiX Advanced Samples on GitHub does accumulation to a GPU-local input_output buffer and then writes to a final output buffer.
There is a link to the OptiX Advanced Examples in the sticky post at the top of this sub-forum.
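The device side of that pattern looks roughly like this (a sketch of what optixVox does, not its actual code; `radiance` stands in for the float3 result of your trace):

```cpp
rtBuffer<float4, 2> accum_buffer;   // RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL
rtBuffer<float4, 2> output_buffer;  // RT_BUFFER_OUTPUT, pinned host memory
rtDeclareVariable(unsigned int, frame_number, , );
rtDeclareVariable(uint2, launch_index, rtLaunchIndex, );

// Inside the ray generation program, after tracing one sample:
float4 acc = (frame_number == 0) ? make_float4(0.0f)
                                 : accum_buffer[launch_index];
acc += make_float4(radiance, 1.0f);  // accumulate; .w counts samples
accum_buffer[launch_index] = acc;

// Resolve the running average into the pinned output buffer.
output_buffer[launch_index] = make_float4(make_float3(acc) / acc.w, 1.0f);
```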

Also see the multi-GPU discussions here, with the final conclusion that accumulation over RT_BUFFER_GPU_LOCAL is supposed to work.

“Is there a way to explicitly set the stack size ?”

Yes, rtContextSetStackSize().
You normally set that once when specifying how many entry points and ray types you use.
The standard way to figure out a minimal working stack size for OptiX is to start with a small value and increase it until launches run without stack overflow errors, then keep the smallest value that still works; you need to redo this with each new OptiX version.
It’s kind of annoying, and there are plans to change that whole stack size mechanism to something simpler in the future.
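That search can be sketched as a simple host-side loop (hypothetical; it assumes a stack overflow during the trial launch surfaces as a thrown optix::Exception, which depends on how your exception program and error checking are set up):

```cpp
// Start small and grow until a trial launch succeeds; the last value
// tried is then a minimal working stack size for this OptiX version.
RTsize stackSize = 1024;  // bytes; deliberately too small
for (;;) {
    context->setStackSize(stackSize);
    try {
        context->launch(0, width, height);  // trial launch
        break;                              // succeeded: stackSize is enough
    } catch (const optix::Exception&) {
        stackSize *= 2;                     // overflowed: grow and retry
    }
}
```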

The maximum number of registers used is hardcoded per GPU inside OptiX. The number of blocks is what it is; you shouldn’t need to care about that. If your launch size is big enough (well over 65536, e.g. >= 1 million rays per launch), that should use the GPUs to their full capacity.

“4x Tesla V100 GPU’s”
Awesome. You’re probably in for a big surprise about how fast things can go.
The accumulation strategy should definitely be changed before that to get the most out of any multi-GPU configuration.
Also, please try to minimize the stack size beforehand, because that Volta GPU has so many cores that it eats up a lot of VRAM if you’re not careful.

Thanks, that information on stack size is really helpful. We’re definitely going to make those changes to the accumulation.

Our launch size is 4096x2048, so we’re safely over that. Thanks, that was a big help.