OptiX Prime: disable automatic use of multiple GPUs?

Hi there, we are using OptiX Prime and are observing reduced performance when more than one CUDA device is installed in the system.

We tracked it down to OptiX Prime’s automatic distribution of work across multiple GPUs. We can alleviate it to some extent by calling rtpContextSetCudaDeviceNumbers() after creating the OptiX Prime context; however, we still observe unwanted memory allocations by OptiX Prime on all available CUDA devices.
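For reference, our call sequence looks roughly like the following sketch (the device ordinal 0 is just an example; error checking omitted):

```cpp
#include <optix_prime/optix_prime.h>  // OptiX Prime C API

void createPrimeContext()
{
    RTPcontext context = nullptr;
    // This call already touches all installed GPUs.
    rtpContextCreate(RTP_CONTEXT_TYPE_CUDA, &context);

    // Restrict subsequent work to device 0 only.
    const unsigned int deviceNumbers[] = { 0 };
    rtpContextSetCudaDeviceNumbers(context, 1, deviceNumbers);
}
```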

Is there a way to disable the automatic use of multiple GPUs, both in terms of memory and computation resources, when using OptiX Prime?

Setting the environment variable CUDA_VISIBLE_DEVICES achieves the desired effect; however, we would prefer not to rely on this workaround because, according to https://devblogs.nvidia.com/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/, it is meant for testing, not for production.
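For comparison, the workaround amounts to the following (the binary name is a placeholder):

```shell
# Hide all but the first GPU from the CUDA runtime for this process.
# Device ordinals follow the CUDA enumeration order.
CUDA_VISIBLE_DEVICES=0 ./our_prime_app
```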

Best regards

Just set the devices you want to use explicitly.

“However we still observe unwanted memory allocations on all available CUDA devices by OptiX Prime.”
Please explain. When is that happening? How much memory is allocated? How did you measure that?

Mind that OptiX Prime does not support the RT cores of Turing RTX hardware.

Please always list the following system configuration information when asking about OptiX issues:
OS version, installed GPU(s), VRAM amount, display driver version, OptiX major.minor.micro version, CUDA toolkit version used to generate the input PTX, host compiler version.

Here our setup:
OS: OpenSuse Leap 42.3
GPUs: 2x GeForce RTX 2070 with each 8 GB VRAM
Driver: 410.93
OptiX: 4.1.1
CUDA: 8.0
Compiler: GCC 5.4.0

As mentioned in my previous message, we are aware of rtpContextCreate() and rtpContextSetCudaDeviceNumbers(), and it helps with the performance issues.

I am looking at the output of nvidia-smi with a breakpoint set on the call to rtpContextCreate(). Before that call, no compute process is listed in nvidia-smi. Once the call to rtpContextCreate() completes, I see compute processes on both GPUs, each using 129 MiB of memory. I.e., it allocates memory on all GPUs before I can call rtpContextSetCudaDeviceNumbers().

Once I call rtpContextSetCudaDeviceNumbers(), the memory usage on the “unused” GPU drops to 115 MiB.
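For completeness, the numbers above can be reproduced with a query along these lines (field names may vary between driver versions):

```shell
# Per-process GPU memory: one line per compute process on each GPU.
nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv
```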

Thank you for the info. We will keep that in mind.

I’m surprised that the Turing boards work at all with OptiX 4.1.1. That is from July 2017.
Nothing will happen on that and you’re missing out on lots of performance improvements.

Please try the most current versions of OptiX 5 and especially OptiX 6, which added Turing support.
As always, please read the release notes before setting up an OptiX development environment.
Not all combinations of OptiX, CUDA and host compilers are compatible.

“I.e. it allocates memory on all GPUs before I can call rtpContextSetCudaDeviceNumbers().”

My first guess is that this is the native CUDA context initialization itself. I don’t know if OptiX Prime latches onto an existing CUDA context the way OptiX does. You could try creating one for a single device and see if OptiX Prime uses it.
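A minimal sketch of that experiment, assuming the CUDA runtime API (cudaFree(0) is a common idiom to force creation of the primary context on the current device; whether OptiX Prime picks it up is exactly what needs to be tested):

```cpp
#include <cuda_runtime.h>

// Try to establish a CUDA context on device 0 only, before OptiX Prime
// gets a chance to initialize contexts on all devices.
void initSingleDeviceContext()
{
    cudaSetDevice(0);  // select the device we actually want to use
    cudaFree(0);       // no-op that forces primary context creation
    // ...then call rtpContextCreate() and check nvidia-smi again.
}
```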

Especially on Turing RTX boards, using OptiX 6.0.0 (not Prime) with ray casting and hardware triangle intersection is going to be more efficient than OptiX Prime.

Thank you for your reply. At the moment I am not able to try newer versions of OptiX and switching from OptiX Prime to OptiX is not an option at this point unfortunately.

I have tried calling cudaSetDevice() prior to the call to rtpContextCreate(), but to no avail.

What I did instead was to call cudaDeviceReset() on all devices that I do not want to use. This frees up the resources and seems to work so far.
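In case it helps others, the cleanup looks roughly like this (assuming device 0 is the one we keep; error checking omitted):

```cpp
#include <cuda_runtime.h>

// After restricting OptiX Prime via rtpContextSetCudaDeviceNumbers(),
// release the leftover contexts on the devices we do not use.
void resetUnusedDevices()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int d = 1; d < deviceCount; ++d)  // keep device 0
    {
        cudaSetDevice(d);
        cudaDeviceReset();  // destroys the primary context on device d
    }
    cudaSetDevice(0);  // restore the device we keep using
}
```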

Is this safe, i.e. are there any guarantees that OptiX Prime will only use memory and computation resources of the devices that are listed in rtpContextSetCudaDeviceNumbers()?