Strange performance regression with a single-GPU context on a multi-GPU host

[Context: Host PC with Windows 10; program compiled with CUDA 10.2 up to compute capability 7.5 with PTX; running on the 11.2 driver on an RTX 3090; the secondary GPU is an old GeForce GTX 950]

I have a strange problem that I am currently investigating and gathering relevant information about.
I have a scientific program using CUDA that runs at ~200 fps (on an RTX 3090). It is a single-GPU program, starting with cudaSetDevice(0), and using several CUDA streams and synchronization.

Recently, I added a second CUDA GPU to the host (a GTX 950), which is recognized as CUDA device 1.
After that, the original program runs at only ~100 fps.

If I disable the second GPU in the Windows Device Manager, the program runs at ~200 fps again.

This makes no sense to me, but I need to find out what is happening.

After some profiling, it seems that the calls to cudaStreamSynchronize() are slower (to be confirmed: I am not yet very familiar with Nsight Systems).

Really, the second GPU is unused: it is only queried for its properties at the beginning of the program, in order to display information. It could be selected (so that cudaSetDevice(1) could be used), but that is currently not the case.
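For completeness, that startup query is nothing more than an enumeration loop; a minimal sketch of what it looks like (not my exact code, names are illustrative):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Sketch of the startup enumeration: each device is queried for its
// properties purely to display information; only device 0 is made active.
void listCudaDevices()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("CUDA device %d: %s (compute capability %d.%d)\n",
               dev, prop.name, prop.major, prop.minor);
    }
    cudaSetDevice(0);   // the RTX 3090; device 1 (the GTX 950) is never selected
}
```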

Certainly not relevant: in the Windows Device Manager, the GTX 950 is listed above the RTX 3090, while their respective CUDA device ids are 1 (GTX 950) and 0 (RTX 3090).

The monitor is plugged to the RTX 3090.

How can I track down the problem? Are there any known multi-GPU pitfalls to handle?

The RTX 3090 has compute capability 8.6, and therefore should be used with CUDA 11. Is there a particular reason you are not doing that? The CUDA 10 toolchain cannot produce machine code for Ampere-class GPUs.

I cannot recall ever having encountered a scenario on Windows that matches your observations. I have a setup right now that is somewhat similar, in that I have two GPUs in my Windows 10 system, where the newer, more powerful GPU also drives the display.

The first thing I would do if this were my system would be to update to the latest CUDA version and the latest NVIDIA driver package (460 driver family or newer).

You state the performance of your software in fps and round numbers. Is that because this program produces graphics output? And are these round numbers due to rounding of the actual frame rates for the purposes of the post, or due to the graphics syncing to the monitor refresh rate? If the latter, what happens if you disable sync?

The RTX 3090 has compute capability 8.6, and therefore should be used with CUDA 11.

The program is already compiled; it was built for compute capability 7.5, which was the latest at the time it was created. Thanks to the embedded PTX, it can run on the Windows 10 machine with the latest CUDA driver (11.2). There is no problem here. Perhaps a performance boost could be observed if the program were recompiled for 8.6, but that is not the point here.

I cannot recall ever having encountered a scenario on Windows that matches your observations. I have a setup right now that is somewhat similar, in that I have two GPUs in my Windows 10 system, where the newer, more powerful GPU also drives the display.

And I can’t find anything related over the internet. I am puzzled.

The first thing I would do if this were my system would be to update to the latest CUDA version and the latest NVIDIA driver package (460 driver family or newer).

The drivers are up-to-date thanks to GeForce Experience.

You state the performance of your software in fps and round numbers. Is that because this program produces graphics output?

No, it processes images, so fps is a common measure for my use cases, but everything is done in memory buffers; there is no rendering.

I find it very peculiar that performance was cut exactly in half. A correlation with display refresh is the only thing I can think of that would cause such an effect, where this can happen even when actual code run times differ only slightly between the two versions.

I have no plausible working hypothesis at this time. I assume you have already performed obvious sanity checks, e.g. making sure the timing framework is working correctly. One more data point would be what happens when you hide the second GPU with the CUDA_VISIBLE_DEVICES environment variable.

I find it very peculiar that performance was cut exactly in half.

My fault: it is not that precise, because in fact the timings are not very stable. The 200 fps really means ~200-215 fps, while the 100 fps is somewhere between 100 and 120 fps. I simplified at first because I wanted to make it clear that it was a very real performance regression.

One more data point would be what happens when you hide the second GPU with the CUDA_VISIBLE_DEVICES environment variable.

Interesting, I did not know that trick; I will try it.
Meanwhile, I will also try to reproduce the issue with a minimal program.
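For the minimal program, I am thinking of something along these lines (a sketch; as far as I understand, the variable has to be set before the first CUDA call, and _putenv_s is the Windows way to do that programmatically):

```cpp
#include <cstdio>
#include <stdlib.h>
#include <cuda_runtime.h>

int main()
{
    // Hide the GTX 950 from the CUDA runtime. The variable is read when the
    // runtime initializes, so it must be set before any CUDA API call
    // (equivalent to setting it in the environment before launching).
    _putenv_s("CUDA_VISIBLE_DEVICES", "0");

    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Visible CUDA devices: %d\n", count);   // expect 1 with the variable set

    // ... then reproduce the streams + cudaStreamSynchronize() pattern here ...
    return 0;
}
```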

I assume you have excluded the possibility that in the slower runs the application is actually running on the GTX 950? My favorite go-to tool for monitoring GPU activity under Windows is TechPowerUp’s GPU-Z (free download).

Your assumption is right: I looked at the Windows 10 GPU activity monitor, and I always have GPU-Z open to make sure PCI Express is used at full speed.
With the GTX 950 the program runs much slower.

Another aspect worth checking: Is the data produced by slow and fast runs bit-wise identical?

Speaking of PCIe transfers: check that adding a second GPU does not reduce the PCIe configuration to x8 due to an insufficient number of PCIe lanes provided by the CPU. To operate two GPUs with PCIe x16 interfaces obviously requires >= 32 (commonly 40) PCIe lanes provided by the CPU.
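If you prefer checking this programmatically rather than with a GUI tool, an NVML query along these lines should do (a sketch; NVML ships with the driver, and the program must link against nvml.lib):

```cpp
#include <cstdio>
#include <nvml.h>

// Sketch: report current vs. maximum PCIe link width for every GPU, so a
// drop from x16 to x8 after adding the second card becomes visible.
int main()
{
    if (nvmlInit() != NVML_SUCCESS) return 1;

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        unsigned int cur = 0, max = 0;
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));
        nvmlDeviceGetCurrPcieLinkWidth(dev, &cur);
        nvmlDeviceGetMaxPcieLinkWidth(dev, &max);
        printf("%s: PCIe x%u (max x%u)\n", name, cur, max);
    }
    nvmlShutdown();
    return 0;
}
```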

It produces large floating-point data fields, so for now I have only compared visual representations with false colors, and they look the same.

What’s the CPU in this system? Does it provide enough PCIe lanes to operate two GPUs with x16 interfaces, or does plugging in two GPUs result in both operating with x8 interfaces?

The point of checking for bit-wise identical results was to establish whether slow and fast runs actually perform identical computation (which by all means they should).
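What I have in mind is something like this (a sketch; buffer and function names are placeholders): copy the result field back to the host and compare it bit-wise against a dump from a known-good run, since a false-color view can hide small differences.

```cpp
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Sketch: bit-wise comparison of a device result field against a reference
// captured from a fast (known-good) run.
bool bitwiseIdentical(const float* d_result, const std::vector<float>& reference)
{
    std::vector<float> host(reference.size());
    cudaMemcpy(host.data(), d_result, host.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    return std::memcmp(host.data(), reference.data(),
                       host.size() * sizeof(float)) == 0;
}
```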

Good news: I found the cause. Not the definitive explanation, but a very big hint.
-First, I will not answer the CPU/motherboard question, because it would just add noise to the thread now that I have identified the stumbling block. To summarize: yes, the CPU and motherboard have been dimensioned for such usage.
-Second, setting CUDA_VISIBLE_DEVICES to 0, 1, "0,1" and "1,0" only reinforced the observation. With both GPUs enabled in the Windows Device Manager, CUDA_VISIBLE_DEVICES=0 gave the fast run, CUDA_VISIBLE_DEVICES=0,1 the slow run, and CUDA_VISIBLE_DEVICES=1 a very slow run (as expected, since in the latter case it was running on the GTX 950).

Now, the big hint, with the default mode (equivalent to "0,1", resulting in the slow run):
-Nsight Systems confirmed that the GTX 950 was unused for computations, but strangely, a few MB were allocated on it at benchmark start.
-I identified a call to cudaMallocManaged() in the CUDA code.
-After replacing cudaMallocManaged() with cudaMalloc() (plus the explicit transfers sketched below), all performance problems went away: I had my fast run as expected.
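For reference, the change amounts to something like this (a simplified sketch with made-up names; the real code performs its explicit transfers around the kernels anyway):

```cpp
#include <cuda_runtime.h>

// Before (slow with two GPUs visible): managed allocation, migrated on demand.
//   float* d_field = nullptr;
//   cudaMallocManaged(&d_field, numElems * sizeof(float));
//
// After (fast again): plain device allocation on the active device (0),
// with an explicit copy where the device needs the host data.
void allocateField(float** d_field, const float* h_field, size_t numElems)
{
    cudaMalloc(reinterpret_cast<void**>(d_field), numElems * sizeof(float));
    cudaMemcpy(*d_field, h_field, numElems * sizeof(float),
               cudaMemcpyHostToDevice);
}
```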

Indeed, if memory was allocated on GPU 1 instead of GPU 0, a big memory-transfer overhead occurred, resulting in the performance loss.
So I have a fix, but I lack an explanation. Since CUDA device 1 is never made the active device, why would cudaMallocManaged() allocate on that device?

Without knowing the motherboard and system topology, it's not possible to be definitive. However, managed memory may behave differently in a multi-GPU setup, even if the "2nd GPU" isn't used. This is documented.

To pick an example, if your GPUs are not able to be in a peer relationship, then UM will create managed allocations in host pinned memory ("zero-copy"). This can very often create a significant negative performance impact, as you might imagine. I'm not suggesting this is a precise description of what is happening in your case, but in general your observation does not surprise me. The usual suggestion here is the one you already received: use CUDA_VISIBLE_DEVICES.
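If you want to see which regime you are in, you can query the relevant properties directly; a sketch (it only probes capabilities, it does not change any behavior):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: check whether the two GPUs can be peers and whether each one
// reports concurrent managed access. On Windows WDDM and/or without peer
// capability, managed allocations may end up in a slower placement.
int main()
{
    int peer01 = 0, peer10 = 0;
    cudaDeviceCanAccessPeer(&peer01, 0, 1);
    cudaDeviceCanAccessPeer(&peer10, 1, 0);
    printf("peer access 0->1: %d, 1->0: %d\n", peer01, peer10);

    for (int dev = 0; dev < 2; ++dev) {
        int concurrent = 0;
        cudaDeviceGetAttribute(&concurrent,
                               cudaDevAttrConcurrentManagedAccess, dev);
        printf("device %d concurrentManagedAccess: %d\n", dev, concurrent);
    }
    return 0;
}
```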