[Context: host PC with Windows 10, program compiled with CUDA 10.2 for compute capabilities up to 7.5 (with PTX), running on an 11.2 driver on an RTX 3090; the secondary GPU is an old GeForce GTX 950]
I have a strange problem that I am currently investigating, and I am trying to gather relevant information.
I have a scientific program using CUDA that runs at ~200 fps (on an RTX 3090). It is a single-GPU program: it starts with cudaSetDevice(0) and uses several CUDA streams and synchronization.
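For context, the startup and per-frame structure looks roughly like the sketch below (a simplified illustration, not the real code; the kernel, buffer size, and stream count are placeholders):

```cpp
#include <cuda_runtime.h>

__global__ void processKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder for the real per-frame work
}

int main()
{
    // Single-GPU program: everything runs on the RTX 3090 (device 0)
    cudaSetDevice(0);

    const int kNumStreams = 4;    // placeholder stream count
    cudaStream_t streams[kNumStreams];
    for (int s = 0; s < kNumStreams; ++s)
        cudaStreamCreate(&streams[s]);

    const int n = 1 << 20;        // placeholder buffer size
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    // Per-frame loop (~200 fps when only the RTX 3090 is enabled)
    for (int frame = 0; frame < 1000; ++frame)
    {
        for (int s = 0; s < kNumStreams; ++s)
            processKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_data, n);

        // These are the calls that appear slower once the GTX 950 is present
        for (int s = 0; s < kNumStreams; ++s)
            cudaStreamSynchronize(streams[s]);
    }

    for (int s = 0; s < kNumStreams; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    return 0;
}
```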
Recently, I added a second CUDA GPU to the host (a GTX 950), which is recognized as CUDA device 1.
But after that, the original program runs at only ~100 fps.
If I disable the second GPU in the Windows Device Manager, the program runs at ~200 fps again.
This makes no sense to me, but I have to find out what is happening.
After some profiling, it seems that the calls to cudaStreamSynchronize() are slower (to be confirmed: I am not yet very familiar with Nsight Systems).
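To cross-check this outside of Nsight Systems, I plan to time the synchronize calls on the host with std::chrono, roughly like this (a hypothetical measurement helper, not something already in the program), and compare the numbers with the GTX 950 enabled and disabled:

```cpp
#include <cuda_runtime.h>
#include <chrono>

// Hypothetical helper: wraps cudaStreamSynchronize() and returns the wall-clock
// time (in milliseconds) spent blocking, so the two hardware configurations
// can be compared per stream and per frame.
static float timedStreamSync(cudaStream_t stream)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaStreamSynchronize(stream);
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<float, std::milli>(t1 - t0).count();
}

// Intended usage inside the frame loop:
//   float ms = timedStreamSync(streams[s]);
//   printf("sync on stream %d took %.3f ms\n", s, ms);
```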
To be clear, the second GPU is unused: it is only queried for its properties at the beginning of the program, in order to display information. It could be selected later (so that cudaSetDevice(1) would be used), but that is currently not the case.
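The only interaction with device 1 is roughly the following enumeration at startup (a simplified sketch of what the program does; the function name is just for illustration):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// At startup: enumerate all CUDA devices and print their properties for display.
// No cudaSetDevice(1) is ever issued; all work stays on device 0.
void printDeviceInfo()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, CC %d.%d, %zu MB\n",
               dev, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024 * 1024));
    }
}
```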
Certainly not relevant: in the Windows Device Manager, the GTX 950 is listed above the RTX 3090, while their respective CUDA device IDs are 1 (GTX 950) and 0 (RTX 3090).
The monitor is plugged into the RTX 3090.
How can I track down the problem? Are there any known multi-GPU pitfalls to handle?