200ms GPU Stall on cuCtxCreate

I’m using the CUDA driver API in a Unity application. On my GeForce GTX 970 machine the application performs quite well; however, on a machine with a GeForce GTX 1050 Ti I get a consistent ~200 ms stall when the CUDA context is created.

I’ve tried the scheduling flags CU_CTX_SCHED_AUTO, CU_CTX_SCHED_SPIN, CU_CTX_SCHED_YIELD, and CU_CTX_SCHED_BLOCKING_SYNC, but they all produce the same stall. I’ve also tried passing 0 as the device, as well as passing the CUdevice returned by cuDeviceGet.

Replacing the context creation with a 300 ms sleep causes no FPS drop. This suggests the stall isn’t from the context creation simply taking a while, but from the context creation stalling Unity’s GPU thread.
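For reference, a minimal standalone version of the call being timed looks roughly like the sketch below (plain driver API in C; the flag choice, timing, and error handling are placeholders, since the real code runs inside the Unity plugin):

#include <cuda.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;

    cuInit(0);
    cuDeviceGet(&dev, 0);                  /* same effect as passing 0 directly */

    /* CU_CTX_SCHED_AUTO / _SPIN / _YIELD / _BLOCKING_SYNC all show the stall */
    clock_t t0 = clock();
    CUresult rc = cuCtxCreate(&ctx, CU_CTX_SCHED_AUTO, dev);
    clock_t t1 = clock();

    if (rc != CUDA_SUCCESS) {
        const char *msg = NULL;
        cuGetErrorString(rc, &msg);
        fprintf(stderr, "cuCtxCreate failed: %s\n", msg ? msg : "unknown");
        return 1;
    }

    printf("cuCtxCreate took %.1f ms\n", 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC);
    cuCtxDestroy(ctx);
    return 0;
}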

Any advice or suggestions would be more than welcome here, as I’ve been stuck on this for a few days now.

(1) What’s the stall time on the system with the GeForce 970?
(2) Are both machines running the same NVIDIA driver package and the same version of CUDA?
(3) What non-NVIDIA software differences exist between the machines (e.g. OS version)?
(4) What are the most important non-GPU hardware differences between the two systems (CPU, memory)?

There’s essentially no stall on the 970 machine, at least none that I can measure.

Both are using driver version 398.36 and CUDA 9.2. The 1050 Ti machine has GeForce Experience installed, but the 970 does not.

Both are Windows 10 x64. Almost all other software is the same.

The 970 machine has 16 GB of RAM and an Intel i5-4590 CPU.
The 1050 Ti machine has 8 GB of RAM and an Intel i7-8700K CPU. This machine also has another PCIe card in the x4 slot.

Both are desktops with the video card in the x16 slot.

Given that the machines are quite similarly configured, I don’t have any ideas that would explain the difference in behavior you are seeing. Interference on the PCIe interconnect from the second PCIe card in the GTX 1050 Ti machine seems a bit far-fetched as an explanation, though not impossible.

An interesting experiment might be to switch the GPUs between the two systems to see whether the effect correlates with the GPUs or the system.

I assume that you are either using a fat binary that contains machine code for both GPU architectures, or are building specifically for each GPU’s architecture on that particular machine, i.e. no JIT compilation occurs.
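For a fat binary covering both cards, the build would carry SASS for both architectures, along these lines (the GTX 970 is compute capability 5.2 and the GTX 1050 Ti is 6.1; the source and output names are just placeholders):

nvcc -gencode arch=compute_52,code=sm_52 -gencode arch=compute_61,code=sm_61 kernels.cu -o app

If only PTX (or the wrong architecture) is embedded, the driver JIT-compiles it when the module is loaded, which can easily cost a few hundred milliseconds.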

I also have the same problem.

I have ten or more processes that use CUDA on shared GPU memory, running as daemons, plus some utilities that start periodically.

When a utility starts, it calls cuCtxCreate to create its CUDA context, and this stalls the other running CUDA contexts for 200–250 milliseconds.

I have tried a GTX 1050 and a GTX 1070 - the situation is the same - a ~200 ms stall on context creation.

The same stalls happen during NVDEC and NVENC initialization by ffmpeg.

A simple solution was to use MPS, but in my case I hit the 16-connection limit.
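(In case it helps anyone else: on Linux the MPS control daemon is started roughly like this, and the 16-connection figure above is the documented pre-Volta limit of 16 client connections per GPU.)

export CUDA_VISIBLE_DEVICES=0          # GPU that MPS should manage
nvidia-cuda-mps-control -d             # start the MPS control daemon
# ... run the CUDA client processes ...
echo quit | nvidia-cuda-mps-control    # shut the daemon down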