CUDA won't concurrently run kernels on multiple devices from within the same process

It seems it is not possible to run kernels independently and in parallel on a multi-GPU system from a single process using multiple threads. Doing the same with multiple processes works fine. Note that the issue manifests at random times: sometimes I need to run the program a couple of times before it appears. Once it does appear, it does not go away (though how much it affects runtime varies).

I originally reported the issue as TensorRT unnecessary synchronization in multi-GPU system, but it appears not to be TensorRT-specific.

Environment

TensorRT Version: 8.5.2.2
GPU Type: 10x RTX A4000
Nvidia Driver Version: 525.60.11
CUDA Version: 11.8
CUDNN Version: 8.7.0
Operating System + Version: Ubuntu 22.04.1 LTS
Baremetal or Container (if container which image + tag): Baremetal

System

CPU:
  Info: 2x 20-core model: Intel Xeon Silver 4316 bits: 64 type: MT MCP SMP cache:
    L2: 2x 25 MiB (50 MiB)
  Speed (MHz): avg: 916 min/max: 800/3400 cores: 1: 801 2: 801 3: 800 4: 800 5: 800 6: 800
    7: 801 8: 801 9: 801 10: 801 11: 800 12: 801 13: 800 14: 801 15: 801 16: 800 17: 801 18: 801
    19: 800 20: 801 21: 801 22: 800 23: 801 24: 801 25: 800 26: 801 27: 801 28: 801 29: 2302
    30: 801 31: 801 32: 1890 33: 801 34: 801 35: 1085 36: 1403 37: 801 38: 1413 39: 1243 40: 801
    41: 800 42: 801 43: 801 44: 801 45: 801 46: 801 47: 801 48: 801 49: 801 50: 1648 51: 800
    52: 800 53: 800 54: 2301 55: 800 56: 800 57: 801 58: 801 59: 801 60: 801 61: 801 62: 801
    63: 801 64: 802 65: 800 66: 801 67: 801 68: 801 69: 1416 70: 801 71: 801 72: 1758 73: 800
    74: 800 75: 846 76: 801 77: 801 78: 1440 79: 897 80: 801
Graphics:
  Device-1: ASPEED Graphics Family driver: ast v: kernel
  Device-2: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-3: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-4: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-5: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-6: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-7: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-8: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-9: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-10: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-11: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Display: server: No display server data found. Headless machine? tty: 238x51

MRE

I modified the matrixMul sample from the CUDA samples (see the attached code). The modified version runs one worker thread per GPU and executes kernels from all of them concurrently.
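
The attached code is the modified matrixMul itself, but the structure boils down to the pattern below. This is a minimal sketch, not the attachment verbatim: busyKernel, the buffer size, and the launch configuration are placeholders standing in for the sample's MatrixMulCUDA setup.

#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Placeholder kernel standing in for MatrixMulCUDA in the attached code.
__global__ void busyKernel(float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = out[i];
    for (int k = 0; k < 1000; ++k) v = v * 1.0000001f + 0.5f;
    out[i] = v;
  }
}

// One worker per GPU: bind the device, then launch back-to-back kernels
// on a private stream.
void worker(int dev, int iters) {
  cudaSetDevice(dev);
  const int n = 1 << 20;
  float *buf = nullptr;
  cudaMalloc(&buf, n * sizeof(float));
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  dim3 block(256), grid((n + block.x - 1) / block.x);
  for (int it = 0; it < iters; ++it)
    busyKernel<<<grid, block, 0, stream>>>(buf, n);
  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  cudaFree(buf);
}

int main() {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  std::vector<std::thread> threads;
  for (int d = 0; d < ndev; ++d)
    threads.emplace_back(worker, d, 10000);  // 10k iterations per GPU
  for (auto &t : threads) t.join();
  printf("DONE\n");
  return 0;
}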

Single-GPU

This runs a single worker thread on a single GPU to establish a baseline:

CUDA_VISIBLE_DEVICES=0 ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
LAUNCH!
Performance= 1445.88 GFlop/s, Time= 0.091 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE

Runtime is around 0.091 msec.
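
For context, the Time= figure is the average per-launch kernel time: the sample brackets the launch loop with CUDA events and divides the elapsed time by the iteration count. A rough sketch of that pattern, assuming the stock matrixMul timing structure (and reusing busyKernel from the sketch above):

// The "Time=" figure is msecTotal / nIter: average milliseconds per launch,
// measured with CUDA events around the whole launch loop.
float timeKernelLoop(cudaStream_t stream, int nIter, float *buf, int n) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  dim3 block(256), grid((n + block.x - 1) / block.x);
  cudaEventRecord(start, stream);
  for (int it = 0; it < nIter; ++it)
    busyKernel<<<grid, block, 0, stream>>>(buf, n);  // from the sketch above
  cudaEventRecord(stop, stream);
  cudaEventSynchronize(stop);        // wait for the whole loop to finish
  float msecTotal = 0.0f;
  cudaEventElapsedTime(&msecTotal, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return msecTotal / nIter;          // reported as "Time="
}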

Multi-threaded, multi-GPU

This one runs one worker thread per GPU, each launching the kernel for 10k iterations as fast as possible. Even though the GPUs should operate independently, there appears to be cross-thread synchronization, as the average kernel runtime increases by about 4x:

./matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
Computing result using CUDA Kernel...
done
done
Computing result using CUDA Kernel...
Computing result using CUDA Kernel...
done
done
LAUNCH!
Performance= 413.12 GFlop/s, Time= 0.317 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Performance= 408.03 GFlop/s, Time= 0.321 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Performance= 398.93 GFlop/s, Time= 0.329 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Performance= 391.88 GFlop/s, Time= 0.334 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Performance= 393.24 GFlop/s, Time= 0.333 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Performance= 389.40 GFlop/s, Time= 0.337 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Performance= 392.55 GFlop/s, Time= 0.334 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Result = PASS
Performance= 390.90 GFlop/s, Time= 0.335 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: 
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Performance= 388.16 GFlop/s, Time= 0.338 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Performance= 386.39 GFlop/s, Time= 0.339 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE

Runtime is around 0.33 msec (about 4x slower).

This is what the trace looks like (see the attached screenshot):

Multi-process, multi-GPU

This does the same thing as above, except that instead of running workers in threads it starts one process per GPU, each running the kernel for 10k iterations (parallel across processes). This works:

CUDA_VISIBLE_DEVICES=0 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=1 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=2 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=3 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=4 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=5 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=6 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=7 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=8 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=9 CUDA_MODULE_LOADING=LAZY ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
Computing result using CUDA Kernel...
done
done
Computing result using CUDA Kernel...
done
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
Performance= 1447.73 GFlop/s, Time= 0.091 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
Performance= 1432.36 GFlop/s, Time= 0.092 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
Performance= 1450.05 GFlop/s, Time= 0.090 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
Performance= 1469.87 GFlop/s, Time= 0.089 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
Performance= 1456.77 GFlop/s, Time= 0.090 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
Performance= 1462.62 GFlop/s, Time= 0.090 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
Performance= 1426.73 GFlop/s, Time= 0.092 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
Performance= 1436.94 GFlop/s, Time= 0.091 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Performance= 1427.04 GFlop/s, Time= 0.092 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
DONE
Performance= 1443.78 GFlop/s, Time= 0.091 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE

Runtime is around 0.09 msec (the same as the single-GPU baseline, i.e. the fastest possible).

The trace (see the attached screenshot):

Further observations

After some more testing, it seems it generally takes around 3 or 4 runs before the issue starts arising. I've also found that:

  • The runtime penalty introduced by this issue varies from a factor of 2x to 20x (!) with ten GPUs.
  • The more GPUs in the system, the more contention and the worse the runtime.
  • When a process is also issuing other short-running kernels, a thread running many longer-running kernels can be starved, completely halting the program.
  • Locking happens not only during inference but also during other operations such as allocations. In TensorRT, for example, this manifests as model loading taking minutes because it is constantly interrupted by other work. (A minimal host-side timing probe that makes this blocking visible is sketched below.)
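
A simple way to observe the blocking on the host side is to wall-clock individual launches from each worker thread: kernel launches are asynchronous and should return in microseconds, so if launches are serializing across threads they show up as millisecond-scale outliers. A hedged sketch of such a probe (the no-op kernel, iteration count, and 1 ms outlier threshold are arbitrary placeholders, not part of the attached code):

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void noopKernel() {}

// Wall-clock each launch from this thread. A launch that takes milliseconds
// to return blocked in the runtime/driver instead of returning async.
void probe(int dev, int iters) {
  cudaSetDevice(dev);
  cudaFree(0);                          // force context creation up front
  using clk = std::chrono::steady_clock;
  for (int it = 0; it < iters; ++it) {
    auto t0 = clk::now();
    noopKernel<<<1, 1>>>();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                  clk::now() - t0).count();
    if (us > 1000)                      // arbitrary 1 ms outlier threshold
      printf("dev %d: launch %d blocked for %lld us\n", dev, it,
             (long long)us);
  }
  cudaDeviceSynchronize();
}

int main() {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  std::vector<std::thread> threads;
  for (int d = 0; d < ndev; ++d) threads.emplace_back(probe, d, 10000);
  for (auto &t : threads) t.join();
  return 0;
}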