I’ve been trying to diagnose some difficult performance problems on a dual-CPU system with 10x RTX A4000 GPUs. There seem to be multiple issues causing lower-than-expected performance (see my earlier topic: Multi-GPU contention inside CUDA). After a lot of debugging I’ve identified one of the underlying problems, which is related to memory copies (both H->D and D->H): in general, memory throughput drops far below expectations as the load increases. The workload is a TensorRT model driven from two threads per GPU (20 threads in total in this case).
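For reference, the per-thread copy pattern looks roughly like the sketch below. This is a minimal stand-in, not the actual application: the transfer size, iteration count and the placement of the TensorRT call are placeholders; the point is simply that each worker thread owns a pinned staging buffer and its own stream.

```cpp
// Minimal sketch of the per-thread copy pattern (not the real application).
// Each worker thread: pinned host buffer + device buffer + own stream,
// then a loop of H->D copy, (inference), D->H copy.
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>
#include <vector>

#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err));                     \
            return;                                               \
        }                                                         \
    } while (0)

void worker(int device, size_t bytes, int iterations) {
    CHECK(cudaSetDevice(device));

    void *hostBuf = nullptr, *devBuf = nullptr;
    CHECK(cudaMallocHost(&hostBuf, bytes));   // pinned host memory
    CHECK(cudaMalloc(&devBuf, bytes));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    for (int i = 0; i < iterations; ++i) {
        CHECK(cudaMemcpyAsync(devBuf, hostBuf, bytes,
                              cudaMemcpyHostToDevice, stream));
        // TensorRT inference (context->enqueueV2/enqueueV3) would run here.
        CHECK(cudaMemcpyAsync(hostBuf, devBuf, bytes,
                              cudaMemcpyDeviceToHost, stream));
        CHECK(cudaStreamSynchronize(stream));
    }

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(devBuf));
    CHECK(cudaFreeHost(hostBuf));
}

int main() {
    const int gpus = 1;            // 1 GPU in the low-load case, 10 at full load
    const int threadsPerGpu = 2;   // two worker threads per GPU
    const size_t bytes = 16 << 20; // 16 MiB per transfer (placeholder)

    std::vector<std::thread> threads;
    for (int d = 0; d < gpus; ++d)
        for (int t = 0; t < threadsPerGpu; ++t)
            threads.emplace_back(worker, d, bytes, 100);
    for (auto &t : threads) t.join();
    return 0;
}
```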
Low concurrency, low load
This is part of a trace from a run on a single GPU, with very light CPU and GPU utilization:
GPU utilization is between 30% and 70%, CPU less than 5%.
The red parts show D->H copies. The throughput varies a bit: the slowest copy is 2.4 GiB/s, the fastest 6.5 GiB/s. Though somewhat inconsistent, these results are generally within expected parameters (I think).
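The numbers the trace reports can be cross-checked with a standalone single-copy measurement along these lines (again just a sketch; the 64 MiB transfer size is arbitrary, not taken from the model's bindings):

```cpp
// Sketch: time one pinned D->H copy with CUDA events and report GiB/s.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 64ull << 20;      // 64 MiB (arbitrary)
    void *host = nullptr, *dev = nullptr;
    cudaMallocHost(&host, bytes);          // pinned, so the copy can use DMA
    cudaMalloc(&dev, bytes);

    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);  // warm-up

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gib = (double)bytes / (1ull << 30);
    printf("D->H: %.2f GiB/s\n", gib / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```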
High concurrency, medium load
This is the trace I get when running a medium load distributed over 10 GPUs (I used MPS here to rule out GPU context switching):
Note that the slowest copy is now around 690 MiB/s and the fastest around 4.7 GiB/s, so memory transfer speeds are already decreasing, even though the load per GPU is lower than in the previous example. A standalone pure-copy test to reproduce this is sketched after the notes below.
Notes:
- GPU utilization is between 20% and 50%, hardly ever above 50%.
- CPU utilization is around 25%.
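To rule out TensorRT and MPS as factors, the same degradation can be looked for with a pure-copy test: two threads per visible GPU, each repeatedly timing a pinned D->H copy. Only a sketch; transfer size and iteration count are arbitrary.

```cpp
// Sketch of a pure-copy contention test, independent of TensorRT and MPS:
// two threads per visible GPU time pinned D->H copies and print throughput.
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>
#include <vector>
#include <chrono>

void copyWorker(int device, int tid, size_t bytes, int iterations) {
    cudaSetDevice(device);
    void *host = nullptr, *dev = nullptr;
    cudaMallocHost(&host, bytes);
    cudaMalloc(&dev, bytes);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int i = 0; i < iterations; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        printf("gpu %d thread %d: %.2f GiB/s\n",
               device, tid, (double)bytes / (1ull << 30) / s);
    }
    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    const size_t bytes = 32ull << 20;   // 32 MiB per copy (arbitrary)

    std::vector<std::thread> threads;
    for (int d = 0; d < count; ++d)
        for (int t = 0; t < 2; ++t)     // two threads per GPU, as in the workload
            threads.emplace_back(copyWorker, d, t, bytes, 50);
    for (auto &th : threads) th.join();
    return 0;
}
```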
High concurrency, high load
Now, when trying to achieve full load (again 10 GPUs, with MPS), I get really bad results, for example:
The slowest copy throughput is 68 MiB/s, the fastest around 2 GiB/s, but a lot of the transfers are closer to the slow end. The expected throughput is around 5 GiB/s (as measured on many other non-dual-CPU, non-multi-GPU systems). In this scenario most copies are in the 200 MiB/s range, which severely slows down inference and prevents fully utilizing the GPUs. The CPU appears overloaded here, even though the offered load is only about twice that of the previous scenario (?). A small probe to check whether CPU saturation alone reproduces the slow copies is sketched after the notes below.
Notes:
- GPU utilization is between 0% and 40%.
- CPU utilization is around 90%-100% (load average is around 120 on 80 logical cores).
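Since the CPU is near saturation here, one probe (purely a hypothesis check, not something the traces prove) is to time a pinned D->H copy while spin threads keep all cores busy, to see whether host CPU load alone reproduces the slow copies. Thread count and transfer size below are assumptions.

```cpp
// Probe sketch: does host CPU saturation alone slow down a pinned D->H copy?
// Measure the copy idle, then again while spin threads keep all cores busy.
#include <cuda_runtime.h>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

static double timedCopy(void* host, void* dev, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (double)bytes / (1ull << 30) / (ms / 1000.0);
}

int main() {
    const size_t bytes = 64ull << 20;   // 64 MiB (arbitrary)
    void *host = nullptr, *dev = nullptr;
    cudaMallocHost(&host, bytes);
    cudaMalloc(&dev, bytes);

    timedCopy(host, dev, bytes);        // warm-up
    printf("idle:   %.2f GiB/s\n", timedCopy(host, dev, bytes));

    // Saturate the host: one busy-spin thread per hardware thread.
    std::atomic<bool> stop{false};
    std::vector<std::thread> spinners;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
        spinners.emplace_back([&stop] { while (!stop.load()) {} });

    printf("loaded: %.2f GiB/s\n", timedCopy(host, dev, bytes));

    stop.store(true);
    for (auto &t : spinners) t.join();
    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```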
System details
CPU:
Info: 2x 20-core model: Intel Xeon Silver 4316 bits: 64 type: MT MCP SMP cache:
L2: 2x 25 MiB (50 MiB)
Speed (MHz): avg: 916 min/max: 800/3400 cores: 1: 801 2: 801 3: 800 4: 800 5: 800 6: 800
7: 801 8: 801 9: 801 10: 801 11: 800 12: 801 13: 800 14: 801 15: 801 16: 800 17: 801 18: 801
19: 800 20: 801 21: 801 22: 800 23: 801 24: 801 25: 800 26: 801 27: 801 28: 801 29: 2302
30: 801 31: 801 32: 1890 33: 801 34: 801 35: 1085 36: 1403 37: 801 38: 1413 39: 1243 40: 801
41: 800 42: 801 43: 801 44: 801 45: 801 46: 801 47: 801 48: 801 49: 801 50: 1648 51: 800
52: 800 53: 800 54: 2301 55: 800 56: 800 57: 801 58: 801 59: 801 60: 801 61: 801 62: 801
63: 801 64: 802 65: 800 66: 801 67: 801 68: 801 69: 1416 70: 801 71: 801 72: 1758 73: 800
74: 800 75: 846 76: 801 77: 801 78: 1440 79: 897 80: 801
Graphics:
Device-1: ASPEED Graphics Family driver: ast v: kernel
Device-2: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
Device-3: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
Device-4: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
Device-5: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
Device-6: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
Device-7: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
Device-8: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
Device-9: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
Device-10: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
Device-11: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
Display: server: No display server data found. Headless machine? tty: 238x51
Driver, CUDA, cuDNN, TRT versions
* NVIDIA Driver with version 525.60
* CUDA Toolkit with version 11.8 at /usr/local/cuda-11.8
* cuDNN with version 8.7.0 at /usr
* TensorRT with version: 8.5.2.2