I’ve been trying to reproduce this issue and I think I’ve gotten somewhere. I’m using bandwidthTest.
Single-GPU case
To start, I ran this command to get a baseline reading on a single GPU:
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=0 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh
The results are:
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: NVIDIA RTX A4000
Range Mode
Device to Host Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
20000000 5125.6
21000000 5187.4
22000000 5186.1
23000000 5184.9
24000000 5177.7
25000000 5188.1
26000000 5182.6
27000000 5191.7
28000000 5182.7
29000000 5192.6
30000000 5191.4
Result = PASS
So around 5.2 GB/s, which is great.
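Under the hood (as far as I can tell), a pageable --dtoh measurement boils down to timing cudaMemcpy calls from device memory into a plain malloc’d host buffer. Here’s a minimal sketch of that, my own approximation rather than the actual bandwidthTest source, with the buffer size and repetition count chosen arbitrarily:

```cpp
// Rough sketch of what a pageable --dtoh measurement boils down to.
// Not the bandwidthTest source; size and rep count are assumptions.
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 25 * 1000 * 1000;  // mid-range of the sizes above
    const int reps = 10;

    char *h_buf = (char *)malloc(bytes);    // pageable: plain malloc, not cudaMallocHost
    char *d_buf = nullptr;
    cudaMalloc((void **)&d_buf, bytes);
    cudaMemset(d_buf, 0, bytes);

    // Warm-up copy so context setup isn't included in the timing.
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    printf("%.1f MB/s\n", (double)bytes * reps / secs / 1e6);

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
```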
Multi-GPU case
Then, I ran the test for each device simultaneously using this script:
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=0 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=1 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=2 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=3 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=4 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=5 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=6 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=7 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=8 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=9 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh
It’s not perfect, but a quick glance at nvidia-smi showed that at least 8 GPUs were at around 40% utilization at the same time, so there’s definitely some overlap, and the results show this as well. (Just showing part of the output because there is so much of it:)
Transfer Size (Bytes) Bandwidth(MB/s)
20000000 4280.2
21000000 3949.6
22000000 3902.3
Another one looks like this:
Transfer Size (Bytes) Bandwidth(MB/s)
20000000 4448.5
21000000 4516.0
22000000 4628.8
Generally, I’m getting anywhere from 3.5 GB/s up to 5 GB/s.
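As a side note, the ten backgrounded shell jobs can be approximated in a single process with one host thread per device, which makes it a bit easier to guarantee the transfers actually overlap. A rough sketch (my own code, not part of the demo suite; sizes are again assumptions):

```cpp
// One host thread per device, each timing pageable D2H copies concurrently.
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

static void measure(int dev) {
    cudaSetDevice(dev);                       // bind this thread to its GPU
    const size_t bytes = 25 * 1000 * 1000;
    const int reps = 10;

    char *h_buf = (char *)malloc(bytes);      // pageable host buffer
    char *d_buf = nullptr;
    cudaMalloc((void **)&d_buf, bytes);

    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);  // warm-up

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    printf("device %d: %.1f MB/s\n", dev, (double)bytes * reps / secs / 1e6);

    cudaFree(d_buf);
    free(h_buf);
}

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    std::vector<std::thread> threads;
    for (int d = 0; d < n; ++d) threads.emplace_back(measure, d);
    for (auto &t : threads) t.join();
    return 0;
}
```

If the contention is on the host side rather than per-GPU, this version should show the same slowdown as the shell script.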
Multi-GPU with high CPU load
I did a simple stress test with stress --cpu 160 -t 60 (all cores will be computing sqrt on repeat) and ran the multi-GPU test again, and got:
20000000 2589.2
21000000 2157.9
22000000 2225.7
23000000 1937.3
...
Transfer Size (Bytes) Bandwidth(MB/s)
20000000 1967.4
21000000 1592.4
22000000 1618.9
The range is between 1.5 GB/s and around 4 GB/s. Already we’re seeing a decrease in transfer speeds due to (unrelated) CPU load.
Multi-GPU with RAM load
And again with stress --vm 160 -t 60, which will do malloc/free in a loop on all cores. Snippets from the results:
Transfer Size (Bytes) Bandwidth(MB/s)
20000000 1037.3
21000000 828.2
22000000 842.9
...
20000000 1618.3
21000000 1476.4
22000000 1139.5
23000000 955.0
So CPU usage seems to play a role here. I still did not see transfer speeds as low as 12 MiB/s, but I’m getting closer to what I’m seeing in the traces.
Do you have any idea how CPU load might play a role in this? Is it simply that the OS scheduler is not giving CUDA enough time to fetch data from the device?
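For what it’s worth, one experiment that might separate the two: as far as I understand, pageable D2H copies are staged through a driver-managed pinned buffer that the CPU has to copy out of, whereas pinned copies can be done purely by the DMA engine. So rerunning the tests above with --memory=pinned, or with a small comparison program like this sketch (buffer size and rep count are my own assumptions), should show whether the CPU-side staging copy is what’s collapsing under load:

```cpp
// Compare pageable vs pinned host buffers on one device. If the pinned
// numbers hold up under CPU/RAM stress while the pageable ones collapse,
// that points at the CPU-side staging copy.
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>

static double bw_mbs(char *h_buf, char *d_buf, size_t bytes, int reps) {
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);  // warm-up
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return (double)bytes * reps / secs / 1e6;
}

int main() {
    const size_t bytes = 25 * 1000 * 1000;
    const int reps = 10;

    char *d_buf = nullptr;
    cudaMalloc((void **)&d_buf, bytes);

    char *pageable = (char *)malloc(bytes);       // plain malloc
    char *pinned = nullptr;
    cudaMallocHost((void **)&pinned, bytes);      // page-locked host memory

    printf("pageable: %.1f MB/s\n", bw_mbs(pageable, d_buf, bytes, reps));
    printf("pinned:   %.1f MB/s\n", bw_mbs(pinned, d_buf, bytes, reps));

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d_buf);
    return 0;
}
```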