I’m facing an issue where I get inconsistent aggregate transfer speeds when using concurrent transfers to 4 GPUs.
I originally observed this while executing similar update/transfer work on 4 separate threads, with each thread updating its own GPU device in CUDA. I was doing OptiX BVH builds, which seemed to take much longer on 2 of the threads than on the other 2. I added timing around the async memcpys using CUDA events and observed that going from 1 device to 2 only added a small overhead, but going to 3 was a big jump, and going to 4 was another medium jump. The OptiX timings were a red herring: the builds were simply waiting on the async transfers to complete.
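For reference, the per-thread timing pattern is roughly the following (a minimal sketch, not my actual code; the buffer size and names are illustrative):

#include <cuda_runtime.h>
#include <cstdio>
#include <thread>
#include <vector>

// Each worker thread owns one device and times its own H2D copy with CUDA events.
static void timeCopy(int dev, const char* hostSrc, size_t bytes)
{
    cudaSetDevice(dev);

    void* dst = nullptr;
    cudaMalloc(&dst, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    cudaMemcpyAsync(dst, hostSrc, bytes, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("device %d: %.2f GB/s\n", dev, (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(stop);
    cudaEventDestroy(start);
    cudaStreamDestroy(stream);
    cudaFree(dst);
}

int main()
{
    const size_t bytes = 100000000;        // ~95 MiB, same size as the bwtest run below
    std::vector<char> hostBuf(bytes, 1);   // pageable host memory in this sketch

    int count = 0;
    cudaGetDeviceCount(&count);

    std::vector<std::thread> workers;
    for (int dev = 0; dev < count; ++dev)
        workers.emplace_back(timeCopy, dev, hostBuf.data(), bytes);
    for (auto& t : workers)
        t.join();
    return 0;
}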
I downloaded a bandwidth tester that measures each GPU non-concurrently; it reports that the first two GPUs get around 16 GB/s across PCIe, while the last two get around 9 GB/s.
If I look at approximate aggregate transfer rates from my original program, a single GPU gets decent performance (I'm away from my machine right now and don't want to guess at the exact numbers), but when I drive multiple GPUs concurrently the performance degrades drastically: with all 4 going at once, I see an aggregate of about 25 GB/s, or a bit over 6 GB/s per GPU.
For the standalone-ish measurements I was using this utility: GitHub - enfiskutensykkel/multi-gpu-bwtest (measures the bandwidth of multiple simultaneously started cudaMemcpyAsync operations). For example, running it like:
$ ./bwtest --do=all:HtoD:100000000
Allocating buffers…DONE
Executing transfers…DONE
Synchronizing streams…DONE
=====================================================================================
ID   Device name               Transfer size   Direction   Time elapsed        Bandwidth
 0   NVIDIA GeForce RTX 3090       95.37 MiB        HtoD        6020 µs    16611.03 MiB/s
 1   NVIDIA GeForce RTX 3090       95.37 MiB        HtoD        5981 µs    16719.10 MiB/s
 2   NVIDIA GeForce RTX 3090       95.37 MiB        HtoD       10636 µs     9401.78 MiB/s
 3   NVIDIA GeForce RTX 3090       95.37 MiB        HtoD       10631 µs     9406.87 MiB/s
Aggregated total time : 33268 µs
Aggregated total bandwidth : 12023.53 MiB/s
Estimated elapsed time : 10742 µs
Timed total bandwidth : 37237.84 MiB/s
I’m also attaching a program I wrote this morning to dig further. Interestingly, if I run the transfers serially, or even concurrently with the async copies issued from the same (main) thread, I usually get reasonable bandwidth numbers; but if I run them independently in separate threads, that is no longer the case. A rough sketch of the structure is below.
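In outline, the three --concurrency modes work roughly like this (a sketch with illustrative names, assuming 0 = serial, 1 = async copies issued from the main thread, 2 = one thread per device, as described above; the real program also handles --pagelock and the per-device timing):

#include <cuda_runtime.h>
#include <thread>
#include <vector>

enum class Mode { Serial = 0, AsyncSameThread = 1, Threads = 2 };

// Issue one host-to-device copy per GPU according to the selected mode.
void runTransfers(Mode mode, int deviceCount, const std::vector<void*>& hostBufs,
                  const std::vector<void*>& devBufs,
                  const std::vector<cudaStream_t>& streams, size_t bytes)
{
    switch (mode) {
    case Mode::Serial:                       // one blocking copy at a time
        for (int d = 0; d < deviceCount; ++d) {
            cudaSetDevice(d);
            cudaMemcpy(devBufs[d], hostBufs[d], bytes, cudaMemcpyHostToDevice);
        }
        break;
    case Mode::AsyncSameThread:              // all copies queued from the main thread
        for (int d = 0; d < deviceCount; ++d) {
            cudaSetDevice(d);
            cudaMemcpyAsync(devBufs[d], hostBufs[d], bytes, cudaMemcpyHostToDevice, streams[d]);
        }
        for (int d = 0; d < deviceCount; ++d) {
            cudaSetDevice(d);
            cudaStreamSynchronize(streams[d]);
        }
        break;
    case Mode::Threads: {                    // one worker thread per device
        std::vector<std::thread> workers;
        for (int d = 0; d < deviceCount; ++d)
            workers.emplace_back([&, d] {
                cudaSetDevice(d);
                cudaMemcpyAsync(devBufs[d], hostBufs[d], bytes, cudaMemcpyHostToDevice, streams[d]);
                cudaStreamSynchronize(streams[d]);
            });
        for (auto& t : workers)
            t.join();
        break;
    }
    }
}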
Here are the results I get with the attached program:
$ test_transfer_speeds -- --pagelock=false --kilobytes=1000 --concurrency=0
Using 4 devices
Bandwidth for device 0 is 9.06156e+09
Bandwidth for device 1 is 9.23734e+09
Bandwidth for device 2 is 9.1473e+09
Bandwidth for device 3 is 9.0757e+09
$ test_transfer_speeds -- --pagelock=true --kilobytes=1000 --concurrency=0
Using 4 devices
Bandwidth for device 0 is 1.76659e+10
Bandwidth for device 1 is 1.78821e+10
Bandwidth for device 2 is 1.78811e+10
Bandwidth for device 3 is 1.79021e+10
$ test_transfer_speeds -- --pagelock=false --kilobytes=1000 --concurrency=1
Using 4 devices
Bandwidth for device 0 is 9.30287e+09
Bandwidth for device 1 is 1.42203e+10
Bandwidth for device 2 is 1.43691e+10
Bandwidth for device 3 is 1.53139e+10
$ test_transfer_speeds -- --pagelock=true --kilobytes=1000 --concurrency=1
Using 4 devices
Bandwidth for device 0 is 1.81344e+10
Bandwidth for device 1 is 1.80353e+10
Bandwidth for device 2 is 1.79322e+10
Bandwidth for device 3 is 1.79735e+10
$ test_transfer_speeds -- --pagelock=false --kilobytes=1000 --concurrency=2
Using 4 devices
Bandwidth for device 0 is 5.39684e+09
Bandwidth for device 1 is 5.71888e+09
Bandwidth for device 2 is 8.9671e+09
Bandwidth for device 3 is 6.3012e+09
$ test_transfer_speeds -- --pagelock=true --kilobytes=1000 --concurrency=2
Using 4 devices
Bandwidth for device 0 is 1.24669e+10
Bandwidth for device 1 is 1.68448e+10
Bandwidth for device 2 is 2.08564e+10
Bandwidth for device 3 is 1.73517e+10
Note that this is not without blips: "--pagelock=false --kilobytes=1000 --concurrency=1" shows one device getting much lower bandwidth, though it is conceivable that the OS scheduled the main thread onto another CPU for a while. If I run any combination often enough I will usually see some kind of blip where one device is slower than the others, but the serial and same-thread async modes are far more stable than the mode that runs concurrently in separate threads.
I have now added a mode that uses NUMA binding to pin to cores on either node 0 or node 1. This doesn't appear to have much effect (on average, transfers might be slightly faster from node 0 than from node 1 with no concurrency, but only by maybe 5%).
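The binding mode is roughly this (a sketch assuming libnuma, linked with -lnuma; the node number comes from my command line):

#include <numa.h>
#include <cstdio>

// Pin the calling thread to the CPUs of one NUMA node and prefer that node's
// memory for subsequent host allocations, before buffers are allocated and
// transfers are issued.
bool bindToNode(int node)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available on this system\n");
        return false;
    }
    if (numa_run_on_node(node) != 0)   // restrict execution to CPUs on this node
        return false;
    numa_set_preferred(node);          // prefer page allocations from this node
    return true;
}

(Running the whole test under numactl --cpunodebind/--membind should behave similarly.)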
So my questions are: (a) should I be observing drastically different bandwidths to different GPUs, and (b) should I be observing this much slowdown when driving transfers to all GPUs concurrently?
[actually, how do I attach the program? Let me know if I can attach, or if I need to make a reply with code inline]