Device to device bandwidth, bandwidth test vs theoretical maximum

Wyk3d · May 26, 2014, 4:54pm

Hello!

Is it normal to get 236 GB/s device-to-device bandwidth from the bandwidth test in the samples on a GTX 780 Ti? The theoretical peak is 336 GB/s, which is 100 GB/s more…

njuffa · May 26, 2014, 5:29pm

What CUDA and driver versions are you using? For comparison purposes, can you please try the simple dcopy test app I attached to this previous post?

Quadro 4000 Bandwidth The device to device bandwidth obtained with - CUDA Programming and Performance - NVIDIA Developer Forums

You would want to compile the dcopy app with -arch=sm_35 for use on the GTX 780 Ti. It would also be a good idea to try different vector lengths, as discussed in the forum thread. In general, the faster GPU platforms need longer test vectors to achieve the optimal memory throughput.

Wyk3d · May 26, 2014, 5:59pm

CUDA 6.0, driver 332.88.
I ran dcopy and the performance varies with each launch. It tends to get better if I run it multiple times in a row, which probably has something to do with caching. Without any arguments I get results between 194-230 GB/s. With -n20000000 I get between 216-238 GB/s. With -n160000000 I get 233-237 GB/s. With -n32000000 I get out of memory errors already. So the results seem consistent with the bandwidth test from the samples.

Wyk3d · May 26, 2014, 6:13pm

So it’s still only 70% of theoretical bandwidth instead of the 85% you got with ECC off on the C2050.
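For reference, the 336 GB/s peak and the 70% figure can be reproduced with a few lines of arithmetic. This is only a sketch: the 384-bit bus width and 7 Gbit/s per-pin data rate are the published GTX 780 Ti specifications, not numbers stated in this thread, and the function names are made up for illustration.

```python
def peak_bandwidth_gb_s(bus_width_bits, data_rate_gbit_s_per_pin):
    """Theoretical peak: bytes moved per transfer times transfers per second."""
    return bus_width_bits / 8 * data_rate_gbit_s_per_pin


def efficiency(measured_gb_s, peak_gb_s):
    """Fraction of the theoretical peak that a measurement achieves."""
    return measured_gb_s / peak_gb_s


# Assumed GTX 780 Ti specs (published values, not taken from this thread).
peak = peak_bandwidth_gb_s(bus_width_bits=384, data_rate_gbit_s_per_pin=7.0)

print(f"theoretical peak: {peak:.0f} GB/s")
print(f"bandwidthTest at 236 GB/s: {efficiency(236, peak):.1%} of peak")
print(f"85% of peak would be {0.85 * peak:.1f} GB/s")
```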

njuffa · May 26, 2014, 7:06pm

Thanks for double checking. It seems there is nothing wrong with the device-to-device functionality in the driver, so this looks like a hardware limitation.

The maximum of 85% efficiency without ECC used to hold true for multiple GPU generations. From what I understand about DDR interfaces, it was primarily limited by the read/write turnaround. The same effect is seen with the DDR3 memory subsystems of CPUs. With ECC turned on, efficiency drops to around 75%, since on GPUs the additional ECC traffic is handled in-band, whereas on CPUs it is carried in a separate side band. For example, my K20c has a theoretical bandwidth of 208 GB/sec, while measured copy bandwidth with ECC is 151 GB/sec (72.6% efficiency).

The GTX 780 Ti has no ECC support of course, so I am not sure why the efficiency is relatively low. As an additional experiment, you could check how the copy bandwidth varies with access width. The internal hardware queues that track outstanding loads and stores have a limited number of entries. Wider accesses may therefore be able to better saturate the memory bus (the total amount of data tracked by the queues increases as the width of each tracked operation increases). You could change the type double in dcopy to double2 to generate 128-bit accesses to see whether this makes a difference.
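The queue argument can be illustrated with Little's law: sustained throughput is bounded by the data in flight divided by the latency. The sketch below is purely illustrative. The queue depth and latency are hypothetical placeholders, not measured or documented values for any GPU, and only the ratio between the two access widths matters.

```python
def bandwidth_bound_gb_s(queue_entries, access_bytes, latency_ns):
    """Little's law bound: bytes in flight / latency (bytes per ns equals GB/s)."""
    return queue_entries * access_bytes / latency_ns


# Hypothetical values chosen only to show the scaling, not real hardware data.
ENTRIES = 1024
LATENCY_NS = 400.0

bound_double = bandwidth_bound_gb_s(ENTRIES, 8, LATENCY_NS)    # 64-bit accesses
bound_double2 = bandwidth_bound_gb_s(ENTRIES, 16, LATENCY_NS)  # 128-bit accesses

print(f"double2 bound / double bound = {bound_double2 / bound_double:.1f}")
```

With a fixed number of entries, doubling the access width doubles the bound in this simple model. Whether real copy bandwidth improves depends on whether the queues are actually the limiting factor.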

SPWorley · May 27, 2014, 2:30am

The fact that the reported bandwidth changes from run to run might mean your GPU is downclocking, and your different values are dependent on just how fast you re-run and whether the GPU has spooled all the way up or down at the time. I don’t know what OS you’re running, but you might query nvidia-smi in a loop to keep printing the GPU frequency to make sure you’re at full boost clocks. Or in Windows there are GPU monitoring tools to watch the frequency changes, or even force maximum clocks.
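A minimal sketch of such a polling loop, assuming nvidia-smi is on the PATH and that the driver exposes these query fields for the card (some GeForce boards report certain fields as not supported). The helper names are made up for illustration.

```python
import subprocess
import time

# Fields accepted by nvidia-smi's --query-gpu option.
QUERY_FIELDS = ["clocks.sm", "clocks.mem", "pstate", "temperature.gpu"]


def build_command(gpu_index=0):
    """Assemble the nvidia-smi call for one GPU, printing bare CSV values."""
    return [
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]


def parse_line(line):
    """Map one CSV output line onto the queried field names."""
    values = [value.strip() for value in line.split(",")]
    return dict(zip(QUERY_FIELDS, values))


def poll(samples=10, interval_s=1.0, gpu_index=0):
    """Print clocks, P-state and temperature once per interval."""
    for _ in range(samples):
        result = subprocess.run(
            build_command(gpu_index), capture_output=True, text=True, check=True
        )
        print(parse_line(result.stdout.strip().splitlines()[0]))
        time.sleep(interval_s)
```

Calling `poll()` in a second terminal while the bandwidth test runs shows whether the clocks have reached their boost values by the time the measurement is taken.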

njuffa · May 27, 2014, 3:12am

Following up on SPWorley’s line of thought, section 1.4.4.5 of the Kepler Tuning Guide may also apply to the GTX 780 Ti (I am not sure whether it is based on the GK110B like the K40):

Kepler Tuning Guide :: CUDA Toolkit Documentation

Wyk3d · May 27, 2014, 11:32am

I installed NVIDIA Inspector and used it to force a PState with the clock turned up as far as it would let me (1020 MHz), and now I’m getting 263 GB/s with the bandwidth test from the samples and 271 GB/s with dcopy -n160000000 :). 271 is 80% of the peak, which is much better. It is still not 85%, but maybe wider accesses could improve this further. I’m also getting a constant 68 degrees temperature now, though.
