Device-to-device bandwidth: bandwidthTest vs. theoretical maximum


Is it normal to get 236 GB/s device-to-device bandwidth with the bandwidthTest app from the samples on a GTX 780 Ti? The theoretical peak is 336 GB/s, which is 100 GB/s more…

What CUDA and driver versions are you using? For comparison purposes, can you please try the simple dcopy test app I attached to this previous post?

You would want to compile the dcopy app with -arch=sm_35 for use on the GTX 780 Ti. It would also be a good idea to try different vector lengths, as discussed in the forum thread. In general, the faster GPU platforms need longer test vectors to achieve the optimal memory throughput.
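The dcopy app from the earlier post is not reproduced here, but a minimal sketch of the same kind of benchmark (my assumptions: a grid-stride copy kernel timed with CUDA events, vector length taken from the command line) looks roughly like this:

```cuda
// Minimal dcopy-style device-to-device copy benchmark (sketch, not the
// actual dcopy app). Build with: nvcc -arch=sm_35 -o dcopy dcopy.cu
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void dcopy(const double *src, double *dst, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride)
        dst[i] = src[i];
}

int main(int argc, char **argv)
{
    size_t n = (argc > 1) ? strtoull(argv[1], 0, 10) : 20000000; // vector length
    double *src, *dst;
    if (cudaMalloc(&src, n * sizeof(double)) != cudaSuccess ||
        cudaMalloc(&dst, n * sizeof(double)) != cudaSuccess) {
        fprintf(stderr, "out of memory\n");
        return 1;
    }

    dcopy<<<1024, 256>>>(src, dst, n);      // warm-up launch

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    dcopy<<<1024, 256>>>(src, dst, n);      // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    // one read plus one write per element
    printf("throughput = %.1f GB/s\n", 2.0 * n * sizeof(double) / ms / 1e6);
    return 0;
}
```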

CUDA 6.0, driver 332.88.
I ran dcopy and the performance varies with each launch. It tends to get better if I run it multiple times in a row; that probably has something to do with caching. Without any arguments I get results between 194 and 230 GB/s. With -n20000000 I get between 216 and 238 GB/s, and with -n160000000 I get 233-237 GB/s. With -n320000000 I already get out-of-memory errors. So the results seem consistent with the bandwidthTest app from the samples.

So it’s still only 70% of theoretical bandwidth instead of the 85% you got with ECC off on the C2050.

Thanks for double checking, it seems there is nothing wrong with the device-to-device functionality in the driver, this looks like a hardware limitation.

The maximum of 85% efficiency without ECC used to hold true across multiple GPU generations. From what I understand about DDR interfaces, it was primarily limited by the read/write turnaround; the same effect is seen with CPUs' DDR3 memory subsystems. With ECC turned on, efficiency drops to around 75%, since on GPUs the additional ECC traffic is handled in-band, whereas on CPUs it is carried in a separate side band. For example, my K20c has a theoretical bandwidth of 208 GB/s, while the measured copy bandwidth with ECC on is 151 GB/s (72.6% efficiency).

The GTX 780 Ti has no ECC support of course, so I am not sure why the efficiency is relatively low. As an additional experiment, you could check how the copy bandwidth varies with access width. The internal hardware queues that track outstanding loads and stores have a limited number of entries, so wider accesses may be better able to saturate the memory bus (the total amount of data tracked by the queues grows with the width of each tracked operation). You could change the type double in dcopy to double2 to generate 128-bit accesses and see whether this makes a difference.
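A sketch of that change, assuming dcopy uses a grid-stride copy kernel: switching the element type to double2 makes each thread move 16 bytes per access instead of 8 (n2 here is the element count in double2 units, i.e. half the double count).

```cuda
// 128-bit access variant of the copy kernel (sketch): each element is
// a double2, so every load/store should compile to a 128-bit access.
__global__ void dcopy2(const double2 *src, double2 *dst, size_t n2)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n2; i += stride)
        dst[i] = src[i];   // 16 bytes read + 16 bytes written per iteration
}
```

Note that the buffers must stay 16-byte aligned for the vectorized accesses; allocations from cudaMalloc satisfy this already.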

The fact that the reported bandwidth changes from run to run might mean your GPU is downclocking, and your different values are dependent on just how fast you re-run and whether the GPU has spooled all the way up or down at the time. I don’t know what OS you’re running, but you might query nvidia-smi in a loop to keep printing the GPU frequency to make sure you’re at full boost clocks. Or in Windows there are GPU monitoring tools to watch the frequency changes, or even force maximum clocks.
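On Linux, one way to do that frequency check programmatically is via NVML, which ships with the driver (a sketch under my assumptions; polling nvidia-smi in a shell loop works just as well). Compile the host code with nvcc and link with -lnvidia-ml:

```cuda
// Sketch: poll SM and memory clocks once per second via NVML while a
// benchmark runs in another process, to see whether the GPU is boosting.
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    if (nvmlInit() != NVML_SUCCESS ||
        nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) {
        fprintf(stderr, "NVML init failed\n");
        return 1;
    }
    for (int i = 0; i < 30; i++) {          // ~30 seconds of samples
        unsigned int sm, mem;
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mem);
        printf("SM %u MHz, memory %u MHz\n", sm, mem);
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}
```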

Following up on SPWorley's line of thought, this section of the Kepler Tuning Guide may also apply to the GTX 780 Ti (I am not sure whether it is based on the GK110B like the K40):

Installed NVIDIA Inspector, used it to force a P-state with the clock turned up as far as it would let me (1020 MHz), and now I'm getting 263 GB/s with the bandwidthTest app from the samples and 271 GB/s with dcopy -n160000000 :). 271 GB/s is 80% of the peak, which is much better; still not 85%, but indeed maybe wider accesses could improve this further. I'm also getting a constant 68-degree temperature now, though.