One GPU in a GTX 690 has 192 GB/s bandwidth to global memory. However, I’m getting only 138 GB/s for global memory reads, and 175 GB/s for global memory writes. Why could this be happening?
My benchmark is available here: https://github.com/anujkaliaiitd/systemish/tree/master/gpu/seqMem . Most of the code is in the seqMem.cu file.
I get around 128 GB/s from the NVIDIA benchmark in this blog post: http://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/ .
What do you get by running the bandwidthTest CUDA sample code (Device to Device Bandwidth)?
The 192 GB/s number is a peak theoretical figure derived from the raw DRAM interface specifications:
(256 pins * 6.0 Gbps/pin) / 8 bits per byte = 192 GB/s
This number is not achievable in actual code. You will get some lower amount.
For instance, I have a K40c which has a nominal bandwidth of 288 GB/s. With bandwidthTest I see 183 GB/s. With the coalesced.cu from the blog link you provided, I see about 150 GB/s bandwidth. (These numbers are with ECC on.)
I think in your case numbers like 128 GB/s or perhaps 138 GB/s are probably reasonable (~2/3 of peak theoretical). The 175 GB/s number strikes me as unlikely.
You can also look at the profiler output for a large cudaMemset(); it will report a GB/s figure.
With the GTX 780 Ti I have seen over 300 GB/s listed in such output, but I'm not sure it is accurate.
With Jimmy Petterson's reduction code he was able to achieve 88% of peak bandwidth on a Titan, and I can usually get about 85% running that same code on a GTX 780 Ti or a Tesla K20 (the code runs 100 iterations and averages).
txbob: I get around 153 GB/s with bandwidthTest d2d.
Why don't we get the peak theoretical bandwidth?
There are various overheads and inefficiencies that prevent user software from utilizing the full bandwidth. If you’re looking for a precise answer I don’t have it. In my experience, user code should be able to achieve about 90% of the bandwidth reported by bandwidthTest. Your mileage may vary. Timing methods may impact this as well, for example the use of host-based timing vs. cudaEvent timing, the use of multiple rounds of testing that are averaged, etc.
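As an illustration of cudaEvent timing averaged over multiple rounds, here is a minimal sketch (not the thread's benchmark; the copy kernel, problem size, and round count are arbitrary assumptions) that measures effective device-to-device bandwidth:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial copy kernel: reads N floats and writes N floats,
// so it moves 2 * N * sizeof(float) bytes through DRAM.
__global__ void copyKernel(const float *in, float *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const size_t n = 1 << 26;   // 64M floats (arbitrary choice)
    const int rounds = 100;     // average over many rounds
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    dim3 block(256), grid((unsigned)((n + 255) / 256));
    copyKernel<<<grid, block>>>(d_in, d_out, n);  // warm-up launch

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < rounds; ++r)
        copyKernel<<<grid, block>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // milliseconds
    double seconds = ms / 1e3 / rounds;         // per-round time
    double gbytes  = 2.0 * n * sizeof(float) / 1e9;  // read + write
    printf("Effective bandwidth: %.1f GB/s\n", gbytes / seconds);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Host-based timing (e.g. gettimeofday around a single launch) would also include launch and synchronization overhead, which is one reason results from different timing methods disagree.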
Jimmy P’s reduction output:
GeForce GTX 780 Ti @ 336.000 GB/s
        N    [GB/s]   [perc]   [usec]   test
  1048576    157.33    46.82     26.7   Pass
  2097152    192.49    57.29     43.6   Pass
  4194304    233.10    69.38     72.0   Pass
  8388608    258.17    76.84    130.0   Pass
 16777216    273.89    81.51    245.0   Pass
 33554432    281.41    83.75    476.9   Pass
 67108864    285.67    85.02    939.7   Pass
134217728    287.55    85.58   1867.1
Non-base 2 tests!
        N    [GB/s]   [perc]   [usec]   test
 14680102    272.84    81.20    215.2   Pass
 14680119    272.76    81.18    215.3   Pass
 18875600    270.54    80.52    279.1   Pass
  7434886    165.25    49.18    180.0   Pass
 13324075    247.17    73.56    215.6   Pass
 15764213    257.93    76.76    244.5   Pass
  1850154     65.80    19.58    112.5   Pass
  4991241    148.23    44.12    134.7   Pass