Global memory bandwidth on GTX 690

anujkaliaiitd · September 12, 2014, 11:07pm

Hi.

One GPU in a GTX 690 has 192 GB/s bandwidth to global memory. However, I’m getting only 138 GB/s for global memory reads, and 175 GB/s for global memory writes. Why could this be happening?

My benchmark is available here: https://github.com/anujkaliaiitd/systemish/tree/master/gpu/seqMem . Most of the code is in the seqMem.cu file.

I’m getting around 128 GB/s from the NVIDIA benchmark in this blog post: http://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/ .

Robert_Crovella · September 13, 2014, 1:30am

What do you get by running the bandwidthTest CUDA sample code? (Device to Device Bandwidth)

The 192GB/s number is a peak theoretical number arrived at by calculating raw numbers from the DRAM interface:

(256 pins * 6.0 Gbps/pin) / 8 bits per byte = 192 GB/s

This number is not achievable in actual code. You will get some lower amount.

For instance, I have a K40c which has a nominal bandwidth of 288GB/s. With bandwidthTest I see 183GB/s. With the coalesced.cu from the blog link you provided, I see about 150GB/s bandwidth. (These numbers are with ECC on.)

I think in your case the numbers like 128GB/s or perhaps 138GB/s are probably reasonable (~2/3 of peak theoretical). The 175GB/s number strikes me as unlikely.
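The peak calculation in the post above, and the fractions of peak that the measured numbers represent, can be checked with a few lines of Python. This is only a sketch: the function name `peak_bandwidth_gbs` is made up for illustration, and the measured figures are simply the ones quoted in this thread.

```python
def peak_bandwidth_gbs(bus_width_bits, gbps_per_pin):
    """Peak theoretical DRAM bandwidth in GB/s: pins * per-pin rate / 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8.0


# One GPU of a GTX 690: 256 pins at 6.0 Gbps/pin, as stated in the post above.
peak = peak_bandwidth_gbs(256, 6.0)

# Measured figures quoted in this thread (GB/s).
measured = {
    "seqMem reads": 138.0,
    "seqMem writes": 175.0,
    "blog post benchmark": 128.0,
}

print(f"peak theoretical: {peak:.0f} GB/s")
for name, gbs in measured.items():
    print(f"{name}: {gbs:.0f} GB/s = {100 * gbs / peak:.1f}% of peak")
```

The read and blog-benchmark figures land at roughly two thirds to three quarters of peak, while the write figure comes out above 90%, which is why it stands out.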

CudaaduC · September 13, 2014, 4:51am

You can also look at the profiler data for a large cudaMemset(); it will report a GB/s figure.

With the GTX 780 Ti I have seen over 300 GB/s listed in such output, but I’m not sure whether it is accurate.

With Jimmy Petterson’s reduction code he was able to achieve 88% of peak bandwidth on a Titan, and I can usually get about 85% running the same code on a GTX 780 Ti or a Tesla K20 (the code runs 100 times and averages the results).

anujkaliaiitd · September 13, 2014, 12:34pm

txbob: I get around 153 GB/s with bandwidthTest d2d.

Why don’t we get the peak theoretical bandwidth?

Robert_Crovella · September 13, 2014, 3:16pm

There are various overheads and inefficiencies that prevent user software from utilizing the full bandwidth. If you’re looking for a precise answer I don’t have it. In my experience, user code should be able to achieve about 90% of the bandwidth reported by bandwidthTest. Your mileage may vary. Timing methods may impact this as well, for example the use of host-based timing vs. cudaEvent timing, the use of multiple rounds of testing that are averaged, etc.
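As a rough worked example of the 90% rule of thumb above, the sketch below applies it to the 153 GB/s bandwidthTest result reported earlier in this thread. The helper name `expected_user_bandwidth` and the default `fraction=0.9` are illustrative assumptions taken from the "about 90%" estimate, not a guarantee.

```python
def expected_user_bandwidth(bandwidth_test_gbs, fraction=0.9):
    """Rule-of-thumb estimate: user code reaches about `fraction` of bandwidthTest."""
    return bandwidth_test_gbs * fraction


# Device-to-device result from bandwidthTest on the GTX 690, as reported above.
gtx690_d2d = 153.0
estimate = expected_user_bandwidth(gtx690_d2d)
print(f"rule-of-thumb estimate: {estimate:.1f} GB/s")

# Compare with the seqMem measurements from the first post (GB/s).
for name, gbs in {"reads": 138.0, "writes": 175.0}.items():
    print(f"{name}: {gbs:.0f} GB/s = {100 * gbs / gtx690_d2d:.0f}% of bandwidthTest")
```

Under this estimate the read figure is about what one would expect, while the write figure exceeds the bandwidthTest result itself.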

CudaaduC · September 13, 2014, 5:25pm

Jimmy P’s reduction output:

GeForce GTX 780 Ti @ 336.000 GB/s

N            [GB/s]   [perc]   [usec]   test
1048576      157.33   46.82    26.7     Pass
2097152      192.49   57.29    43.6     Pass
4194304      233.10   69.38    72.0     Pass
8388608      258.17   76.84    130.0    Pass
16777216     273.89   81.51    245.0    Pass
33554432     281.41   83.75    476.9    Pass
67108864     285.67   85.02    939.7    Pass
134217728    287.55   85.58    1867.1

Non-base 2 tests!

N            [GB/s]   [perc]   [usec]   test
14680102     272.84   81.20    215.2    Pass
14680119     272.76   81.18    215.3    Pass
18875600     270.54   80.52    279.1    Pass
7434886      165.25   49.18    180.0    Pass
13324075     247.17   73.56    215.6    Pass
15764213     257.93   76.76    244.5    Pass
1850154      65.80    19.58    112.5    Pass
4991241      148.23   44.12    134.7    Pass
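The [GB/s] and [perc] columns above can be reproduced from N and [usec], which shows how effective bandwidth is derived from a timing. The sketch below assumes 4-byte elements; the thread does not state the element size, but that value matches the reported figures. The function name `effective_bandwidth_gbs` is made up for illustration.

```python
BYTES_PER_ELEMENT = 4  # assumption: 32-bit elements (not stated in the thread)
PEAK_GBS = 336.0       # peak figure from the header line of the output above


def effective_bandwidth_gbs(n_elements, usec, bytes_per_element=BYTES_PER_ELEMENT):
    """Bytes read divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    return n_elements * bytes_per_element / (usec * 1e-6) / 1e9


# (N, usec, reported GB/s) for three rows copied from the output above.
rows = [
    (134217728, 1867.1, 287.55),
    (7434886, 180.0, 165.25),
    (1850154, 112.5, 65.80),
]

for n, usec, reported in rows:
    gbs = effective_bandwidth_gbs(n, usec)
    print(f"N={n}: {gbs:.2f} GB/s ({100 * gbs / PEAK_GBS:.2f}% of peak), "
          f"reported {reported:.2f} GB/s")
```

The computed values agree with the reported ones to within the rounding of the [usec] column.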