Quadro GV100 gives very low memory bandwidth

I ran samples/1_Utilities/bandwidthTest/bandwidthTest on a computer with a single 32GB Quadro GV100 installed, and the result is below:


[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: Quadro GV100
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 12.3

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 13.2

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 539.0

Result = PASS

I don’t know why this gives such low memory bandwidth compared to what the spec sheet says (around 900GB/s).
I see similarly low memory bandwidth in other programs too, so I guess the problem is not in the sample benchmark.
I also fixed the ‘Graphics’ and ‘Memory’ clocks to 1132MHz and 850MHz, which is confirmed under ‘Clocks’ in nvidia-smi -q.
nvidia-smi -q also reports a link width of 16x and PCIe generation 3.
The driver version is 455.23.05, tested under Ubuntu 18.04.5 LTS with Linux 5.4.0-48.

How can I deal with this problem? Is this normal behavior?

Your numbers seem to be in-line with another report here

The bandwidth test you ran is measuring host<->device bandwidth.

The 900GB/s figure you quote is transfer speed, GPU memory ↔ device (registers, shared memory etc).
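For the host<->device rows, the relevant ceiling is the PCIe link, not HBM2. The sketch below is a rough sanity check, assuming the standard PCIe 3.0 figures (8 GT/s per lane, 128b/130b encoding) and that the reported GB/s uses 1 GB = 1e9 bytes. The function name is made up for illustration.

```python
def pcie_gen3_bandwidth_gbs(lanes: int) -> float:
    """Theoretical one-direction PCIe 3.0 bandwidth in GB/s (1 GB = 1e9 bytes).

    Ignores packet/protocol overhead, so real transfers land below this.
    """
    transfers_per_second = 8e9   # PCIe 3.0 signalling rate: 8 GT/s per lane
    encoding = 128 / 130         # 128b/130b line encoding
    bits_per_second = transfers_per_second * encoding * lanes
    return bits_per_second / 8 / 1e9


# Link width 16x, generation 3, as reported by nvidia-smi -q in the question.
link_limit = pcie_gen3_bandwidth_gbs(16)

# Pinned-memory results from the bandwidthTest output in the question.
measured = {"Host to Device": 12.3, "Device to Host": 13.2}
fractions = {name: gbs / link_limit for name, gbs in measured.items()}

for name, gbs in measured.items():
    print(f"{name}: {gbs} GB/s, {fractions[name]:.0%} of {link_limit:.2f} GB/s")
```

Under these assumptions both host<->device numbers sit at roughly 80% of the link limit, which suggests that part of the output is not where the shortfall is.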

@Robert_Crovella I still don’t understand. I have a Tesla V100 PCIe 16GB, which has only a slightly higher clock and the same memory bus width (4096-bit) as the GV100, and it reliably gives 700+GB/s in both Nsight Compute and bandwidthTest.
@rs277 If 900GB/s is the bandwidth within the memory subsystem (cache<=>global memory), why does the Tesla V100 give so much higher performance on the same program? Could you share a possible clue?
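To put the bus width and clock numbers in context, here is a minimal sketch of how a peak theoretical bandwidth figure is usually derived. It assumes HBM2 transfers data twice per memory clock. The 877 MHz Tesla V100 memory clock is an assumed value that does not appear in this thread, and the function name is hypothetical.

```python
def peak_bandwidth_gbs(bus_width_bits: int,
                       memory_clock_mhz: float,
                       transfers_per_clock: int = 2) -> float:
    """Peak theoretical memory bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_per_clock=2 assumes double-data-rate signalling (HBM2).
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# Quadro GV100: 4096-bit bus, 850 MHz memory clock (both from the question).
gv100_peak = peak_bandwidth_gbs(4096, 850)

# Tesla V100: same bus width; 877 MHz is an assumption, not from the thread.
v100_peak = peak_bandwidth_gbs(4096, 877)

print(f"GV100 peak: {gv100_peak:.1f} GB/s")
print(f"V100 peak:  {v100_peak:.1f} GB/s")
print(f"Clock-only difference: {v100_peak / gv100_peak - 1:.1%}")
```

With these inputs the GV100 works out to about 870 GB/s, and the clock difference alone accounts for only about 3%, far less than the gap between 539 GB/s and 700+ GB/s.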

I’m not sure I know what the reasons are, and I don’t happen to have a GV100 to play with.

A GV100 is a display-capable GPU, whereas most other V100 variants I am aware of are not (excepting Titan V). If you are running a display on this GPU or have the system configured to use this GPU as part of X, then I think it’s possible that the display activities might be consuming memory bandwidth on a continuous basis. If I were doing a comparison I would disable X and if possible move the console to another display device. But I don’t know if that accounts for the difference or not.

Apologies, I was too quick reading and failed to notice the last test was device<->device. I can’t offer any more than what you have already checked and Robert has offered.

I tried that, but sadly it made no difference.
(There were no display activities, and X was not running.)

It is a limitation of the Quadro GV100 due to different refresh settings for the memory.

What do you mean by different refresh settings of memory?

Can I change it?

If not, then what was the point of promoting the GV100 as having a much higher 870GB/s bandwidth thanks to HBM2?

I don’t think you can change it.
The CUDA sample is not optimized; if you run STREAM, you can get 620 GB/s on a Quadro GV100.

Device Selected 0: “Quadro GV100”

STREAM Benchmark implementation in CUDA
Triad: a(i) = b(i) + q*c(i)
Array size (67108864 double precision elements) = 512 MB
using 2 elements per thread, 64 threads per block, 524288 blocks

Function    Rate (MB/s)    Avg time    Min time    Max time    Eff
Triad:      620367         0.002598    0.002596    0.002599    71.3 %

If you need more BW, you will need to use a Tesla V100.
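The Triad rate can be reproduced from the array size and the timing. This is a sketch using the usual STREAM convention that Triad moves three arrays per pass (reads b and c, writes a). It assumes the rate is derived from the minimum time, which is printed rounded, so the result only matches the reported rate approximately.

```python
N_ELEMENTS = 67_108_864      # array length from the STREAM output
BYTES_PER_ELEMENT = 8        # double precision
ARRAYS_PER_PASS = 3          # Triad: a(i) = b(i) + q*c(i) reads b, c and writes a


def triad_rate_mbs(n_elements: int, seconds: float) -> float:
    """Bytes moved by one Triad pass divided by its runtime, in MB/s (1 MB = 1e6 bytes)."""
    bytes_moved = ARRAYS_PER_PASS * BYTES_PER_ELEMENT * n_elements
    return bytes_moved / seconds / 1e6


array_mib = N_ELEMENTS * BYTES_PER_ELEMENT / 2**20   # size of one array
rate_from_min_time = triad_rate_mbs(N_ELEMENTS, 0.002596)

print(f"Array size: {array_mib:.0f} MB per array")
print(f"Triad rate from min time: {rate_from_min_time:.0f} MB/s")
```

The recomputed value lands within a small fraction of a percent of the reported 620367 MB/s, so the benchmark is counting total bytes read and written, not just one array.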

NVIDIA doesn’t provide any way for you to do something like change the memory refresh settings on a GPU.

Every GPU has a difference between the measurable/achievable bandwidth and the stated (“peak theoretical”) bandwidth. These differences vary from one GPU type/design to the next. I’m reasonably sure the 870GB/s bandwidth number (or whatever the stated number for the GV100 is) is a peak theoretical bandwidth, and that is never achievable on any GPU.
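As a rough illustration of that gap, the sketch below expresses the two measurements from this thread as a fraction of peak. The 870.4 GB/s peak is an assumption (4096-bit bus at 850 MHz, double data rate), and the helper name is made up for this example.

```python
PEAK_GBS = 870.4  # assumed peak theoretical bandwidth for the Quadro GV100

# Measured device-memory bandwidths reported in this thread, in GB/s.
measurements = {
    "bandwidthTest device to device": 539.0,
    "STREAM Triad": 620.367,
}


def fraction_of_peak(measured_gbs: float, peak_gbs: float = PEAK_GBS) -> float:
    """Achieved bandwidth as a fraction of the peak theoretical figure."""
    return measured_gbs / peak_gbs


efficiency = {name: fraction_of_peak(gbs) for name, gbs in measurements.items()}

for name, frac in efficiency.items():
    print(f"{name}: {frac:.1%} of peak")
```

The STREAM fraction agrees with the 71.3 % efficiency column in the STREAM output, which suggests that column is computed against the same peak figure.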

I’m not aware of any Quadro GPU that was released prior to the GV100 that had that level of memory bandwidth. Quadro P6000 would have been the previous “high-end” Quadro, and it has a lower memory bandwidth than the GV100.

Thank you all for your help!