Volta 100 LINPACK performance and energy-efficiency

hofm · February 9, 2018, 10:57am

Hello there,

I noticed the release of CUDA 9.1.128, which includes improved GEMM performance. Are these improvements mainly for small matrices (i.e., DL), or does LINPACK’s DGEMM profit as well? Where can we get the most up-to-date HPL binary for our Volta 100 cards?

Another question, related to the Student Cluster Competition. We’d like to squeeze as many GFlop/s/W out of the card as possible. As I’m sure you know, while DGEMM has some requirement on memory bandwidth, it is not very high. Currently we get close to 30 GFlop/s/W when using the optimum core frequency. We hope to improve on this value by lowering the clock (and thereby the dissipated power) of the memory system. However, nvidia-smi currently reports only a single supported memory clock:

$ nvidia-smi -q -d SUPPORTED_CLOCKS | grep Memory
Memory : 877 MHz

Is this a restriction imposed by the driver (and if so, can you provide a workaround), or is there simply no hardware support for changing the memory clock?
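For anyone who wants to run the same check across several cards or driver versions, here is a minimal sketch that pulls the memory clocks out of the nvidia-smi report. The helper names `parse_memory_clocks` and `query_memory_clocks` are made up for illustration, and the regular expression assumes the report keeps the `Memory : <n> MHz` layout shown above.

```python
import re
import subprocess


def parse_memory_clocks(report: str) -> list[int]:
    """Return the distinct memory clocks (MHz) found in the report, highest first.

    Assumes lines of the form 'Memory : 877 MHz', as in the output above.
    """
    pattern = re.compile(r"^\s*Memory\s*:\s*(\d+)\s*MHz", re.MULTILINE)
    clocks = {int(match.group(1)) for match in pattern.finditer(report)}
    return sorted(clocks, reverse=True)


def query_memory_clocks() -> list[int]:
    """Run the same nvidia-smi query as above and parse its output."""
    result = subprocess.run(
        ["nvidia-smi", "-q", "-d", "SUPPORTED_CLOCKS"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_clocks(result.stdout)


if __name__ == "__main__":
    # Parse the line quoted in this post instead of requiring a GPU.
    print(parse_memory_clocks("Memory : 877 MHz\n"))
```

On a machine with the driver installed, `query_memory_clocks()` returning a single entry would correspond to the situation described here, where there is only one memory clock to choose from.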

njuffa · February 9, 2018, 3:53pm

Do you have data on this? It’s been some time and my memory is hazy, but I seem to recall that GEMM would suck up as much as half the available (not: theoretical) memory bandwidth. This had come about because the FLOPS on GPUs kept increasing faster than available memory bandwidth. Maybe the jump in bandwidth provided by HBM2 has made the ratio more favorable again …
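A rough way to sanity-check that recollection is a back-of-the-envelope estimate for a blocked GEMM. The sketch below assumes a simple model in which a square b x b tile of C stays on chip, so each step loads one b x b tile of A and one of B (16·b² bytes in double precision) and performs 2·b³ flops, giving b/8 flop per byte. The tile size of 128 and the 7000 GFlop/s rate are hypothetical round numbers, not measurements, and real GEMM kernels use a more elaborate blocking hierarchy.

```python
def dgemm_arithmetic_intensity(tile: int) -> float:
    """Flop per byte of DRAM traffic for a b x b tiled DGEMM (simplified model)."""
    flops_per_step = 2.0 * tile**3            # b^3 multiply-adds
    bytes_per_step = 2.0 * tile**2 * 8.0      # one tile of A and one of B, 8-byte doubles
    return flops_per_step / bytes_per_step    # simplifies to tile / 8


def required_bandwidth_gbs(gflops: float, tile: int) -> float:
    """DRAM bandwidth (GB/s) needed to sustain `gflops` under the model above."""
    return gflops / dgemm_arithmetic_intensity(tile)


if __name__ == "__main__":
    assumed_gflops = 7000.0  # hypothetical sustained DGEMM rate, not a measurement
    for tile in (64, 128, 256):  # hypothetical tile sizes
        need = required_bandwidth_gbs(assumed_gflops, tile)
        print(f"tile {tile:3d}: about {need:.1f} GB/s")
```

Under these assumptions a 128-wide tile needs several hundred GB/s, which is the same order of magnitude as "half the available bandwidth" on an HBM2 card. Since the result is inversely proportional to the tile size, it should be read as an order-of-magnitude estimate only.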

hofm · February 9, 2018, 4:54pm

Indeed, I have observed increasing demand on memory bandwidth, especially on CPUs, where AVX, AVX2/FMA, and AVX-512 have been doubling peak flops from tock to tock while bandwidth has increased much more slowly; however, DGEMM is still far from being bandwidth-bound.

If your memory is correct and you indeed require (what in my opinion is “only”) half of the sustained bandwidth, then this qualifies as a huge energy-saving potential in my book. I could reduce the memory clock by a factor of almost two and bandwidth would still not constrain performance. This should improve energy efficiency (i.e., the sustained GFlop/s/W) significantly. Also, keep in mind that the GPU core frequency is well below its nominal value (at around 1 GHz) when we maximize for energy efficiency, which again decreases the bandwidth requirement.
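To put a rough number on that potential, here is a minimal sketch of the arithmetic. It assumes that performance stays unchanged, that memory power scales linearly with the memory clock (no voltage change), and that the memory system accounts for some fraction of board power. That fraction is a hypothetical parameter, since I have no measured value for it; only the 30 GFlop/s/W baseline comes from our runs.

```python
def efficiency_after_downclock(
    gflops_per_watt: float,
    mem_power_fraction: float,
    mem_clock_scale: float,
) -> float:
    """Estimate GFlop/s/W after scaling the memory clock.

    mem_power_fraction: assumed share of board power drawn by the memory system.
    mem_clock_scale: new memory clock divided by the old one (0.5 = halved).
    Assumes unchanged performance and memory power linear in memory clock.
    """
    if not 0.0 <= mem_power_fraction <= 1.0:
        raise ValueError("mem_power_fraction must be within [0, 1]")
    if not 0.0 < mem_clock_scale <= 1.0:
        raise ValueError("mem_clock_scale must be within (0, 1]")
    relative_power = (1.0 - mem_power_fraction) + mem_power_fraction * mem_clock_scale
    return gflops_per_watt / relative_power


if __name__ == "__main__":
    baseline = 30.0  # GFlop/s/W at the optimum core frequency
    for fraction in (0.1, 0.2, 0.3):  # hypothetical memory power shares
        estimate = efficiency_after_downclock(baseline, fraction, 0.5)
        print(f"memory share {fraction:.0%}: about {estimate:.1f} GFlop/s/W")
```

The gain depends entirely on the memory system's share of the power budget, so measuring that share would be the first step before trusting any of these figures.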

Another benefit of being able to set the memory clock is that it will get you higher performance when energy is not an issue. This might seem counter-intuitive at first but I have observed this effect on all post-HSW Xeon CPUs: The smaller the slice of the TDP pie for ‘non-compute’, the more frequency headroom for cores (as long as you don’t lower the memory clock so far that memory bandwidth becomes the bottleneck).

njuffa · February 9, 2018, 5:13pm

I was specifically referring to GEMM on GPUs. I do understand that lowering memory clocks could be beneficial to power draw if the full memory bandwidth is not required. When you performed such experiments (i.e., lowering the memory clock) with GDDR-based GPUs, how much improvement in the performance/power ratio did you find?

I have no knowledge of the details of HBM2 memory and its external interfaces, so I do not know whether they are in principle as flexible with regard to operating frequency as we are used to from regular old DRAM.

ethanxlj · February 26, 2018, 6:39am

Oh, on my P100 the highest supported clocks are:
Memory 715 MHz, Graphics 1328 MHz

I don’t know whether the poor result is due to the old version of HPL (the official version is for Fermi).