Optimizing the High Performance Conjugate Gradient Benchmark on GPUs

Originally published at: https://developer.nvidia.com/blog/optimizing-high-performance-conjugate-gradient-benchmark-gpus/

[This post was co-written by Everett Phillips and Massimiliano Fatica.] The High Performance Conjugate Gradient Benchmark (HPCG) is a new benchmark intended to complement the High-Performance Linpack (HPL) benchmark currently used to rank supercomputers in the TOP500 list. This new benchmark solves a large sparse linear system using a multigrid preconditioned conjugate gradient (PCG) algorithm.…

I have been looking for a good STREAM (Triad) implementation in CUDA. I hacked together my own version, which is somewhat slower (247/217 GB/s) on the K40 (using GPU Boost). Which implementation did you use? Is it available somewhere, or can I just cite this post? In that case, do you have numbers for a K20?
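For reference, a STREAM Triad kernel in CUDA typically boils down to a grid-stride loop computing `a[i] = b[i] + scalar * c[i]`, timed with CUDA events. The sketch below is not the implementation from the post; the array size, launch configuration, and repetition count are my own illustrative choices:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// STREAM Triad: a[i] = b[i] + scalar * c[i]
__global__ void triad(double *a, const double *b, const double *c,
                      double scalar, size_t n)
{
    // Grid-stride loop, so any launch configuration covers all n elements.
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        a[i] = b[i] + scalar * c[i];
}

int main()
{
    const size_t n = 1 << 26;                 // 64 Mi doubles (512 MB) per array
    const double scalar = 3.0;
    double *a, *b, *c;
    cudaMalloc(&a, n * sizeof(double));
    cudaMalloc(&b, n * sizeof(double));
    cudaMalloc(&c, n * sizeof(double));
    // (Initialization of b and c omitted for brevity.)

    const int block = 256;
    const int grid  = 1024;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    triad<<<grid, block>>>(a, b, c, scalar, n);   // warm-up launch
    const int reps = 20;
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        triad<<<grid, block>>>(a, b, c, scalar, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    // Triad moves three arrays per iteration (two reads, one write),
    // the standard STREAM byte count.
    double gbps = 3.0 * n * sizeof(double) * reps / (ms * 1e6);
    printf("Triad bandwidth: %.1f GB/s\n", gbps);

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Measured numbers depend on ECC, clocks, and array size; arrays should be much larger than the L2 cache for the result to reflect DRAM bandwidth.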

Are you going to release this CUDA-optimized HPCG implementation, in either binary or source form, to CUDA registered developers?

I noticed that the "processor" count in the section "Analysis of the First HPCG List" does not match the values in the HPCG results for June 2014 (even after correcting to count only the accelerators). Are there more detailed results available than the PDF table on the HPCG web site?

The numbers in the table are the number of MPI processes. Since all the systems used a single MPI process per accelerator (Xeon Phi or Tesla K20X) or per node (in the case of the K computer), it was the right metric for computing the available bandwidth of the systems. It also gives a better idea of the size of each system. The official HPCG list shows the processor count, similar to HPL. You can request the full YAML files from the HPCG authors; during the HPCG BoF at SC14, they indicated that the YAML files should be downloadable in the near future.

My very simple version of STREAM in CUDA delivers 150 GB/s on the K20 GPGPUs in the TACC Stampede system (ECC enabled). This is almost exactly what one would expect from linearly scaling the K20X result above by the ratio of the peak memory bandwidths of the two models: 182 * 208/250 = 151 GB/s.

On the other hand, my version delivers unimpressive results on the K40 -- no more than 192 GB/s on the K40 GPGPUs in the TACC Maverick system (ECC enabled) -- about 12% lower than the value quoted above.

Which frequency did you use on the K40? With ECC enabled, I get 192 GB/s at the base clock (745 MHz) and 217 GB/s running at 875 MHz. This is a well-known issue for the K40. Given the agreement with my result, I assume you just left the K40 at its base clock?
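For anyone hitting the same issue: on the K40 the graphics clock can be raised from the 745 MHz base to 875 MHz via application clocks, either with `nvidia-smi -ac 3004,875` or programmatically through NVML. A minimal sketch (device index and clock values are the K40's documented application clocks; root privileges are usually required):

```cuda
// Compile with: nvcc set_clocks.cu -lnvml
#include <cstdio>
#include <nvml.h>

int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);   // first GPU in the system

    // K40 application clocks: 3004 MHz memory, graphics boostable to 875 MHz.
    nvmlReturn_t r = nvmlDeviceSetApplicationsClocks(dev, 3004, 875);
    if (r != NVML_SUCCESS)
        printf("Failed to set clocks: %s (root privileges are usually required)\n",
               nvmlErrorString(r));

    nvmlShutdown();
    return 0;
}
```

`nvidia-smi -q -d SUPPORTED_CLOCKS` lists the valid memory/graphics clock pairs for a given board.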

Thanks! My K40 was running at the base clock (745 MHz on this system), and I don't currently have permission to change the operating frequency. I will go try to fix that problem. ;-)

745 MHz is the base clock of the K40; I mixed that up with the K20. I have corrected my comment above.

Any plans to release the source code? The HPCG website has released some binaries from NVIDIA, but only for CUDA 8 and 6.5, and I couldn't run them. :( Why not the source?