Optimizing the High Performance Conjugate Gradient Benchmark on GPUs

jwitsoe · July 25, 2014, 1:41am

Originally published at: https://developer.nvidia.com/blog/optimizing-high-performance-conjugate-gradient-benchmark-gpus/

[This post was co-written by Everett Phillips and Massimiliano Fatica.] The High Performance Conjugate Gradient Benchmark (HPCG) is a new benchmark intended to complement the High-Performance Linpack (HPL) benchmark currently used to rank supercomputers in the TOP500 list. This new benchmark solves a large sparse linear system using a multigrid preconditioned conjugate gradient (PCG) algorithm.…

anon56323385 · October 23, 2014, 9:03pm

I have been looking for a good STREAM (Triad) implementation in CUDA. I hacked together my own version, which is, however, somewhat slower (247/217 GB/s) on the K40 (using GPU Boost). Which implementation did you use? Is it available somewhere, or can I just cite this post? In that case, do you have numbers for a K20?
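For readers comparing Triad numbers across implementations, here is a minimal sketch of how a STREAM Triad bandwidth figure is usually derived. It assumes the standard STREAM accounting for the kernel `a[i] = b[i] + q * c[i]` (two reads and one write per element), double-precision words, and 1 GB = 1e9 bytes. The function names and the sample array size and timing are hypothetical and are not measurements from this thread.

```python
def triad_bytes(n_elements, word_bytes=8):
    """Bytes counted by STREAM for Triad: read b, read c, write a."""
    return 3 * n_elements * word_bytes


def triad_bandwidth_gbs(n_elements, seconds, word_bytes=8):
    """Reported bandwidth in GB/s (assuming 1 GB = 1e9 bytes)."""
    return triad_bytes(n_elements, word_bytes) / seconds / 1e9


# Hypothetical example: 1,000,000 doubles processed in 1 ms.
example_gbs = triad_bandwidth_gbs(1_000_000, 1e-3)
print(f"Triad bandwidth: {example_gbs:.1f} GB/s")
```

Two implementations are only comparable if they count bytes the same way and time the kernel alone, excluding host-device transfers.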

anon18260394 · October 24, 2014, 12:57am

Are you going to release this CUDA-optimized HPCG implementation, in either binary or source form, to registered CUDA developers?

anon82000561 · December 5, 2014, 4:42pm

I noticed that the "processor" count in the section "Analysis of the First HPCG List" does not match the values in the HPCG results for June 2014 (even after correcting for just counting the accelerators). Are there more detailed results available than the PDF table of results on the HPCG web site?

anon30371095 · December 5, 2014, 11:46pm

The numbers used in the table are the number of MPI processes. Since all the systems used a single MPI process per accelerator (Xeon Phi or Tesla K20X) or per node (in the case of K), it was the right metric to use to compute the available bandwidth of the systems. It also gives a better idea of the size of the system. The official HPCG list shows the processor count, similar to HPL. You can request the full YAML files from the HPCG authors. During the HPCG BoF at SC14, the authors indicated that the YAML files should be downloadable in the near future.

anon82000561 · December 9, 2014, 3:33pm

My very simple version of STREAM in CUDA delivers 150 GB/s on the K20 GPGPUs in the TACC Stampede system (ECC enabled). This is almost exactly what one would expect from linear scaling of the K20X result above by the peak memory bandwidths of the two models: 182 * 208/250 = 151 GB/s.

On the other hand, my version delivers unimpressive results on the K40 -- no more than 192 GB/s on the K40 GPGPUs in the TACC Maverick system (ECC enabled) -- about 12% lower than the value quoted above.
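The linear-scaling estimate for the K20 can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in this comment: the function name `scale_bandwidth` is made up for illustration, and the inputs (182 GB/s for the K20X, peak memory bandwidths of 208 and 250 GB/s) are the figures quoted in the comment.

```python
def scale_bandwidth(measured_gbs, peak_target_gbs, peak_reference_gbs):
    """Scale a measured bandwidth by the ratio of peak memory bandwidths."""
    return measured_gbs * peak_target_gbs / peak_reference_gbs


# K20X result scaled to the K20 by peak bandwidth (208 vs. 250 GB/s).
k20_estimate = scale_bandwidth(182.0, 208.0, 250.0)
print(f"Estimated K20 STREAM bandwidth: {k20_estimate:.0f} GB/s")
```

The estimate lands within about 1 GB/s of the 150 GB/s measured on the K20.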

anon56323385 · December 9, 2014, 3:41pm

Which frequency did you use on the K40?
With ECC enabled I get 192 GB/s at the base clock (745 MHz) and 217 GB/s at 875 MHz. This is a well-known issue for the K40.
Given the agreement with my result, I assume you just left the K40 at its base clock?
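As a quick check on these figures, the sketch below compares the relative clock increase with the relative bandwidth increase. It uses only the numbers in this comment (192 GB/s at 745 MHz, 217 GB/s at 875 MHz). The variable names are illustrative, and the arithmetic says nothing about why the two ratios differ.

```python
base_clock_mhz, boost_clock_mhz = 745.0, 875.0
base_bw_gbs, boost_bw_gbs = 192.0, 217.0

# Relative gains from moving from the base clock to 875 MHz.
clock_gain = boost_clock_mhz / base_clock_mhz - 1.0
bandwidth_gain = boost_bw_gbs / base_bw_gbs - 1.0

print(f"Clock increase:     {clock_gain:.1%}")      # 17.4%
print(f"Bandwidth increase: {bandwidth_gain:.1%}")  # 13.0%
```

On these numbers, measured bandwidth rises with the clock but less than proportionally.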

anon82000561 · December 9, 2014, 5:29pm

Thanks! My K40 was running at the base clock (745 MHz on this system),
and I don't currently have permission to change the operating
frequency. I will go try to fix that problem. ;-)

anon56323385 · December 9, 2014, 5:36pm

745 MHz is the base clock of the K40. I mixed that up with the K20. I have corrected my comment above.

anon14666503 · March 24, 2017, 2:07am

Are there any plans to release the source code? The HPCG website has some binaries from NVIDIA, but only for CUDA 8 and 6.5, and I couldn't run them :( Why not release the source?
