HPL2 for Nvidia Tesla Platform

I was wondering where I could find the HPL2 Nvidia code to run on a cluster of K80 Tesla GPUs.


I’m not sure what “HPL2” refers to, but a GPU-enabled version of HPL for general use is available to registered developers. You can register at developer.nvidia.com. Once your registration is processed, you will have access to a variety of downloads, one of which is “CUDA Accelerated Linpack”; it contains source code you can build to run a GPU-enabled HPL on your machine or cluster.

How many K80’s do you have in your cluster?

The code available on the registered-developer web site is quite old and not fully optimized. It is still valuable for stressing a system, and several people have reused part of the code (the DGEMM intercept) in other projects.
How big is your cluster? If you are working on a Top500 submission, please contact Nvidia and we can make the optimized binary available to you.

Hello DrM,

Our environment is a Haswell cluster with 2 K80s per node connected via a non-blocking FDR IB fabric. The size of the cluster is modest. This environment is going to be used for R/D work on stencil-based and compression codes, among others.

On that note, can GPU-enabled HPL leverage GPUDirect RDMA (GDR)? The cluster is GDR-ready, but we have not yet tested the GDR functionality to any extent.

Also, could you recommend any other parallel applications (besides HPL) that can leverage GPU-enabled HPC clusters, with an eye towards finite-difference solvers, adaptive, irregular, or multigrid problems, or sparse-matrix computation?

Lastly, CUDA vs. OpenACC: it was suggested at a recent OpenACC workshop that OpenACC can produce quite high-performing GPU-parallel code, on occasion even faster than hand-coded CUDA. Is that your sense as well?

Thank you, and apologies for all these questions!

The HPL code does not use GDR.

For simple nested do loops, OpenACC can do a good job.
With CUDA and CUDA Fortran you have full control and a more flexible tool.

Does an FDR Haswell cluster with 230 K80 GPUs (2 per node) have a good chance of entering the Top500 for November 2015? If so, then yes, we are shooting for the Top500 at SC2015.


Yes, it’s possible. I believe someone will contact you shortly via the email address associated with your account here. If there is another/better way to contact you, please indicate.

We are in contact with Louis and Ty.


Unfortunately, both Louis and Ty are not accessible. I have received the hybrid_gpu_hpl_v4 binary and some documentation for running on clusters (Guide_to_Running_HPL_on_NVIDIA_GPUs_v0.2.pdf). I cannot find documentation on how to run for best results on Haswell-EP systems with M GiB of DRAM and 2 K80s per node (one on the same socket as the IB HCA, the other on the other socket). Could you please suggest a rank geometry and the basic logic for setting up hybrid_gpu_hpl_v4 runs with N nodes? I cannot specify exact numbers here. We can use OpenMPI 1.8.8 or 1.10.0.

Please contact my public email address if you could to let me know how to contact you directly.


I tried to contact you about 24 hours ago via the email address you use on these forums, but so far I have not received a reply. I’ve also been in contact with Louis, who believes he is in contact with you.

We managed to actually get together and worked on some promising runs. Let’s see …



I’m trying to benchmark a cluster of 20 nodes, each with 2 Tesla K80 GPUs, using the Linpack code available here: https://developer.nvidia.com/rdp/assets/cuda-accelerated-linpack-linux64 Unfortunately, I’m not getting good results. Is there a Linpack version available for the Tesla K80? If not, where could I get a Linpack build that works better with K80 GPUs?

Best regards,