CUDA accelerated Linpack code available

The source code for the CUDA accelerated Linpack is now available to all registered developers.

The code has been released under a BSD license.

A few remarks:

  1. There is NO support for the code (the CUDA_LINPACK_README.txt has detailed instructions).
  2. The code requires a Fermi card with more than 2 GB of memory (it uses a fast DGEMM implementation written in Fermi assembler); all Tesla 20x0 cards qualify.
  3. The library that intercepts the DGEMM and DTRSM calls could easily be used in other codes that are DGEMM intensive.
  4. The code requires CUDA 4.0 and is Linux only.

This presentation describes the implementation details:

This is the same code used for several Top500 runs; it is well tested and known to run with several thousand GPUs.

Dear Dr. Fatica,

Could you give me the link to the CUDA-enabled version of HPL optimized for Tesla 20-series GPUs?

Thanks in advance

It is available from the CUDA registered developer web site:

If you don’t have an account, you will need to register.

Thanks again!

Dr. Fatica,

After registering, where exactly can we find the downloadable code? Could you please point to a URL?


(1) Log into NVIDIA’s Registered Developer website at
(2) On the right-hand side of the starting page, you will see a column titled “Newest Documents And Downloads”
(3) The second item from the top is a link for “CUDA accelerated Linpack”

“2) The code requires a Fermi card (It uses a fast DGEMM implementation written in Fermi assembler) with more than 2GB of memory ( all the Tesla 20x0 will qualify)”

I have tried this code on a small cluster with multiple GTX 580s/590s per node, 1.5 GB each. It appears to work.
What problems would you expect with the smaller memory? Would performance be expected to improve with a larger GPU memory?

I note that I am using smaller node memories as well. My impression is that adding more memory, either by increasing the number of nodes or the amount of memory on each node, will let me solve larger arrays with a resulting higher FMAX. Can you use different memory sizes on individual nodes, or will the problem be limited by the size of the smallest node?

Dr. Bahr


I have registered on the nvdeveloper zone, yet it does not give me access. I am looking to download and run the Linpack code that was shown in the presentation by E. Phillips and M. Fatica. Can someone please help me here? Thanks!

Unable to log in here, even though I have a registered email address and password. Thanks.


I have the same problem. I created an account, but the link provided here won’t recognise my login and password. I think it points to the old developer site, hence the issue.

Could you please provide the instructions on how to obtain the GPU LINPACK code on the new (current) developers site?

Thank you in advance.

EDIT: Never mind, I found the new link:

This paper describes the use of CUDA to accelerate the Linpack benchmark on heterogeneous clusters, where both CPUs and GPUs are used in synergy with minor or no modifications to the original source code. A host library intercepts the calls to DGEMM and DTRSM and executes them simultaneously on both GPUs and CPU cores. An 8U cluster is able to sustain more than a Teraflop using a CUDA accelerated version of HPL.


Excuse me, I cannot find the source code on the developer site. Could you send me a copy to my email?

My email address is, thank you in advance.

(1) Login at
(2) Click on green link “CUDA/GPU Computing Registered Developer Program”
(3) Look for “CUDA Accelerated Linpack”, and click the link there
(4) Click green “I accept” string at the bottom of the license agreement

At this point a download window should pop up. I just exercised the entire process using my own registered developer account, so it should work.

Hi, I am unable to access Linpack, could you please provide me with one? thanks!

I just exercised it with my own developer account and it worked perfectly; I would really like to have a look at this superb piece of code and its performance level!

Is it possible to run CUDA Linpack on a Kepler K20X with CUDA 5.5?