The source code for the CUDA-accelerated Linpack is now available to all registered developers.
The code has been released under the BSD license.
- There is NO support for the code (the CUDA_LINPACK_README.txt has detailed instructions).
- The code requires a Fermi card (it uses a fast DGEMM implementation written in Fermi assembler) with more than 2GB of memory (all the Tesla 20x0 cards will qualify).
- The library that intercepts the DGEMM and DTRSM calls could easily be used in other codes that are DGEMM-intensive.
- The code requires CUDA 4.0 and it is Linux only.
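For anyone wondering what "intercepting" a DGEMM call can look like, here is a minimal sketch in plain C. It is not taken from the released code. It assumes the common approach of exporting the Fortran BLAS symbol `dgemm_` from a library that the linker resolves ahead of the host BLAS, and then choosing a path based on problem size. The names `OFFLOAD_MIN_WORK`, `should_offload`, `reference_dgemm` and `offloaded_calls` are hypothetical, and the "GPU" path is only a stand-in that runs the same naive host loop. The README and the presentation are the places to look for what the actual library does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff: calls with fewer multiply-adds than this stay on the host. */
#define OFFLOAD_MIN_WORK 1000000.0

/* Counts calls that would have been sent to the GPU (illustration only). */
static int offloaded_calls = 0;

/* Decide whether a call is large enough to be worth offloading. */
static int should_offload(int m, int n, int k)
{
    return (double)m * n * k >= OFFLOAD_MIN_WORK;
}

/* Naive column-major reference: C = alpha * op(A) * op(B) + beta * C.
 * ta / tb are nonzero when the corresponding operand is transposed. */
static void reference_dgemm(int ta, int tb, int m, int n, int k,
                            double alpha, const double *a, int lda,
                            const double *b, int ldb,
                            double beta, double *c, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p) {
                double aip = ta ? a[p + (size_t)i * lda] : a[i + (size_t)p * lda];
                double bpj = tb ? b[j + (size_t)p * ldb] : b[p + (size_t)j * ldb];
                sum += aip * bpj;
            }
            double *cij = &c[i + (size_t)j * ldc];
            /* As in BLAS, C is not read when beta is zero. */
            *cij = alpha * sum + (beta == 0.0 ? 0.0 : beta * *cij);
        }
    }
}

/* Exported with the Fortran BLAS name and argument list, so an application
 * that calls dgemm_ lands here when this object is resolved first. */
void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb,
            const double *beta, double *c, const int *ldc)
{
    assert(transa && transb && m && n && k && alpha && beta);
    assert(a && lda && b && ldb && c && ldc);

    int ta = (*transa != 'N' && *transa != 'n');
    int tb = (*transb != 'N' && *transb != 'n');

    if (should_offload(*m, *n, *k)) {
        /* A real shim would copy tiles to the device and launch the GPU
         * DGEMM here; this sketch only records the decision. */
        ++offloaded_calls;
    }
    /* A real shim would forward small calls to the original host BLAS;
     * both paths use the naive loop in this sketch. */
    reference_dgemm(ta, tb, *m, *n, *k, *alpha, a, *lda, b, *ldb, *beta, c, *ldc);
}
```

An application keeps calling `dgemm_` exactly as before; only the link or load order changes. A DTRSM wrapper would follow the same pattern with the `dtrsm_` argument list.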
This presentation has a description of the implementation details: http://www.nvidia.com/content/GTC-2010/pdfs/2057_GTC2010.pdf
This is the same code used for several Top500 runs; it is well tested and known to run with several thousand GPUs.