Hello everybody,

I have been enjoying benchmarking NVIDIA GPGPUs with the HPL code distributed through the Registered Developer Program. However, I found unexpected behavior when benchmarking very large problems: the run returns NaNs and "unspecified launch failure" device access errors with configurations where N / P * NB > 128000.

I investigated this problem and found a bug in the CUBLAS_DGEMM_MF function of cuda_dgemm.c. The m_gpu variable is the size of the workload to be processed on the GPU, and the mmax variable is the size that fits in GPU memory. The m_gpu workload appears to be intended to be divided into tasks of size mmax by the iter_m loop.

Unfortunately, several routines still use m_gpu rather than mmax as the array size. So the data size on the GPU becomes m_gpu * NB regardless of the amount of GPU memory, and the computation fails.

I would like to see this problem corrected, and I would be glad to help solve it.


If you are using Fermi GPUs, there is a limitation in the texture load path when using the fast DGEMM:
N/P needs to be less than 128000.

Hi Ryohei, it looks like the default buffer sizes are too small for this case.

The total GPU memory used by the buffers is only 2 GB. If your GPU has more than 2 GB, you can try increasing the buffer size.

For example, to double the buffer size (requires a GPU with at least 4 GB of memory):

cuda_dgemm.c : line 252, insert new line:

scratch_size_A *= 2; scratch_size *= 2;

Thank you for your advice.

The PC with the GPGPUs being benchmarked is under maintenance due to a power-supply problem, so I cannot test this right away. But I believe your reports are correct. After the maintenance, I expect HPL to run correctly with your patch.


I tested the patch after the power supply was repaired. The HPL problems are solved!

Thank you for your help.