I have been enjoying benchmarking Nvidia’s GPGPUs with the HPL code distributed through the Registered Developer Program. However, I found unexpected behavior when benchmarking very large problems: the run returns NaNs and “unspecified launch failure” device access errors with configurations where N / P * NB > 128000.
I investigated this problem and found a bug in the CUBLAS_DGEMM_MF function of cuda_dgemm.c. The m_gpu variable is the size of the workload to be processed on the GPU, and the mmax variable is the largest size that fits in GPU memory. The m_gpu workload appears to be intended to be divided into mmax-sized tasks by the iter_m loop.
Unfortunately, several routines still use m_gpu rather than mmax as the array size. As a result, the data size on the GPU becomes m_gpu * NB regardless of the amount of GPU memory, and the computation fails.
I would appreciate it if this problem could be corrected, and I would be glad to help fix it.