CuBLAS GEMM leads to invalid reads for some matrix dimensions

Hi all,

I have a problem with CuBLAS sgemm. When I try to compute C := A*B + C (scales omitted), and the matrices are of sizes:
A: m x k,
B: k x n,
C: m x n,
cuda-memcheck reveals that there are invalid reads for some specific matrix dimensions. For m=426, k=2, n=2305, I have those invalid reads, if I have k>=8, they do not occur anymore.
I already posted the question in stackoverflow, but since it is a very specific cuBLAS question, I think here might be more people who know the answer.

Please find a more detailed description, a toy-example, and the output of cuda-memcheck here:

I would be glad, if someone of you has a solution to that problem.

Thanks a lot,


For dimensions that give rise to the out-of-bounds messages from cuda-memcheck, do you get incorrect results as well? In my experience, cuda-memcheck diagnoses out-of-bounds conditions accurately on recent architectures, so I would think that out-of-bounds accesses actually happen when they are being reported. Such accesses might be harmless in the sense that data read during an out-of-bounds access may not enter the computation of the results, e.g. a kind of “overfetch” scenario, where overfetched data is discarded at a later stage.

If you are getting incorrect results, you would definitely want to file a bug. If the results are correct, it would still be helpful to file a bug so the CUBLAS team can eliminate the out-of-bounds accesses going forward. Please attach self-contained repro code to bug reports. Thank you for your help.

“For dimensions that give rise to the out-of-bounds messages from cuda-memcheck, do you get incorrect results as well?”

No, the results are correct. But in some cases it happens that I cannot read the matrix after the call to sgemm. Then, CUDA tells me the pointer points to an invalid address on the device memory (unless the value of the pointer did not change…)

Moreover, the error seems to be dependend on the hardware. I tested it on a notebook with the same operating system and cuda version, but with a GTX 760M instead of a GTX 780. On that machine, it works properly.

Ok, updated information:

I have two graphic cards in my machine, one for X, another one for Cuda applications.
I killed X and ran the code on the card I usually use for X. No problems. So, I think it is not a problem with the libraries, but a problem with cuBLAS in combination with the GTX 780. If you have any ideas how I could solve the issue, please feel free to post your suggestions.

Else, I think I’ll file a bug.

Thanks a lot in advance!

I was able to reproduce this issue (eventually) and have filed a bug (1532781) with NVIDIA.

@txbob: Thanks for reporting it, it seems Nvidia fixed it!

Just to add some information for googlers from the future: We apparently got trapped by the same bug in CUBLAS (as documented here: A particular cublasSgemm call always failed with CUBLAS_STATUS_EXECUTION_FAILED (when run with CUDA_LAUNCH_BLOCKING=1), although all arguments seemed correct. Investigating the matter, we found another cublasSgemm call that gives the correct result, but is shown to perform invalid reads when executed with cuda-memcheck, just as observed by AlRich.

In our case, the call was:

cublasSgemm(handle=0x75f1bc0, transa='N', transb='T',
     m=4096, n=512, k=5,
     A=0xb00700000, lda=4096,
     B=0xb00504800, ldb=512,
     C=0xb00ac0000, ldc=4096)

And the first error message by cuda-memcheck says (to help anybody googling for it):

========= Invalid __global__ read of size 4
========= at 0x000000e0 in sgemm_sm_heavy_nt_ldg
========= by thread (255,0,0) in block (0,0,0)
========= Address 0xb0050807c is out of bounds

Doing the math, this access is clearly out of range of the 512x5 matrix B.

With some more tests, we found that this only occurs on GPUs of the 700 series (including Tesla and TITAN) and Compute Capability 3.5. With nvprof we could see that this is because those GPUs use sgemm_sm35_ldg_nt_128x16x64x16x16 and sgemm_sm_heavy_nt_ldg for this specific dot product, while other GPUs use gemm_kernel1x1_core (to throw in two more keywords for search engines).
Testing some more, we found that the bug occurs in CUDA 5.0, 5.5, and 6.0, independently of the driver version. It does not occur in 4.2 (before the introduction of Compute Capability 3.5), and it seems to have been fixed in CUDA 6.5 released recently.

Yes, the fix should be in CUDA 6.5 production release (it was not in CUDA 6.5 RC).