cuBLAS fails when matrix has more than 2^31-1 entries?

Hi, I am using cuBLAS to multiply 2 single-precision matrices A and B.

It seems like whenever matrix A has more than approximately 2 billion elements (guessing ~2^31), the end of the output matrix is all zeros. If I reduce the size of the problem so that A has fewer than 2 billion elements total, the result is correct.

Does cuBLAS have a limit somewhere where it assumes array indices are 32-bit integers? Is there a way around this in cuBLAS?

There are explicitly published limits on the dimensions of matrices in the specification of the relevant interfaces, e.g.:

cublasSgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k, ...

On all platforms supported by CUDA and CUBLAS, int is a signed 32-bit integer type that can represent values in [-2^31, 2^31-1].

Unless the maximum matrix dimensions are exceeded, or there is not enough GPU memory to hold the matrix, CUBLAS should work fine with a matrix of more than 2^32-1 elements, though I haven’t personally tried that as I don’t have a GPU with enough memory to hold an 8+GB matrix. If there is evidence to the contrary, I would consider that a bug, in which case you would want to file a bug report with NVIDIA.

Which CUBLAS function in particular do you observe failing for a matrix with more than 2^32-1 elements, and what are the actual dimensions of that matrix?

This is for cublasSgemm(). The large matrix A has 24000 rows and 100000 columns, so 2.4 billion elements. I get incorrect results and cuda-memcheck errors. If I change the problem so that A has fewer than 2 billion elements, the results are correct and the cuda-memcheck errors disappear. I’ll try to create a minimal demo example tomorrow.
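To make the distinction between the two limits concrete, here is a minimal host-side sketch. The helper names fits_in_int and element_count are made up for illustration and are not part of cuBLAS. It only shows that each dimension of a 24000 x 100000 matrix fits in the int parameters of cublasSgemm, while the total element count does not fit in an int.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical helper: true when v can be passed as an `int` dimension
// argument (m, n, k) to cublasSgemm.
bool fits_in_int(std::int64_t v) {
    return v >= 0 && v <= INT_MAX;
}

// Hypothetical helper: element count computed in 64-bit arithmetic,
// so the product of two int-sized dimensions cannot wrap around.
std::int64_t element_count(std::int64_t rows, std::int64_t cols) {
    assert(rows >= 0 && cols >= 0);
    return rows * cols;
}
```

For the matrix in question, fits_in_int(24000) and fits_in_int(100000) both hold, but element_count(24000, 100000) is 2,400,000,000, which is larger than INT_MAX (2,147,483,647). So the call respects the documented interface limits even though linear element indices no longer fit in 32 bits.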

CUDA 10.1 on a Titan RTX with 24 GB GPU memory.

What kind of cuda-memcheck errors? Is it possible that the allocation of the memory for the matrix fails?

While all allocator functions take a size_t argument, care must be taken to avoid integer overflow when computing the size that is passed in. Recommended pattern: sizeof (float) * 24000 * 100000. Incorrect pattern: 24000 * 100000 * sizeof (float). In the latter, 24000 * 100000 is evaluated as an int multiplication first, so the overflow occurs before the conversion to size_t.
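A small sketch of the recommended pattern, wrapped in a helper so the dimensions can stay int variables. The name matrix_bytes is hypothetical, and the code assumes a 64-bit size_t, which the static_assert checks.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: a 64-bit host, so size_t can hold sizes above 4 GB.
static_assert(sizeof(std::size_t) >= 8, "this sketch assumes a 64-bit size_t");

// Hypothetical helper: bytes needed for a rows x cols single-precision matrix.
// sizeof(float) is already a size_t and each dimension is cast before use,
// so every multiplication is carried out in size_t and no intermediate
// product is ever evaluated as a 32-bit int.
std::size_t matrix_bytes(int rows, int cols) {
    assert(rows >= 0 && cols >= 0);
    return sizeof(float) * static_cast<std::size_t>(rows)
                         * static_cast<std::size_t>(cols);
}
```

The returned value can be passed directly as the size argument of cudaMalloc or cudaMemcpy. For 24000 x 100000 floats it is 9,600,000,000 bytes.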

Thanks, I checked the size_t thing and that doesn’t seem to be the problem.

I created a minimal breaking example and posted it on GitHub: - maybe I made some other stupid mistake.

Funny thing is that if B has one column, it works, but once B has more than one column it breaks.

The cuda-memcheck output is in memcheck-output.txt.

Thanks so much for your help!

As I stated previously, I can’t run with matrices that large on my system; neither of my GPUs has enough memory. The cuda-memcheck output on GitHub does not seem to correspond to your minimal test app, but to your actual app, correct? It shows mostly out-of-bounds accesses, and one would need the source code to figure out why those occur. I looked over your minimal test app and didn’t see anything out of place, but (fair warning!) I am terrible at spotting bugs in other people’s code.

Since you created a minimal app that repros the issue, you are now ready to file a bug report with NVIDIA.

Thanks. No, the errors are actually from the minimal test app. I’ll file a report with NVIDIA.

Bug report is here:

Note that for reasons of confidentiality, only the filer and relevant NVIDIA personnel have access to a bug report. But publishing the bug number here may be useful for NVIDIA moderators later on.

Oh, I see. Thanks!

The bug report gave immediate results! Apparently the NVIDIA representative could reproduce the bug on CUDA 10.2, but it works as it should in 11.0. So that’s that, I guess. Hopefully I can test it myself next week or so.

Just upgraded our server to CUDA 11.0. The cublasSgemm call now works as expected for all input sizes.
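For anyone who cannot upgrade yet, one possible approach to the "way around this" question is to split the product along the shared dimension k, so that each call only sees a slice of A with fewer than 2^31 elements, and to accumulate the partial products with beta = 1. Whether this actually avoids the pre-11.0 problem was not tested in this thread, so treat it as an assumption. The sketch below is CPU-only: sgemm_ref is a hypothetical stand-in with column-major, no-transpose semantics, and a real program would call cublasSgemm with CUBLAS_OP_N at that point.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for a column-major SGEMM without transposes:
// C (m x n) = alpha * A (m x k) * B (k x n) + beta * C.
void sgemm_ref(int m, int n, int k, float alpha,
               const float* A, int lda, const float* B, int ldb,
               float beta, float* C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[i + static_cast<std::size_t>(p) * lda] *
                       B[p + static_cast<std::size_t>(j) * ldb];
            }
            float& c = C[i + static_cast<std::size_t>(j) * ldc];
            // As in BLAS, C is not read when beta is zero.
            c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * c;
        }
    }
}

// Computes C = A * B by slicing k into pieces of at most k_chunk columns of A
// (and the matching rows of B), accumulating each partial product into C.
void sgemm_k_chunked(int m, int n, int k, int k_chunk,
                     const float* A, const float* B, float* C) {
    assert(k_chunk > 0);
    for (int k0 = 0; k0 < k; k0 += k_chunk) {
        const int kc = std::min(k_chunk, k - k0);
        const float beta = (k0 == 0) ? 0.0f : 1.0f;  // overwrite first, then accumulate
        sgemm_ref(m, n, kc, 1.0f,
                  A + static_cast<std::size_t>(k0) * m, m,  // columns k0.. of A, lda = m
                  B + k0, k,                                 // rows k0.. of B, ldb = k
                  beta, C, m);
    }
}
```

With m = 24000, a k_chunk of 50000 would give slices of A with 1.2 billion elements each. Note that the pointer offset into A is computed in size_t for the same overflow reasons discussed earlier in the thread.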