Hi, I am using cuBLAS to multiply 2 single-precision matrices A and B.
It seems like whenever matrix A has more than approximately 2 billion elements (guessing ~2^31), the end of the output matrix is all zeros. If I reduce the size of the problem so that A has fewer than 2 billion elements total, the result is correct.
Does cuBLAS have a limit somewhere where it assumes array indices are 32-bit integers? Is there a way around this in cuBLAS?
There are explicitly published limits on the dimensions of matrices in the specification of the relevant interfaces, e.g.:
cublasSgemm(cublasHandle_t handle,
            cublasOperation_t transa, cublasOperation_t transb,
            int m, int n, int k, ...
On all platforms supported by CUDA and CUBLAS, int is a signed 32-bit integer type that can represent values in [-2^31, 2^31-1].
Unless the maximum matrix dimensions are exceeded, or there is not enough GPU memory to hold the matrix, CUBLAS should work fine with a matrix of more than 2^32-1 elements, though I haven’t personally tried that as I don’t have a GPU with enough memory to hold an 8+ GB matrix. If there is evidence to the contrary, I would consider that a bug, in which case you would want to file a bug report with NVIDIA.
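As a quick back-of-the-envelope sketch (dimensions made up purely for illustration), each individual dimension can fit comfortably into an int while the total element count exceeds 2^31-1, as long as the size arithmetic itself is done in size_t:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* made-up dimensions for illustration */
    size_t rows = 50000, cols = 50000;

    /* each dimension fits comfortably into a signed 32-bit int ... */
    if (rows > INT_MAX || cols > INT_MAX) {
        fprintf(stderr, "dimension exceeds what the int parameters can hold\n");
        return 1;
    }

    /* ... while the total element count exceeds 2^31-1, which by itself is fine */
    size_t elems = rows * cols;                /* 2.5e9 elements               */
    size_t bytes = sizeof(float) * elems;      /* ~10 GB in single precision   */
    printf("elements = %zu, bytes = %zu\n", elems, bytes);
    return 0;
}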
Which CUBLAS function in particular do you observe failing for a matrix with more than 2^32-1 elements, and what are the actual dimensions of that matrix?
This is for cublasSgemm(). The large matrix A has 24000 rows and 100000 columns, so 2.4 billion elements. I get incorrect results and cuda-memcheck errors. If I change the problem so that A has fewer than 2 billion elements, the results are correct and the cuda-memcheck errors disappear. I’ll try to create a minimal demo example tomorrow.
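Roughly, I expect the demo to look something like the following (n = 128 is just a placeholder I picked for the sketch; initialization and most error checking are omitted):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    /* A is m x k = 24000 x 100000 (~2.4e9 floats), column-major as cuBLAS expects */
    const int m = 24000, k = 100000, n = 128;  /* n is a placeholder for the sketch */
    const float alpha = 1.0f, beta = 0.0f;

    float *dA, *dB, *dC;
    /* keep sizeof() first so the byte count is computed in size_t */
    cudaMalloc((void**)&dA, sizeof(float) * m * k);
    cudaMalloc((void**)&dB, sizeof(float) * k * n);
    cudaMalloc((void**)&dC, sizeof(float) * m * n);
    /* ... fill dA and dB with known values here ... */

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasStatus_t status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                        m, n, k,
                                        &alpha, dA, m, dB, k, &beta, dC, m);
    printf("cublasSgemm status = %d\n", (int)status);
    /* ... copy back the tail of dC and check whether it is all zeros ... */

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}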
What kind of cuda-memcheck errors? Is it possible that the allocation of the memory for the matrix fails?
While all allocator functions take a size_t argument, care must be taken to avoid integer overflow when passing in the size. Recommended pattern: sizeof (float) * 24000 * 100000. Incorrect pattern: 24000 * 100000 * sizeof (float). The latter pattern incurs an integer overflow prior to the conversion to size_t.
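To make the distinction concrete, here is a small sketch using the dimensions from your post (the incorrect variant is left commented out, since the int overflow it triggers is undefined behavior):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    /* recommended: sizeof() comes first, so the whole product is evaluated in size_t */
    size_t bytes = sizeof(float) * 24000 * 100000;   /* 9,600,000,000 bytes */

    /* incorrect: 24000 * 100000 is evaluated in int arithmetic first and overflows
       (2.4e9 > INT_MAX) before the conversion to size_t, e.g.:
       size_t bytes_bad = 24000 * 100000 * sizeof(float);                            */

    float *dA = NULL;
    cudaError_t err = cudaMalloc((void**)&dA, bytes);
    printf("requesting %zu bytes: %s\n", bytes, cudaGetErrorString(err));
    cudaFree(dA);
    return 0;
}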
As I stated previously, I can’t run with matrices that large on my system; neither GPU has enough memory. The cuda-memcheck output on GitHub does not seem to correspond to your minimal test app, but to your actual app, correct? It is mostly out-of-bounds accesses; one would need the source code to try to figure out why that is. I looked over your minimal test app and didn’t see anything out of place, but (fair warning!) I am terrible at spotting bugs in other people’s code.
Since you created a minimal app that repros the issue, you are now ready to file a bug report with NVIDIA.
Note that for reasons of confidentiality, only the filer and relevant NVIDIA personnel have access to a bug report. But publishing the bug number here may be useful for NVIDIA moderators later on.
The bug report gave immediate results! Apparently the NVIDIA representative could reproduce the bug on CUDA 10.2, but it works as it should on CUDA 11.0. So that’s that, I guess. Hopefully I can test it myself next week or so.