cublasSgemv crashes on Ampere cards... runs flawlessly on other card types

This is a strange issue. I have some simple cublasSgemv calls which run fine over hundreds of thousands of iterations on a 1080ti, 2080ti, etc. However, they crash after only a few calls on an RTX 3090 (compiled with CUDA 11.2). Same data, but Ampere crashes.

I have double, triple, and quadruple checked the sizes of my input arrays and the values of all other arguments, and the fact that the code runs fine over huge datasets for so many iterations on non-Ampere architectures leads me to believe this is an Ampere-related cuBLAS bug.

Nsight Compute returns the following API Stream results during the crash (with “cudaErrorInvalidResourceHandle” reported as the error generated during the call to cublasSgemv):

Any ideas what’s going on? What’s even stranger is that if I isolate the exact data sent to cublasSgemv (the matrix/vector/result arrays, filled with the same data that causes the crash in the larger application) and compile it into its own application, no crash occurs. So perhaps the cublas library is doing some internal, opaque allocations that cause an issue after a while on Ampere? From the API stack in the screenshot I can see Ampere-specific functions are being called, so it’s not a stretch to believe that an Ampere-related bug is within the realm of possibilities…
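For reference, the standalone isolation test was of roughly this shape (a minimal sketch; M, N, the fill values, and the iteration count below are illustrative placeholders, not the actual data captured from the plugin):

// Minimal standalone cublasSgemv test, a sketch of the kind of isolation app
// described above. Sizes, fill values, and iteration count are placeholders.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main()
{
    const int M = 128, N = 128; // crash-time values were in the 50-200 range
    std::vector<float> hMat((size_t)M * N, 1.0f);
    std::vector<float> hVec(N, 1.0f);
    std::vector<float> hRes(M, 0.0f);

    float *dMat = nullptr, *dVec = nullptr, *dRes = nullptr;
    cudaMalloc(&dMat, sizeof(float) * M * N);
    cudaMalloc(&dVec, sizeof(float) * N);
    cudaMalloc(&dRes, sizeof(float) * M);
    cudaMemcpy(dMat, hMat.data(), sizeof(float) * M * N, cudaMemcpyHostToDevice);
    cudaMemcpy(dVec, hVec.data(), sizeof(float) * N, cudaMemcpyHostToDevice);

    cublasHandle_t handle = nullptr;
    cublasCreate(&handle); // default pointer mode is HOST, matching host-side alpha/beta

    const float alpha = 1.0f, beta = 0.0f;
    for (int i = 0; i < 100000; ++i) // loop to mimic the long-running workload
    {
        cublasStatus_t st = cublasSgemv(handle, CUBLAS_OP_N, M, N,
                                        &alpha, dMat, M, dVec, 1, &beta, dRes, 1);
        if (st != CUBLAS_STATUS_SUCCESS)
        {
            printf("cublasSgemv failed at iteration %d: status %d\n", i, (int)st);
            break;
        }
    }

    cudaMemcpy(hRes.data(), dRes, sizeof(float) * M, cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dMat);
    cudaFree(dVec);
    cudaFree(dRes);
    return 0;
}

In standalone form the call never fails, which suggests the trigger is application state (device/context, handle lifetime, or library-internal allocations) rather than the Sgemv arguments themselves.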

Ok, perhaps a false alarm.

I thought I was running CUDA 11.2.2, but was only running 11.2. Upgraded to 11.2.2 and the crashing no longer occurs, at least in initial tests.

Ok, that was a false false alarm, because the issue has happened again, although not under the same circumstances, and it occurs more rarely now.

Here is a screenshot of the Nsight Compute results, during execution with CUDA 11.2.2:

It’s the same invalid handle error happening within cublasSgemv, on an RTX 3090 using CUDA 11.2.2.

Can you provide your example code?

I’ve tried to break out the code into a smaller reproducible example but can’t get it to crash in a simple app that calls cublasSgemv with the same data (as explained in the OP). The issue happens predictably and regularly within my larger application, though. The larger application is 3ds Max, and my code is a plugin for it. Once Max is installed and my plugin is loaded, it takes only a few seconds to trigger the crash in CUDA 11.2, and a couple of minutes in 11.2.2, on an RTX 3090. I could provide more detailed steps for reproducing the issue if you’d be interested in exploring this avenue of testing.

Here’s the code in question

if (cublas_handle)
{
    tfcErrorCheck(cublasSgemv(
        cublas_handle, // confirmed valid
        CUBLAS_OP_N,
        M,             // value between 50-200 when crash occurs
        N,             // value between 50-200 when crash occurs
        alpha,         // host pointer with value of 1
        mat1,          // M*N matrix
        M,             // lda (leading dimension of mat1)
        mat2,          // N vector
        1,             // incx
        beta,          // host pointer with value of 0
        matRes,        // M vector
        1));           // incy
}

I’ve quadruple checked the sizes of mat1/mat2/matRes, and even if I vastly over-allocate them (initializing the extra buffer to 0s) the issue still occurs (and, as mentioned before, this code never triggers an issue on non-Ampere devices in my own testing, even after thousands of iterations using the same data).
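Since cudaErrorInvalidResourceHandle usually points at the handle/stream/device association rather than the data itself, one illustrative pre-call sanity check (hypothetical debugging code, not part of the plugin; tfcErrorCheck-style names and the buffers are taken from the snippet above) would be:

// Illustrative pre-call check: confirm the handle still answers queries, that
// the pointer mode matches the host-side alpha/beta, and that mat1 really is
// a device pointer on the currently active device. This is a fragment meant
// to sit right before the cublasSgemv call.
cublasPointerMode_t mode;
cublasStatus_t st = cublasGetPointerMode(cublas_handle, &mode);
if (st != CUBLAS_STATUS_SUCCESS || mode != CUBLAS_POINTER_MODE_HOST)
    printf("handle query failed (status %d) or pointer mode is not HOST (%d)\n", (int)st, (int)mode);

cudaPointerAttributes attr{};
if (cudaPointerGetAttributes(&attr, mat1) != cudaSuccess || attr.type != cudaMemoryTypeDevice)
    printf("mat1 is not a valid device pointer\n");

int currentDevice = -1;
cudaGetDevice(&currentDevice);
if (attr.type == cudaMemoryTypeDevice && attr.device != currentDevice)
    printf("mat1 lives on device %d but device %d is current\n", attr.device, currentDevice);

A cuBLAS handle is bound to the device that was current when cublasCreate was called, so if the host application (or another plugin) switches devices or contexts between iterations, a previously valid handle can start producing errors like this.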

Yes please, we’ll need a reproducer to file a bug.

Can you try logging errors with cuBLAS?
CUBLASLT_LOG_LEVEL=1
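For example, since the plugin runs inside 3ds Max and you may not control its launch environment, the variables could be set programmatically before the first cuBLAS call (a sketch using the MSVC CRT; the log file name is just an example):

// Illustrative sketch: enable cuBLASLt logging from inside the plugin before
// any cuBLAS/cuBLASLt call is made. The file name is arbitrary.
#include <stdlib.h>

static void enableCublasLtLogging()
{
    _putenv_s("CUBLASLT_LOG_LEVEL", "1");               // 1 = log errors only
    _putenv_s("CUBLASLT_LOG_FILE", "cublaslt_log.txt"); // optional: log to a file instead of stdout
}

This assumes the variables are picked up when cuBLASLt initializes; setting them in the environment before launching 3ds Max is the more certain option.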

Will cublasLt logging affect cublas calls?

Yes, cuBLAS uses cuBLASLt internally, so the logging applies to those calls as well.