Error 13 from cublas Sgemv() call using 680m, but no error when using Tesla K20

I have a CUDA mex file which I access from MATLAB, and it works fine when I am using a K20; but when I run the same code on a different laptop (compiled for compute capability 3.0 instead of 3.5), I get the infamous INTERNAL_ERROR from a cublas Sgemv() call.

I have already searched the forums and understand in general terms when this error shows up, but in this case the code works fine on the 3.5 machine and crashes on the 3.0 one.

The code gets through all the memory allocations, and through at least 10-15 Sgemm() and Strsm() calls, before it exits from a Sgemv() call.

Every GPU command in the code checks for errors, and the code runs fine on the K20 but not on the 680 in the laptop.

This error seems very general, so what are some possible causes in this case, which would show up only with the 680 but not the K20?

The 680 is the only GPU in the laptop, which runs Windows 7, Visual Studio 2010, and MATLAB 2011b.

Unless I am looking at the wrong header file, error 13 is CUBLAS_STATUS_EXECUTION_FAILED. This means the GPU kernel inside the CUBLAS call failed. Likely causes: an unspecified launch failure, or a timeout (the kernel was killed by the watchdog timer).

Based on your description, I would think you are hitting the latter. You could try smaller matrix sizes to confirm, or use a GPU that is not running the display (the operating system watchdog timer is there to ensure the GUI doesn’t freeze for more than a specified time limit, usually 2-5 seconds).

The K20 never drives a display, so there is no watchdog timer associated with it and you can therefore run compute kernels of any duration you desire.
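For what it's worth, on Windows the watchdog (TDR) timeout is controlled by documented registry values under the GraphicsDrivers key. A sketch that lengthens the timeout to 60 seconds (0x3c); a reboot is required for the change to take effect:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000003c
"TdrDdiDelay"=dword:0000003c
```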

Still having this issue, even though I did adjust the watchdog timer in the registry.

The data size is not very large (the matrix A is 768 x 128), but I still get the same error on the Sgemv() call.

This is a link to the exact laptop machine which is causing the issue:

When I adjust the mex interface to do just a single Sgemv(), it works fine, and this particular code runs without error from MATLAB on two other desktops with a Tesla K20 using the TCC driver.

So I am assuming there may be some system-wide interrupt or interference with the call, since there is only that one GPU, which also drives the active video output.

What else can I do to narrow down the problem?

Generally speaking, I would triple-check the pointers and other arguments passed into cublasSgemv(). If the kernel does not die from a watchdog timeout, it is probably being killed by an unspecified launch failure, which indicates an out-of-bounds memory access. This could be due to a bad pointer, an incorrect transpose mode, or an inadvertent swap of the dimensions.

The other angle of attack is the fact that it works fine with a single call to cublasSgemv(). So does it work with 2, 3, …, n calls? If not, what is the smallest n for which it does not work? What are the salient differences between running with n-1 and n calls? For example, are the matrices passed in different calls all the same size? I would cut the failing case down to a minimum of code and it will likely become apparent what the issue is.

Correct me if I am wrong, but the GTX 680 is an sm_30 device, while the Tesla K20 is an sm_35 device. This means that the kernels invoked by cublasSgemv() are physically different on the two platforms. They should be functionally equivalent, though, unless there is a bug. I would consider a bug in reasonably common CUBLAS functions unlikely at this stage, but of course the possibility can never be excluded. Therefore I would focus the initial investigation on the validity of the inputs passed into cublasSgemv().
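As a side note, one binary can carry device code for both architectures by passing multiple -gencode options to nvcc. A sketch (the flags are standard nvcc options; how they are wired into the mex build varies by setup):

```
nvcc -c kernels.cu \
     -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_35,code=compute_35
```

The last line also embeds PTX so the fat binary can be JIT-compiled on devices newer than sm_35.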