Hi all,
My program reads two large matrices from files on the hard disk, transfers them into memory and multiplies them using cublasSgemm(). The program runs fine for relatively small matrices, but for larger matrices it crashes at the cublasSgemm() call with the error message “Cuda error: the launch timed out and was terminated”. (I was reading the cublas error before, but it just says CUBLAS_STATUS_INTERNAL_ERROR, so I called cudaGetLastError() instead.)
I am running this on a 9400 GT with CUDA 2.0.
I found an old thread which discusses this same error: http://forums.nvidia.com/lofiversion/index.php?t39652.html. The suggestions given in that thread about what could be causing this problem, and what I found for each, were:
– Since my code only has cublas functions and no CUDA kernels of its own, the .cubin file shows no details about shared memory or register usage. Does anyone know how I can find out how many registers and how much shared memory the cublas function is using? Regardless, I did compile my code with nvcc using -maxrregcount, as suggested, with a maximum of 4 registers (that’s the minimum, right?). But that doesn’t fix the problem.
– About the 5 s watchdog mechanism: I am driving my display with a separate graphics card (an FX 5200) and running CUDA on the other. So the watchdog shouldn’t be an issue, right?
– The only other problem I could think of was that I am putting too much data onto the graphics card, so cublasSgemm() doesn’t run because of insufficient memory. But the code crashes for runs in which the total data should only take up 0.27 GB by my calculation. Also, the program crashes on and off for smaller matrices too, so that doesn’t seem to be the issue either.
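For reference, the 0.27 GB estimate can be reproduced with a small host-side helper. This is only a minimal sketch in plain C: the function names are hypothetical, and it assumes the largest failing case is a 4410 x 2010 matrix times a 9000 x 2010 matrix (second operand transposed in the call), with the 4410 x 9000 product also held on the card.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to hold a rows x cols matrix of single-precision floats. */
static size_t matrix_bytes(size_t rows, size_t cols)
{
    return rows * cols * sizeof(float);
}

/* Total device memory for C = A * B^T, where A is m x k, B is n x k and
 * C is m x n. This layout is an assumption made for illustration. */
static size_t gemm_footprint_bytes(size_t m, size_t n, size_t k)
{
    assert(m > 0 && n > 0 && k > 0);
    return matrix_bytes(m, k) + matrix_bytes(n, k) + matrix_bytes(m, n);
}
```

Calling gemm_footprint_bytes(4410, 9000, 2010) gives 266,576,400 bytes, i.e. about 0.27 GB, of which the product matrix alone accounts for roughly 0.16 GB.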
I am out of ideas, and I would really appreciate any suggestions you might have about this.
Some specifics about the data sizes that I was running in the program… All are matrices of floats.
A 4410 x 2010 matrix times a 1232 x 2010 matrix - Always runs successfully.
A 4410 x 2010 matrix times a 7000 x 2010 matrix - Runs successfully most of the time, but crashes sometimes.
A 4410 x 2010 matrix times a 9000 x 2010 matrix - Crashes all the time. It ran once after I compiled the code with -maxrregcount 4, but crashed when I tried to run it again immediately afterwards, so I don’t think that was related.
It crashes for all bigger matrices.
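To put these cases side by side, the amount of arithmetic in each multiplication can be estimated on the host. The sketch below is plain C with hypothetical function names; it assumes the inner dimension is 2010 in every case, and the sustained rate passed to gemm_seconds is an input to be measured or guessed, not a known figure for this card.

```c
#include <assert.h>

/* Floating-point operations for an (m x k) times (k x n) product:
 * each of the m*n outputs needs k multiplies and k adds. */
static double gemm_flops(double m, double n, double k)
{
    return 2.0 * m * n * k;
}

/* Rough run time in seconds at a sustained rate given in GFLOP/s.
 * The rate is a hypothetical input, not a measured value. */
static double gemm_seconds(double flops, double sustained_gflops)
{
    assert(sustained_gflops > 0.0);
    return flops / (sustained_gflops * 1e9);
}
```

The three cases come to about 21.8, 124.1 and 159.6 GFLOP respectively. With a purely illustrative rate of 30 GFLOP/s (not a measurement of the 9400 GT), gemm_seconds would give roughly 0.7 s, 4.1 s and 5.3 s for a single cublasSgemm() call.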
One other detail, in case it’s relevant: I get an error message printed out only when I run the code in the CUDA profiler. If I run it from the VC++ command prompt, the program simply runs to completion but returns an empty product matrix.
Please help me out…
thanks,
Avinash