CUBLAS launch timeouts

Hi all,

My program reads two large matrices from files on the hard disk, transfers them into memory and multiplies them using cublasSgemm(). The program runs fine for relatively small matrices, but crashes for larger matrices at the cublasSgemm() call with the error message “Cuda error: the launch timed out and was terminated”. (I was reading the CUBLAS error before, but it just says CUBLAS_INTERNAL_ERROR, so I called cudaGetLastError() instead.)

I am doing this on a 9400 GT and CUDA 2.0…

I found an old thread which discusses this same error. The suggestions that were given on that thread about what could be causing this problem were…

– Since my code only has CUBLAS functions and no CUDA kernels, the .cubin file shows no details about shared memory or register usage. Does anyone know how I can find out how many registers and how much shared memory the CUBLAS function might be using? Regardless, I did compile my code with nvcc using -maxrregcount as suggested, with a max of 4 registers (that’s the minimum, right?). But that doesn’t fix the problem.

– About the 5 s watchdog mechanism: I am driving my display with a separate graphics card (an FX 5200) and running CUDA on another. So the watchdog shouldn’t be an issue, right?

– The only other problem I could think of was that I am putting too much data onto the graphics card, so cublasSgemm() doesn’t run because of insufficient memory. But the code crashes for runs in which the total data should take up only 0.27 GB by my calculation. The program also crashes on and off for smaller matrices, so that doesn’t seem to be the issue either.

I am out of any other ideas. I would really appreciate any suggestion you might have about this…

Some specifics about the data sizes that I was running in the program… All are matrices of floats.

A 4410 x 2010 matrix times a 1232 x 2010 matrix - Always runs successfully.

A 4410 x 2010 matrix times a 7000 x 2010 matrix - Runs successfully most of the time. But crashes sometimes.

A 4410 x 2010 matrix times a 9000 x 2010 matrix - Crashes all the time. Ran once after I compiled the code with -maxrregcount 4, but crashed when I tried to run it again immediately afterwards… So I don’t think that was related.

Crashes for all bigger matrices.
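For reference, the 0.27 GB figure mentioned above can be checked with a few lines of arithmetic. This is only a sketch: it assumes that the second matrix is used transposed (so the product is M x N with K = 2010), that only the two inputs and the result live on the card, and that nothing else is allocated. The function name is made up for illustration.

```python
BYTES_PER_FLOAT = 4  # single precision, as used by cublasSgemm()

def sgemm_footprint_bytes(m, k, n):
    """Bytes needed on the device for A (m x k), B (k x n) and C (m x n)."""
    return BYTES_PER_FLOAT * (m * k + k * n + m * n)

# The three cases from the post: M = 4410, K = 2010, N varies.
for n in (1232, 7000, 9000):
    size = sgemm_footprint_bytes(4410, 2010, n)
    print(f"N = {n}: {size / 1e6:.1f} MB")
```

The largest case comes to roughly 267 MB, which agrees with the estimate in the post, so running out of device memory would only be a plausible cause if the card reports less free memory than that.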

One other detail in case it’s relevant… I get an error message printed out only when I run the code in the CUDA profiler. If I run it from the VC++ command prompt, the program simply runs to completion but returns an empty product matrix.

Please help me out…



What driver are you using? If you can use an FX 5200 and a 9400 GT at the same time, you are using a very old driver. FX 5200 wasn’t even supported in R177.

I am using the “GeForce Release 175 WHQL Version: 175.19”. The release notes say that it supports the GeForce 6, 7, 8 and 9 series (and not the 5 series that I am using)… But this was the driver that the Nvidia website pointed me to when I searched for a driver for the FX 5200.


175.19 doesn’t officially support CUDA 2.0 (there are no drivers that support both CUDA 2.0 and the FX 5200 series) and, more importantly, it probably has the watchdog timer bug where the timer wasn’t disabled on non-display cards. This was fixed back in July or something, though…

That’s surprising… But I have been running simple programs on this card without any issues for the past month…

But the watchdog timer not being disabled on non-display cards would explain all of the troubles that I am having. Can you think of any way I can work around this? I only have one PCI-e slot on my board, so upgrading my display graphics card to a newer one isn’t possible.

How can I find out which driver versions disable the 5 s watchdog timer on non-display cards?


Run Linux, boot directly to a console, and voila, no watchdog. Use ssh to control the machine.

Thanks for your help… Unfortunately, I do not have access to a Linux system with an Nvidia card right now… :(
I am going to see if I can get a live Linux version and try out this program from that… But just to confirm, is there no driver which will run both my graphics cards and have the watchdog timer disabled…? If that is the case, it kinda defeats the whole purpose of me having bought a second graphics card… :(


A decent GPU will execute the SGEMM in less than 1.5 sec (times are including I/O):

On an 8800 GT:
4410 2010 7000 time = 1.2761s MFLOPS= 97250.2188

On a Tesla:
4410 2010 7000 time = 0.7039s MFLOPS= 176301.5938

Also, SGEMM has a fast path if (M%64)==0, (K%16)==0, (N%16)==0. You may want to pad your matrices:
4416 2016 7008 time = 0.3799s MFLOPS= 328416.8125
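As a rough illustration of the padding suggestion, the sketch below rounds each dimension up to the next multiple required by the fast path and recomputes the MFLOPS figure from the timings quoted above. The helper names are hypothetical, and the 2*M*K*N operation count is the usual convention for a matrix multiply; it is assumed here rather than taken from the post.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple

def padded_dims(m, k, n):
    """Dimensions satisfying (M % 64) == 0, (K % 16) == 0, (N % 16) == 0."""
    return round_up(m, 64), round_up(k, 16), round_up(n, 16)

def sgemm_mflops(m, k, n, seconds):
    """Throughput assuming 2*M*K*N floating-point operations per SGEMM."""
    return 2.0 * m * k * n / seconds / 1e6

print(padded_dims(4410, 2010, 7000))  # (4416, 2016, 7008)
print(f"{sgemm_mflops(4410, 2010, 7000, 1.2761):.0f} MFLOPS on the 8800 GT")
```

The extra rows and columns need to be filled with zeros so that they do not change the part of the product you actually want; that part can then be read from the top-left M x N block of the padded result.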

Thanks for the tip, Mfatica. I’ll keep that in mind…

But I am confused now… In that case, it shouldn’t trigger the 5 s watchdog timer even if it is enabled, right? What else could be causing the launch timeouts then…?

8800 GT and Tesla are both significantly faster than the 9400 GT.

Your card has only 16 processors and a 128-bit memory interface. You should get a better card if you really want to use CUDA for production runs.
The one you have is fine for developing and checking results, but that is pretty much it.
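To put a rough number on that, here is a back-of-the-envelope sketch that scales the 8800 GT timing quoted earlier by processor count. It is deliberately naive: it assumes 112 processors for the 8800 GT (a figure not given in this thread), ignores differences in clock speed and memory bandwidth, and uses a timing that includes I/O, so treat the result as an order-of-magnitude guess only.

```python
WATCHDOG_LIMIT_S = 5.0  # the 5 s watchdog discussed in this thread

def scaled_time(reference_seconds, reference_cores, target_cores):
    """Naive estimate: run time scales inversely with processor count."""
    return reference_seconds * reference_cores / target_cores

# 8800 GT timing for 4410 x 2010 x 7000 from the post above; the 112-core
# count for the 8800 GT is an assumption, 16 cores for the 9400 GT is from
# the reply above.
estimate = scaled_time(1.2761, reference_cores=112, target_cores=16)
print(f"estimated 9400 GT time: {estimate:.1f} s")
print("over the watchdog limit:", estimate > WATCHDOG_LIMIT_S)
```

Under these assumptions the 7000-column case lands well above 5 s, which would be consistent with the timeouts seen on the larger matrices if the watchdog is still active on that card.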

Ah… OK. My bad… Thanks…