CUDA 4.2 + GeForce GTX 680 (Capability 3.0) + Fedora 14

Hi all,

I have a problem running my CUDA program on my new GeForce GTX 680.

To establish a CUDA context, I call “cudaFree(0);” to force its creation. This works on all of my GPUs (Tesla 1060, 2050, 2070, 2075, 2090 and GeForce GTX 580) except the GeForce GTX 680 (the only one with CUDA Capability 3.0), where the call blocks and the program hangs.

If I comment out this call (“cudaFree(0);”), the same thing happens at the first CUDA API call: it blocks, but only when I am using the GeForce GTX 680.

Maybe it is a driver problem?

NOTE: The system has Fedora 14 64-bit + CUDA 4.2 installed (devdriver_295.41, Toolkit_4.2.9 and SDK_4.2.9).
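
For reference, the context-forcing call described above could look roughly like the sketch below. The helper name force_context is hypothetical (it is not from the original program). Printing the driver and runtime versions and flushing stdout before cudaFree(0) is only an assumption about what helps when the call hangs: the last line printed then shows where the program stopped.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report versions and device info, then force
// context creation with cudaFree(0). Returns the status of that call.
cudaError_t force_context(int device)
{
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);    // version supported by the installed driver
    cudaRuntimeGetVersion(&runtime);  // version of the CUDA runtime in use
    std::printf("driver API %d, runtime API %d\n", driver, runtime);

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;
    std::printf("device %d: %s (capability %d.%d)\n",
                device, prop.name, prop.major, prop.minor);

    err = cudaSetDevice(device);
    if (err != cudaSuccess) return err;

    std::printf("creating context...\n");
    std::fflush(stdout);              // keep this visible if the next call blocks
    err = cudaFree(0);                // forces context creation on the device
    std::printf("cudaFree(0): %s\n", cudaGetErrorString(err));
    return err;
}

int main()
{
    return force_context(0) == cudaSuccess ? 0 : 1;
}
```

If the output stops at “creating context...”, the hang is in context creation itself rather than in a later call.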


Problem solved.

The program was blocking because we were not using the latest version of the CUDPP library. We switched to version 2.0 and now it runs. But now the problem is that the program runs more slowly with the new driver than with the old one.
With driver 270.41.19 it is much faster than with drivers 285.05.33, 295.40, 295.41 and 295.49.

Is it a driver problem? Any ideas, please?

Is it a question of driver or CUDA version?

Does the program run your own kernels or just CUDPP calls?

Try to isolate the slowdown to a kernel/cudpp call and let us know.
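
One possible way to do that isolation is to bracket each suspect launch with CUDA events, as in the sketch below. The kernel scale and the helper time_scale_kernel are placeholders invented for illustration; the idea is to substitute one of your own kernels or a CUDPP call between the two cudaEventRecord calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for one of the program's own kernels.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: time one launch of the placeholder kernel.
// Returns elapsed GPU time in milliseconds, or -1.0f on error.
float time_scale_kernel(int n)
{
    float *d = 0;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start, 0);               // marker before the work
    scale<<<blocks, threads>>>(d, n, 2.0f);  // the work being measured
    cudaEventRecord(stop, 0);                // marker after the work
    cudaEventSynchronize(stop);              // wait until the stop marker is reached

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}

int main()
{
    std::printf("kernel time: %.3f ms\n", time_scale_kernel(1 << 20));
    return 0;
}
```

Running the same measurement under both driver versions shows whether a given kernel or library call is responsible, or whether the extra time is spent elsewhere.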


Well, the first thing to say is that the Kepler architecture does not work very well with CUDA and GPGPU… some tests report that a GTX 580 is faster than a GTX 680 in CUDA programs, even though the 680 has three times as many CUDA cores… So you can try removing all the drivers, SDK and toolkit and reinstalling everything with the latest release, to get a clean installation… Anyway, it is a little late to say this, but you should prefer a 500 series card over a 600 series card for running CUDA programs…

Hi again,

Thanks for your advice. I have run some tests, but what happens is very strange…

First, I ran the program on a GTX 580 with CUDA 4.1 and devdriver 270.41.19. Second, I ran the same program on a GTX 580 with CUDA 4.2 and NVIDIA driver 295.53.

I measured the execution time of one of the kernels, the CUDPP calls and the cudaMemcpy calls. In both cases, these times are very similar. But the total run time of the program in the second case is five times higher than in the first.

The program code is the same in both cases; only the driver, toolkit and SDK versions change…
Any ideas? I’m absolutely lost…

Thank you very much.
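
When the individual phases match but the total does not, it can help to time the whole run on the host, including the very first CUDA call, and see how much time is left unaccounted for. The following is a minimal sketch under that assumption; the Timings struct and the measure helper are hypothetical, and the single copy stands in for the real program's work.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Wall-clock seconds from a monotonic clock (arbitrary starting point).
static double now()
{
    using namespace std::chrono;
    return duration<double>(steady_clock::now().time_since_epoch()).count();
}

// Hypothetical result record: seconds spent per phase.
struct Timings { double init, copy, total; };

Timings measure(size_t n)
{
    std::vector<float> host(n, 1.0f);
    double t0 = now();

    cudaFree(0);                       // first CUDA call: driver load + context creation
    double t1 = now();

    float *dev = 0;
    cudaMalloc((void **)&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();           // make sure the device work has finished
    double t2 = now();

    cudaFree(dev);
    double t3 = now();

    Timings t;
    t.init  = t1 - t0;
    t.copy  = t2 - t1;
    t.total = t3 - t0;
    return t;
}

int main()
{
    Timings t = measure(1 << 22);
    std::printf("context init : %.3f s\n", t.init);
    std::printf("alloc + copy : %.3f s\n", t.copy);
    std::printf("total        : %.3f s\n", t.total);
    std::printf("unaccounted  : %.3f s\n", t.total - t.init - t.copy);
    return 0;
}
```

If most of the difference between the two setups shows up in the “context init” line, the slowdown is in start-up rather than in the kernels, CUDPP or the copies.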

Do you benchmark your application with “time <your_program_here>”?

Consider that if you are not running an X server or any other application that keeps the driver loaded, your program has to load the driver first.

Maybe the new drivers take longer to load than the old ones; try running your benchmark with “time” while keeping the driver loaded. To do so, you can for example create an almost empty main with a single CUDA call (cudaFree(0) should be enough) followed by a long sleep.
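
A minimal sketch of such a keep-alive program follows. The helper name hold_context, the optional command-line argument and the one-hour default are assumptions made for illustration; sleep comes from unistd.h, so this targets Linux.

```cuda
#include <cstdio>
#include <cstdlib>
#include <unistd.h>
#include <cuda_runtime.h>

// Hypothetical helper: create a CUDA context and hold it for `seconds`
// seconds, so the driver stays loaded while other programs run.
cudaError_t hold_context(unsigned int seconds)
{
    cudaError_t err = cudaFree(0);     // forces driver load and context creation
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFree(0) failed: %s\n", cudaGetErrorString(err));
        return err;
    }
    std::printf("context created, sleeping for %u s\n", seconds);
    std::fflush(stdout);
    sleep(seconds);                    // keep the process, and its context, alive
    return cudaSuccess;
}

int main(int argc, char **argv)
{
    // Optional first argument: how long to stay alive, in seconds.
    unsigned int seconds =
        (argc > 1) ? (unsigned int)std::strtoul(argv[1], 0, 10) : 3600u;
    return hold_context(seconds) == cudaSuccess ? 0 : 1;
}
```

Start it in the background, then run the benchmark with “time” in another shell; if the total run time drops to the old level, the difference was driver load time.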