HELP: CUDA runtime initialization takes up to minutes

Hello,

I am experiencing a really annoying problem with my development of a CUDA ray tracer for non-linear ray tracing.

Up to now I was using a GTX 275 with CUDA toolkit 3.2 on Ubuntu 10.10 and everything went fine. Startup of the application takes approximately 1 to 3 seconds, including memory transfer and allocation of several hundred MB (plus additional OpenGL initialization; all cards are bound to an X server).

For testing purposes I switched to two newer GTX 580 systems (the speedup in computation is incredible), running Ubuntu and Fedora respectively, also with CUDA toolkit 3.2.

On these systems the application gets stuck for several minutes at the creation of the CUDA context. After a really long time the application suddenly returns and runs on as usual. At first I thought this had something to do with allocating and copying image data to texture memory, but after some searching I found out that the waiting time is tied to the first call to a CUDA runtime function that needs the CUDA context and therefore initializes it.

Running cudaSetDevice() and cudaThreadExit() as the first calls to the runtime library executes really quickly, but calling cudaFree(0) or cudaThreadSynchronize() first causes this long stall in whatever CUDA is doing internally. After taking a cup of tea the application is up and running. On the Fedora system, quitting and executing again does not suffer from this long wait, but recompiling the application before executing brings the problem back.
Curiously this behavior does not occur on the GTX 275, so I think it has to be a driver issue. I should mention that my executable is really large (~9 MB in size), but it does fine on the GTX 275 and on subsequent startups on the GTX 580.
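
For anyone who wants to reproduce this, a minimal sketch along these lines isolates the context creation time (assuming Linux, with timing via gettimeofday; the cudaFree(0) is just a common idiom to force eager context creation):

#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    double t0 = seconds();
    cudaSetDevice(0);   /* fast: only selects the device */
    double t1 = seconds();
    cudaFree(0);        /* forces context creation (and module load / JIT) */
    double t2 = seconds();
    printf("cudaSetDevice: %.3f s, context init: %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}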

Does anybody have a clue what is going on here, or is anybody experiencing the same behavior?

kind regards,
daniel

OK, for everybody who is also experiencing a long startup time in a CUDA application: I found the reason in my case!

The long application startup time was spent compiling the embedded virtual PTX code (code=compute_13) to binary GPU code (see just-in-time compilation in the nvcc manual).
I wasn't aware of the fact (shame on me!) that GTX 4xx and later cards are based on the Fermi architecture, which needs code for compute capability 2.0. I had instructed nvcc to generate device code only for compute capability 1.3:

-gencode=arch=compute_13,code="compute_13,sm_13"

The “compute_13” part tells nvcc to embed virtual PTX code in the executable; “sm_13” specifies directly executable machine code.
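
For Fermi cards the fix is to also embed native sm_20 code, along these lines (a sketch; adjust for your toolkit version and the architectures you want to support):

nvcc -gencode=arch=compute_13,code=sm_13 -gencode=arch=compute_20,code="compute_20,sm_20" ...

With that, pre-Fermi cards run the native sm_13 cubin, the GTX 580 runs the native sm_20 cubin, and the embedded compute_20 PTX keeps the binary forward-compatible with future architectures (at the cost of a one-time JIT there).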

On the older GTX 275, the sm_13 code was suitable for direct execution on its GPU.
The newer GTX 580 would need sm_20 code embedded, but the driver is able to compile the virtual PTX code “just in time”, which in my case was rather “just a long time” …
Once the code is JIT-compiled, the driver keeps the binaries in its cache, so further startups are as fast as usual.
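
If you want to verify that the JIT cache is what makes later startups fast, the driver honors a couple of environment variables (my understanding from the CUDA documentation on JIT caching; availability may depend on your driver version, and ./app below is just a placeholder for your executable):

CUDA_CACHE_DISABLE=1 ./app    # disable the compute cache: every start JIT-compiles again
CUDA_FORCE_PTX_JIT=1 ./app    # ignore embedded cubins and always JIT from PTX

By default the cache lives under ~/.nv/ComputeCache, and the cached entries are keyed on the compiled PTX, which would also explain why recompiling the application brings the delay back.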