Partial solution to the driver problems

I have made some progress understanding the driver problems.

One key observation is that one can only launch kernels with a limited number of thread blocks without compromising the stability of the machine.

foo<<<M, N>>>(…);

The limits are M <= 16 and N <= 256, where M is the number of thread blocks and N is the number of threads per block.

With M=16 and N=256 I was able to run all my simulations. Although this is not an optimal kernel configuration, at least I now get an indication of the performance of the double precision hardware.
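For anyone who wants to stay under such a cap without rewriting every kernel by hand, here is a minimal sketch of one possible approach: a grid-stride loop, where each thread processes several elements so that a small fixed grid covers any problem size. The kernel `scale`, the helpers `capped_blocks` and `scale_on_device`, and the two constants are hypothetical names made up for illustration. This is not the code from my simulations.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Caps taken from the post above: M <= 16 blocks, N <= 256 threads per block.
const int kMaxBlocks = 16;
const int kThreadsPerBlock = 256;

// Hypothetical kernel: each thread walks the array in steps of the total
// thread count, so a small fixed grid can cover any n.
// Doubles need an architecture with double precision support (sm_13 in this thread).
__global__ void scale(double *x, double a, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= a;
}

// Number of blocks needed for n elements, clamped to the range [1, kMaxBlocks].
int capped_blocks(int n)
{
    int blocks = (n + kThreadsPerBlock - 1) / kThreadsPerBlock;
    if (blocks < 1) blocks = 1;
    return blocks > kMaxBlocks ? kMaxBlocks : blocks;
}

// Host wrapper: copies the data to the device, launches the capped grid,
// copies the result back, and returns the first CUDA error it encountered.
cudaError_t scale_on_device(std::vector<double> &host, double a)
{
    int n = (int)host.size();
    size_t bytes = n * sizeof(double);
    double *dev = NULL;

    cudaError_t err = cudaMalloc((void **)&dev, bytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        scale<<<capped_blocks(n), kThreadsPerBlock>>>(dev, a, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev);
    return err;
}
```

With this pattern the launch never exceeds 16 x 256 threads, at the cost of each thread doing more work, which matches the "not optimal, but it runs" trade-off described above.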

My tests indicate that increasing the number of thread blocks M increases the probability of the machine locking up. So there seems to be a bug in the thread batching system of the driver.

Please report your experience with the driver instability!

I’ve been running kernels with 12500 blocks, each with 128 threads, so far with no problem on the T10P. (The slower version with 192 stream processors and 1.08 GHz clock.)

I have a much more complicated kernel that generates an unspecified launch failure if I compile it with -arch sm_13, but I haven’t had time to track down the cause of that yet. The kernel runs fine if I compile with sm_10. So far no system lockups yet.
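In case it helps others narrow down similar failures, this is roughly how one could separate a bad launch configuration from a failure during kernel execution. It is only a sketch: the kernel `fill` and the helpers `check` and `launch_checked` are hypothetical stand-ins, not my actual kernel, and `cudaDeviceSynchronize` is the name in current toolkits (older toolkits used `cudaThreadSynchronize` for the same purpose).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: writes v into every element of out.
__global__ void fill(double *out, double v, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = v;
}

// Returns true on success; otherwise prints which step reported the error.
bool check(cudaError_t err, const char *what)
{
    if (err == cudaSuccess) return true;
    fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    return false;
}

// Launches fill<<<blocks, threads>>> and reports whether it completed cleanly.
bool launch_checked(int blocks, int threads, int n)
{
    double *dev = NULL;
    if (!check(cudaMalloc((void **)&dev, n * sizeof(double)), "cudaMalloc"))
        return false;

    fill<<<blocks, threads>>>(dev, 1.0, n);

    // Errors detected at launch time (e.g. an invalid configuration) show up here.
    bool ok = check(cudaGetLastError(), "launch");

    // Errors raised while the kernel runs only show up once the host waits for it.
    ok = ok && check(cudaDeviceSynchronize(), "execution");

    cudaFree(dev);
    return ok;
}
```

Checking both points at least tells you whether the error is reported at launch or only after the kernel has started running.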

(RHEL5, 64-bit Linux)

Thank you for your reply!

Since I am primarily testing double precision algorithms I have to use the -arch sm_13 option. I even rewrote some of my kernels to fit the foo<<<M, N>>>(…); pattern. The driver instability occurs when one launches hundreds or thousands of these kernels during the execution of the algorithm. Launching only a few kernels does not cause problems most of the time.
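To make the "many launches" case easier to reproduce, something like the following loop could be used. It is a rough sketch under my own assumptions: the kernel `relax` and the function `first_failing_launch` are hypothetical and only stand in for a real algorithm step, and the loop synchronizes after every launch so that the first failing iteration can be identified.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical double precision kernel standing in for one algorithm step.
__global__ void relax(double *x, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] = 0.5 * x[i] + 1.0;
}

// Launches the kernel `launches` times with the given configuration.
// Returns -1 if every launch succeeded, -2 if device setup failed,
// otherwise the index of the first launch that reported an error.
int first_failing_launch(int launches, int blocks, int threads, int n)
{
    double *dev = NULL;
    if (cudaMalloc((void **)&dev, n * sizeof(double)) != cudaSuccess)
        return -2;
    if (cudaMemset(dev, 0, n * sizeof(double)) != cudaSuccess) {
        cudaFree(dev);
        return -2;
    }

    int failed = -1;
    for (int k = 0; k < launches && failed < 0; ++k) {
        relax<<<blocks, threads>>>(dev, n);
        cudaError_t err = cudaGetLastError();
        // Wait for the kernel so an execution error is tied to this iteration.
        // (Older toolkits call this cudaThreadSynchronize.)
        if (err == cudaSuccess) err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            fprintf(stderr, "launch %d failed: %s\n", k, cudaGetErrorString(err));
            failed = k;
        }
    }

    cudaFree(dev);
    return failed;
}
```

For example, `first_failing_launch(5000, 16, 256, 1 << 20)` would repeat the M=16, N=256 configuration a few thousand times. A hard lockup of the machine will of course not be reported this way, only errors that the runtime can return.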

It seems that you have the same hardware. Can you test some of the CUDA SDK samples? (alignedTypes, particles should crash the machine.)

It’s possible that this is a problem with your specific board or with an insufficient power supply. Can you try the board in another machine with a beefier PSU?


I’ve had similar problems with my D10P using -arch sm_13 and kernels using doubles. I was not able to run reliably with grids larger than <<<32,16>>> (my kernel uses a lot of registers). I filed a bug with Nvidia; they reproduced the bug using my code and said that the next driver to be released after 177.11 will fix my problem. Let’s wait and hope :)

The board is stable with much larger grids when using floats instead of doubles. With -arch sm_11 or -arch sm_13, I can get up to <<<32,64>>>, which is where I expect optimal performance based on register usage.

I even installed an additional power supply, delivering 250W exclusively to the Tesla board. I did some stress testing and ran my simulation for 30 minutes and everything worked fine. So the system seems to be ok.

But which kernel configurations are stable is not really deterministic. After a small change to the algorithm, even the limited kernel configuration from above is no longer stable.

Since I cannot get any of my simulations working in a reliable way, I suspect there is a hardware problem with the board.

I am now getting a replacement board from Nvidia.


Could you try the alignedTypes sample from the CUDA SDK 2.0b and see if it locks up your system?

In any case I have now requested a replacement board from Nvidia.

I just ran alignedTypes twice and it runs. It reports “TEST FAILED” for all of the subtests, with a copy throughput of around 8.6 GB/s.