I have made some progress understanding the driver problems.

One key observation is that kernels can only be launched with a limited number of thread blocks and threads per block without compromising the stability of the machine.

foo<<<M, N>>>(…);

The limits are: M <= 16 (thread blocks) and N <= 256 (threads per block)

With M=16 and N=256 I was able to run all my simulations. Although this is not an optimal kernel configuration, at least I now get an indication of the performance of the double-precision hardware.
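For anyone who wants to stay within these limits on larger problems, here is a minimal sketch of one way to do it: a grid-stride loop, so that a fixed 16 x 256 launch covers an array of any length. The kernel name scale, the problem size n and the scaling operation are my own placeholders for illustration, not the simulation code I actually ran, and the sketch uses the current runtime API names.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies n doubles by a, using a grid-stride loop
// so that a small, fixed grid can process an array of any length.
__global__ void scale(double *x, double a, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= a;
}

int main()
{
    const int M = 16;           // thread blocks (upper limit reported in this post)
    const int N = 256;          // threads per block (upper limit reported in this post)
    const size_t n = 1 << 20;   // assumed problem size, much larger than M * N

    std::vector<double> h_x(n, 1.0);
    double *d_x = NULL;

    cudaError_t err = cudaMalloc((void **)&d_x, n * sizeof(double));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(d_x, h_x.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    scale<<<M, N>>>(d_x, 2.0, n);

    // Check both the launch itself and the kernel execution.
    err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_x);
        return 1;
    }

    cudaMemcpy(h_x.data(), d_x, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_x);

    // Every element should have been doubled exactly once.
    bool ok = true;
    for (size_t i = 0; i < n; ++i)
        if (h_x[i] != 2.0) { ok = false; break; }
    printf("%s\n", ok ? "all elements scaled" : "mismatch");
    return ok ? 0 : 1;
}
```

Each thread handles several elements here, which is part of why the configuration is not optimal, but the launch itself never exceeds M=16 and N=256. Note that the error checks only catch failures the runtime reports; they cannot detect a machine lockup.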

My tests indicate that increasing the number of thread blocks M increases the probability of the machine locking up. So there seems to be a bug in the thread batching system of the driver.

Please report your experience with the driver instability!