I have made some progress understanding the driver problems.
The key observation is that kernels can only be launched with a limited number of thread blocks without compromising the stability of the machine.
The limits are M <= 16 and N <= 256, where M is the number of thread blocks.
With M=16 and N=256 I was able to run all my simulations. Although this is not an optimal kernel configuration, it at least gives me an indication of the performance of the double-precision hardware.
My tests indicate that increasing the number of thread blocks M increases the probability of the machine locking up. So there seems to be a bug in the thread batching system of the driver.
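For anyone who wants to stay within these limits regardless of problem size, here is a minimal sketch of one way to do it. It assumes CUDA and assumes that N means threads per block; neither is stated explicitly above. The kernel `daxpy`, the wrapper `launch_daxpy_capped` and the constants are hypothetical names for illustration, not my simulation code. The idea is a grid-stride loop: each thread handles several elements, so a fixed grid of at most 16 x 256 threads can still cover an array of any length.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Launch limits from the observations above.
// ASSUMPTION: N is the number of threads per block.
constexpr int kMaxBlocks = 16;            // M
constexpr int kMaxThreadsPerBlock = 256;  // N

// Hypothetical double-precision kernel: y[i] = a * x[i] + y[i].
// The grid-stride loop lets a small, fixed grid process all n elements.
__global__ void daxpy(int n, double a, const double* x, double* y)
{
    int stride = gridDim.x * blockDim.x;  // total number of threads launched
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        y[i] = a * x[i] + y[i];
    }
}

// Host wrapper: clamps the launch configuration to the limits above and
// returns the first CUDA error encountered (cudaSuccess if none).
cudaError_t launch_daxpy_capped(int n, double a, const double* d_x, double* d_y)
{
    if (n <= 0) return cudaSuccess;       // nothing to do
    assert(d_x != nullptr && d_y != nullptr);

    int threads = kMaxThreadsPerBlock;
    int blocks = (n + threads - 1) / threads;      // blocks needed for one element per thread
    if (blocks > kMaxBlocks) blocks = kMaxBlocks;  // never exceed M

    daxpy<<<blocks, threads>>>(n, a, d_x, d_y);

    cudaError_t err = cudaGetLastError();  // catches launch-configuration errors
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();        // catches errors during execution
}
```

Calling `launch_daxpy_capped(n, a, d_x, d_y)` with device pointers never launches more than 16 blocks of 256 threads. This only works around the lock-ups and, like the configuration above, is unlikely to be optimal for performance.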
Please report your experience with the driver instability!