Intermittent hang with 2 blocks, driver 304.54? GTX 295, 64-bit Linux, CentOS 5.

Has anyone else experienced problems using small grids?
The GPU has 30 SMs, and every so often I hit a problem which eventually leads to
the Unix host process being terminated after 4 minutes (limit cputime 200).
Surprisingly, the partial output generated suggests that the previous kernel
has terminated without error and the next one has not started yet. I.e. it
appears the problem manifests while the program is in host code, which is why
I wondered if it might be a driver problem rather than a kernel problem.
The problem is undiagnosed :-( but seems to be associated with using only
1, 2 or 4 of the SMs (grid = (1,1), (2,1), (2,1) and (2,2)).
(The thread blocks are (128,1), (192,1), (256,1), (384,1) and (512,1);
supposedly the kernel does not use more than 320 threads.)

When run under cuda-memcheck, no errors are reported.
I can't see any addressing errors.

As always, any help or comments are most welcome.
Thank you

Any thoughts on how I might diagnose this?
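
One possible first step, sketched here under assumptions: since stdout is
normally fully buffered when redirected to a file, and kernel launches are
asynchronous with respect to the host, the partial output may lag behind the
point where the program really stalls. Timestamped checkpoints that are flushed
immediately would pin down which host-side step never returns. The helper name
`checkpoint`, its arguments and the kernel name in the comments are hypothetical
and not taken from the actual program.

```cpp
#include <chrono>
#include <cstdio>
#include <string>

// Hypothetical helper (not from the original program): writes one timestamped
// line to stderr and flushes it at once, so the line is not lost if the
// process is later killed by the cputime limit.
// Returns the formatted line so the caller can also store or compare it.
std::string checkpoint(const char* label, int grid_x, int grid_y, int block_x) {
    using clock = std::chrono::steady_clock;
    // Time origin is fixed at the first call to checkpoint().
    static const clock::time_point start = clock::now();
    const double elapsed =
        std::chrono::duration<double>(clock::now() - start).count();

    char buf[160];
    std::snprintf(buf, sizeof buf, "[%9.3f s] %s grid=(%d,%d) block=(%d,1)",
                  elapsed, label, grid_x, grid_y, block_x);

    std::fprintf(stderr, "%s\n", buf);
    std::fflush(stderr);  // explicit flush, in case stderr has been re-buffered
    return std::string(buf);
}

// Assumed usage around one launch (my_kernel is a placeholder name):
//   checkpoint("before launch", 2, 1, 192);
//   my_kernel<<<dim3(2, 1), dim3(192, 1)>>>(/* arguments */);
//   checkpoint("launch returned", 2, 1, 192);
//   cudaDeviceSynchronize();
//   checkpoint("kernel finished", 2, 1, 192);
```

If the last line in the log is "launch returned", the host is waiting on the
device; if it is "kernel finished", the stall is in later host code. Checking
the return values of the CUDA runtime calls at the same points would complement
this.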