Have you tried calling cudaDeviceSynchronize() right after the kernel launch?
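
In case it helps, here is a minimal sketch of what I mean. I don't know what your code looks like, so the kernel (`fill_indices`), the helper (`run_demo`), and the sizes are all made up for illustration. Kernel launches are asynchronous with respect to the host, so without a synchronization point the host can move on (or exit) before the kernel has finished, and device-side printf output may never be flushed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration only: each thread writes its global index.
__global__ void fill_indices(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
    if (i == 0) printf("hello from the kernel\n");
}

// Hypothetical helper: returns 0 on success, 1 on any CUDA error.
int run_demo() {
    const int n = 256;
    int *d_out = nullptr;

    cudaError_t err = cudaMalloc(&d_out, n * sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The launch returns to the host immediately; the kernel runs asynchronously.
    fill_indices<<<(n + 127) / 128, 128>>>(d_out, n);

    // Catches launch-time errors such as an invalid launch configuration.
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    // Blocks the host until the kernel has finished. This also flushes
    // device-side printf output and reports errors raised during execution.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    cudaFree(d_out);
    return 0;
}

int main() {
    return run_demo();
}
```

If the synchronize call returns an error, that usually points at something going wrong inside the kernel itself rather than at the launch.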
Quoting Robert Crovella, from another thread: