Cross-posted on Stack Overflow: "Compute and Data transfer not happening concurrently in cuda Streams on Iteration 2"
It may be this: "Persistent Kernel does not work properly on some GPUs" (see post #5 by Robert_Crovella)