What exactly does the managed memory flag do and what changes?

Have you tried calling cudaDeviceSynchronize() right after the kernel launch?
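
In case it helps, here is a minimal sketch of what I mean. I'm assuming your buffer comes from cudaMallocManaged and that the host reads it right after the launch. The kernel `increment`, the helper `run_demo`, and the sizes are made up for illustration and are not taken from your code. Kernel launches are asynchronous, so without the synchronize the host loop may read the buffer before the kernel has finished (and on some platforms touching managed memory while a kernel is running is an error).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to every element of the buffer.
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns the number of mismatching elements, or -1 on a CUDA error.
int run_demo() {
    const int n = 1024;
    int *data = nullptr;

    // Managed allocation: the same pointer is usable on host and device.
    if (cudaMallocManaged(&data, n * sizeof(int)) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) data[i] = i;

    increment<<<(n + 255) / 256, 256>>>(data, n);
    if (cudaGetLastError() != cudaSuccess) { cudaFree(data); return -1; }

    // The launch returns immediately; wait for the kernel to finish
    // before the host touches the managed buffer again.
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(data); return -1; }

    int errors = 0;
    for (int i = 0; i < n; ++i) {
        if (data[i] != i + 1) ++errors;
    }
    cudaFree(data);
    return errors;
}

int main() {
    int errors = run_demo();
    printf("mismatches: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```

If your results look right with the synchronize in place and wrong (or crash) without it, the problem is the host reading too early rather than anything in the kernel itself.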

Quoting Robert Crovella, from another thread