This happens only if the program is being profiled under nvprof. I do call cudaDeviceSynchronize() before calling cudaProfilerStop(). The hang happens either if I explicitly call cudaProfilerStop() or cudaDeviceReset(). I free any previous memory allocation before attempting a cudaProfilerStop(). Here is what the profile output looks like http://i.imgur.com/mD1nPBG.png : you can see I upload some memory, run kernels, download the result to host, then there’s the huge blank space which is the host thread blocked. The last “Runtime API” command is cudaDeviceReset(), but the fact that it appears after the huge blank is bogus, I could confirm that the host thread is already in cudaDeviceReset() the whole time.