Got out of memory from cudaMemcpy

I see the same issue on my CUDA 11.3 machine. I suggest filing a bug.

You already have a possible workaround: call cudaSetDevice() to select the device that owns the relevant pointer before the cudaMemcpy() operation.
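
A minimal sketch of that workaround, in case it helps. The helper name `memcpyOnOwningDevice` is my own (hypothetical, not part of the CUDA API), and I am assuming the pointer in question is an ordinary device allocation made with cudaMalloc(). It asks the runtime which device owns the pointer, makes that device current, does the copy, and then restores whichever device was current before:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: make the device that owns `dev_ptr` current,
// perform the copy, then restore the previously current device.
static cudaError_t memcpyOnOwningDevice(void* dst, const void* src, size_t bytes,
                                        cudaMemcpyKind kind, const void* dev_ptr) {
    int previous = 0;
    cudaError_t err = cudaGetDevice(&previous);
    if (err != cudaSuccess) return err;

    // Ask the runtime which device the allocation lives on.
    cudaPointerAttributes attr;
    err = cudaPointerGetAttributes(&attr, dev_ptr);
    if (err != cudaSuccess) return err;

    // Only switch devices for plain device allocations (assumption of this sketch).
    if (attr.type == cudaMemoryTypeDevice) {
        err = cudaSetDevice(attr.device);
        if (err != cudaSuccess) return err;
    }

    const cudaError_t copyErr = cudaMemcpy(dst, src, bytes, kind);

    // Restore the caller's device regardless of how the copy went.
    const cudaError_t restoreErr = cudaSetDevice(previous);
    return (copyErr != cudaSuccess) ? copyErr : restoreErr;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("no CUDA device available\n");
        return 1;
    }

    const int owner = count - 1;  // allocate on the last device
    const size_t bytes = (1 << 20) * sizeof(float);
    float* d_buf = nullptr;

    cudaSetDevice(owner);
    if (cudaMalloc(reinterpret_cast<void**>(&d_buf), bytes) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_buf, 0, bytes);

    // With more than one GPU, a different device is now current.
    cudaSetDevice(0);

    std::vector<float> h_buf(bytes / sizeof(float));
    const cudaError_t err = memcpyOnOwningDevice(h_buf.data(), d_buf, bytes,
                                                 cudaMemcpyDeviceToHost, d_buf);
    std::printf("copy status: %s\n", cudaGetErrorString(err));

    cudaSetDevice(owner);
    cudaFree(d_buf);
    return err == cudaSuccess ? 0 : 1;
}
```

If you already know which device each buffer was allocated on, you can skip the cudaPointerGetAttributes() lookup and call cudaSetDevice() with that device index directly.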