While running my CUDA application, I've noticed that copying memory to the device takes almost no time (less than 1 ms for 356 KB), but copying memory back to the host is very slow (~47 ms for only 4 KB). That seems far too slow to me.

I am using streams that write to the same device memory, but at the time of the copy I make sure that no stream will write to that small region. Does cudaMemcpy take into account how streams may or may not be using the memory it is copying? I am not using cudaMemcpyAsync for the device-to-host copy because I need the data as soon as it arrives; and yes, when I timed the copy, I timed the synchronous version, not the async one. I also know this isn't just a timing artifact, because the overall execution of the program is slow and I have narrowed the bottleneck down to this section.

So whether it is exactly ~47 ms doesn't matter. I am just wondering why the device-to-host copy is so much slower (at least 47 times slower for 1/89 the data) and whether there is anything I can do to speed it up, besides waiting longer and transferring more data per cudaMemcpy.
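For context, here is a minimal sketch of the kind of timing harness I mean (buffer names and sizes are placeholders, not my actual code). One thing worth noting: cudaMemcpy is a blocking call, so before the device-to-host copy even starts it waits for any kernels still running ahead of it, and a wall-clock timer around the call would measure that kernel time too. Synchronizing first isolates the copy itself:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t N = 4 * 1024;           // 4 KB result buffer (placeholder size)
    char h_buf[4 * 1024];
    char *d_buf = nullptr;
    cudaMalloc(&d_buf, N);

    // ... kernels launched in one or more streams write to d_buf ...

    cudaDeviceSynchronize();             // drain pending GPU work first,
                                         // so the timer sees only the copy

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(h_buf, d_buf, N, cudaMemcpyDeviceToHost);  // blocking copy
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("D2H copy of %zu bytes took %.3f ms\n", N, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    return 0;
}
```

If the timing with a preceding cudaDeviceSynchronize comes out fast, the ~47 ms would be queued kernel work rather than the transfer itself.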