There are a couple of posts claiming that when cudaMemcpy is used for host-device transfers, the copy goes through the GPU L2 cache.
I was wondering whether that is also the case when cudaMemcpy is used in DeviceToDevice mode within a single GPU, i.e. using cudaMemcpy to copy from array A to array B, with both A and B residing in that GPU's memory.
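To be concrete, here is a minimal sketch of the kind of copy I mean. A and B are the two arrays mentioned above; the helper name `copyDeviceToDevice` and the array size are just placeholders I picked for illustration. It only shows the API call in question and does not by itself reveal whether the copy passes through L2.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Copies n floats from device array A to device array B on the same GPU
// and returns true if B's contents match what was uploaded into A.
// (Helper name and structure are illustrative, not from any library.)
bool copyDeviceToDevice(size_t n) {
    std::vector<float> hostIn(n), hostOut(n, 0.0f);
    for (size_t i = 0; i < n; ++i) hostIn[i] = static_cast<float>(i);

    const size_t bytes = n * sizeof(float);
    float *A = nullptr, *B = nullptr;
    if (cudaMalloc(&A, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&B, bytes) != cudaSuccess) { cudaFree(A); return false; }

    // Fill A from the host, then do the same-GPU copy A -> B.
    bool ok = cudaMemcpy(A, hostIn.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    ok = ok && cudaMemcpy(B, A, bytes, cudaMemcpyDeviceToDevice) == cudaSuccess;

    // Read B back so the result can be checked on the host.
    ok = ok && cudaMemcpy(hostOut.data(), B, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(A);
    cudaFree(B);
    return ok && hostOut == hostIn;
}

int main() {
    const bool ok = copyDeviceToDevice(1 << 20);  // 1M floats, arbitrary size
    std::printf("%s\n", ok ? "B matches A" : "copy failed or mismatch");
    return ok ? 0 : 1;
}
```

I assume a profiler such as Nsight Compute could show the L2 traffic for this copy, but I have not verified that myself.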
Thank you for your help!