cudaMemcpy DeviceToDevice and L2 cache usage

Hello!

There are a couple of posts claiming that when cudaMemcpy is used for host-device transfers, the data goes through the GPU L2 cache:

https://forums.developer.nvidia.com/t/cudamemcpy-and-l2-cache/42817

https://stackoverflow.com/questions/34005844/gpu-l2-cache-hit-is-100-and-dram-load-transactions-sometimes-is-0

I was wondering whether that is also the case when cudaMemcpy is used in DeviceToDevice mode on the same GPU, i.e., using cudaMemcpy to copy from array A to array B, with both A and B residing in that GPU's memory.
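
To make the case concrete, here is a minimal sketch of the kind of copy I mean. The names d_A, d_B and copy_device_to_device_ok are just placeholders for illustration, and the array size is arbitrary. It only sets up the scenario (allocate A and B on the current GPU, copy A to B with cudaMemcpyDeviceToDevice, check the result on the host); it does not show anything about how the L2 cache is involved.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: copies n floats from device array A to device array B
// on the current GPU, then verifies the contents of B on the host.
bool copy_device_to_device_ok(size_t n) {
    std::vector<float> h_src(n), h_dst(n, 0.0f);
    for (size_t i = 0; i < n; ++i) h_src[i] = static_cast<float>(i);

    float *d_A = nullptr, *d_B = nullptr;
    const size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d_A, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_B, bytes) != cudaSuccess) { cudaFree(d_A); return false; }

    bool ok =
        // Fill A from the host so there is known data to copy.
        cudaMemcpy(d_A, h_src.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        // The copy in question: A -> B, both in the same GPU's global memory.
        cudaMemcpy(d_B, d_A, bytes, cudaMemcpyDeviceToDevice) == cudaSuccess &&
        // Read B back so the result can be checked on the host.
        cudaMemcpy(h_dst.data(), d_B, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_A);
    cudaFree(d_B);
    return ok && h_src == h_dst;
}

int main() {
    const bool ok = copy_device_to_device_ok(1 << 20);
    std::printf("device-to-device copy %s\n", ok ? "succeeded" : "failed");
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

This should build with something like nvcc d2d_copy.cu -o d2d_copy (the file name is made up). My question is about what happens inside the middle cudaMemcpy call.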

Thank you for your help!