Is there a way to do a memcpy from one device to another, avoiding the intermediate GPU->CPU and CPU->GPU transfers?
No
Even though you can't copy directly, you can allocate the intermediate CPU buffer as page-locked (pinned) memory that both GPU contexts treat as pinned, so the GPU->CPU and CPU->GPU legs both run at full transfer speed. Search for cudaHostAllocPortable in the programming guide.
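For illustration, here is a minimal sketch of that staging approach using the CUDA runtime API. The helper names `check` and `copyViaPinnedHost`, the buffer size, and the device indices are assumptions made up for this example, not anything from the question. It stages an `int` array through one host buffer allocated with `cudaHostAlloc` and the `cudaHostAllocPortable` flag, then reads the data back to verify it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed CUDA call and return false.
static bool check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical helper: copy `count` ints from srcDev to dstDev through a
// portable pinned host buffer. Returns true if the data arrived intact.
bool copyViaPinnedHost(int srcDev, int dstDev, size_t count) {
    const size_t bytes = count * sizeof(int);
    std::vector<int> input(count), output(count, 0);
    for (size_t i = 0; i < count; ++i) input[i] = static_cast<int>(i);

    int *staging = nullptr, *dSrc = nullptr, *dDst = nullptr;
    bool ok = true;

    // Portable pinned memory is treated as page-locked by every CUDA context.
    ok = ok && check(cudaSetDevice(srcDev), "cudaSetDevice(src)");
    ok = ok && check(cudaHostAlloc((void **)&staging, bytes, cudaHostAllocPortable),
                     "cudaHostAlloc");
    ok = ok && check(cudaMalloc((void **)&dSrc, bytes), "cudaMalloc(src)");
    ok = ok && check(cudaMemcpy(dSrc, input.data(), bytes, cudaMemcpyHostToDevice),
                     "upload test data");

    // Leg 1: source GPU -> pinned host buffer.
    ok = ok && check(cudaMemcpy(staging, dSrc, bytes, cudaMemcpyDeviceToHost),
                     "GPU->CPU copy");

    // Leg 2: pinned host buffer -> destination GPU.
    ok = ok && check(cudaSetDevice(dstDev), "cudaSetDevice(dst)");
    ok = ok && check(cudaMalloc((void **)&dDst, bytes), "cudaMalloc(dst)");
    ok = ok && check(cudaMemcpy(dDst, staging, bytes, cudaMemcpyHostToDevice),
                     "CPU->GPU copy");

    // Read back from the destination GPU to verify the round trip.
    ok = ok && check(cudaMemcpy(output.data(), dDst, bytes, cudaMemcpyDeviceToHost),
                     "download result");
    ok = ok && (input == output);

    // Clean up; freeing a null pointer is a no-op.
    cudaFree(dDst);
    cudaSetDevice(srcDev);
    cudaFree(dSrc);
    cudaFreeHost(staging);
    return ok;
}

int main() {
    int deviceCount = 0;
    if (!check(cudaGetDeviceCount(&deviceCount), "cudaGetDeviceCount") ||
        deviceCount == 0) {
        return 1;
    }
    // With a single GPU the sketch still runs, using device 0 for both ends.
    const int dstDev = (deviceCount > 1) ? 1 : 0;
    const bool ok = copyViaPinnedHost(0, dstDev, 1 << 20);
    std::printf("copy %s\n", ok ? "succeeded" : "failed");
    return ok ? 0 : 1;
}
```

The blocking `cudaMemcpy` calls keep the sketch simple. Because the staging buffer is pinned, it could presumably also be used with `cudaMemcpyAsync` and streams to overlap the two legs on large transfers, but that is not shown here.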