When using the CUDA driver API, cuMemcpyDtoHAsync nicely maps to a different hardware queue than cuMemcpyHtoDAsync. As a result, bidirectional transfers are effectively full-duplex.
In contrast, the transfer queue exposed by Vulkan appears to only be able to access the same general-purpose copy queue that cuMemcpyHtoDAsync uses. So uploads effectively block downloads, and vice versa, even when the copies are nicely batched by transfer direction.
In a naive test, copying from a buffer allocated in heap 1 to an image in heap 0, using copyBufferToImage followed by a pipeline barrier, always results in the work being scheduled on the general-purpose queue.
The result is the same when copying straight from a (host-visible) buffer to a (device-local) buffer, using only a buffer memory barrier in the same command buffer, with both buffers even placed in dedicated allocations.
This raises the question: is that optimization implemented at all? Or is overlapping memory transfers simply not a considered use case?