I observe two HtoD copies running concurrently in my project, as shown in https://i.ibb.co/g4dGZVV/image.png (screenshot from the NVIDIA Visual Profiler).
As far as I know, the Titan X (Pascal) has two copy engines, one for HtoD and one for DtoH memory copies, and two concurrent memory copies in the same direction are not possible due to PCIe limitations. So how is the profiling result above possible?
I learned that the copy engine is (probably) not involved when the data transfer is smaller than 64 KB (https://devtalk.nvidia.com/default/topic/1027316/cuda-programming-and-performance/titan-v-announced-15-0-tflops-fp32-5120-cores-12-gb-hbm2-vram-3000-us-price/post/5226469/#5226469). Does anyone know what the underlying mechanism is?
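For anyone who wants to reproduce the situation, here is a minimal sketch. It is not the code from my project: the function names `copy_engine_count` and `issue_two_htod_copies`, the use of pinned host buffers, and the two-stream layout are all assumptions chosen for illustration. It reads `asyncEngineCount` from `cudaDeviceProp` to see how many copy engines the runtime reports, then enqueues two HtoD copies on separate streams. Whether the two copies actually overlap in the profiler timeline will depend on the hardware, the driver, and the transfer size.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Returns the number of asynchronous copy engines the runtime reports
// for `device`, or -1 if the query fails.
int copy_engine_count(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    std::printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);
    return prop.asyncEngineCount;
}

// Enqueues two host-to-device copies of `bytes` each, one per stream,
// from pinned host memory (an assumption; pageable memory behaves differently).
// Returns true if every CUDA call succeeded.
bool issue_two_htod_copies(size_t bytes) {
    void *h[2] = {nullptr, nullptr};          // pinned host buffers
    void *d[2] = {nullptr, nullptr};          // device buffers
    cudaStream_t s[2] = {nullptr, nullptr};   // one stream per copy
    bool ok = true;

    for (int i = 0; i < 2 && ok; ++i) {
        ok = cudaMallocHost(&h[i], bytes) == cudaSuccess &&
             cudaMalloc(&d[i], bytes) == cudaSuccess &&
             cudaStreamCreate(&s[i]) == cudaSuccess;
        if (ok) std::memset(h[i], i, bytes);  // give the source defined contents
    }

    // Enqueue both copies before synchronizing so they are free to overlap.
    for (int i = 0; i < 2 && ok; ++i) {
        ok = cudaMemcpyAsync(d[i], h[i], bytes,
                             cudaMemcpyHostToDevice, s[i]) == cudaSuccess;
    }
    if (ok) ok = cudaDeviceSynchronize() == cudaSuccess;

    // Release whatever was successfully created.
    for (int i = 0; i < 2; ++i) {
        if (s[i]) cudaStreamDestroy(s[i]);
        if (d[i]) cudaFree(d[i]);
        if (h[i]) cudaFreeHost(h[i]);
    }
    return ok;
}
```

Calling `issue_two_htod_copies(32 * 1024)` and then `issue_two_htod_copies(1024 * 1024)` from `main` under the profiler would let one compare a transfer below the 64 KB threshold mentioned above with a larger one.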