I should have explained everything in detail from the start; sorry for the confusion in this topic.
Every device, the CPU included, first copies its input to local memory (the CPU to a separate array, each GPU to its own device memory). After computing, they all write their partial results back to the same host array, at non-overlapping addresses but overlapping in time.
So there are multiple cuMemcpyDtoHAsync calls, plus copies in the opposite direction, in flight concurrently. During the HtoD copies the CPU is also reading the host array, acting as another compute unit.
I’m trying to make sure this won’t be a problem when N GPUs memcpy concurrently to different locations, or concurrently read from host to device, while the CPU, acting as a helper compute unit, simultaneously reads into another array. No direct (zero-copy) access, only async memcpy.
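To make the intended pattern concrete, here is a minimal sketch (runtime API for brevity, though the driver-API cuMemcpyHtoDAsync/cuMemcpyDtoHAsync calls are equivalent). The names N_GPUS, SLICE, and the kernel are illustrative, not from my actual application. Each GPU gets its own stream and its own disjoint slice of one pinned host array, and the CPU works on the last slice concurrently:

```cuda
// Sketch: N GPUs stage disjoint slices of one pinned host array to device
// memory, compute, and copy partial results back asynchronously, while the
// CPU processes its own slice at the same time. Error checking omitted.
#include <cuda_runtime.h>

#define N_GPUS 2
#define SLICE  (1 << 20)   // elements per compute unit (illustrative)

__global__ void compute(double *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0;     // placeholder for the real work
}

int main(void) {
    double *host;
    // cudaHostAllocPortable pins the buffer for ALL CUDA contexts/devices,
    // not just the one current at allocation time.
    cudaHostAlloc(&host, (N_GPUS + 1) * SLICE * sizeof(double),
                  cudaHostAllocPortable);

    double *dev[N_GPUS];
    cudaStream_t s[N_GPUS];
    for (int g = 0; g < N_GPUS; ++g) {
        cudaSetDevice(g);
        cudaStreamCreate(&s[g]);
        cudaMalloc(&dev[g], SLICE * sizeof(double));
        // Each GPU touches only its own disjoint slice of the host array.
        cudaMemcpyAsync(dev[g], host + g * SLICE, SLICE * sizeof(double),
                        cudaMemcpyHostToDevice, s[g]);
        compute<<<(SLICE + 255) / 256, 256, 0, s[g]>>>(dev[g], SLICE);
        cudaMemcpyAsync(host + g * SLICE, dev[g], SLICE * sizeof(double),
                        cudaMemcpyDeviceToHost, s[g]);
    }

    // The CPU acts as a helper compute unit on the final slice, concurrently
    // with the in-flight async copies; addresses are disjoint, so there is
    // no data race.
    for (int i = N_GPUS * SLICE; i < (N_GPUS + 1) * SLICE; ++i)
        host[i] *= 2.0;

    for (int g = 0; g < N_GPUS; ++g) {
        cudaSetDevice(g);
        cudaStreamSynchronize(s[g]);
        cudaFree(dev[g]);
        cudaStreamDestroy(s[g]);
    }
    cudaFreeHost(host);
    return 0;
}
```

The key point of the sketch is that concurrency is only between disjoint address ranges; within one slice, the stream serializes copy-in, kernel, and copy-out in order.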
The documentation says pinning gives better performance but the amount of pinnable memory is limited. My applications will need gigabytes of data copied to all devices, including the CPU as a co-processor alongside CUDA (outside the CUDA context, of course). For example, the Quadros have roughly 100 GFLOPS of double-precision throughput while the CPU has roughly 300 GFLOPS.
I think I misused the word "access" here; I was implicitly thinking of memcpy. Yes, I am reading the documentation in the meantime.
At the moment I only have access to a single GRID GPU in the cloud, but I will have multiple Quadros in the future.
Thank you for your time.