So does each thread have to allocate its own pinned memory in order to perform asynchronous copies, or does it not matter which thread does the allocation?
For example, in my program the buffers are allocated during initialization, before the threads are created, and each thread then uses its own buffer to do its work on the corresponding GPU. Will each thread’s copies be asynchronous?
I’m asking because another topic in this forum says that each thread has to do the allocation on its own for the copies to be asynchronous. Is that true?
From what I have seen and read here, you have to allocate pinned memory in the thread where you want to use it. You can still use the buffer from any other thread, but you will not get the performance benefit if you do.
If you want the fast transfers, I think you have to allocate the memory in the thread you bind the CUDA context to.
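A rough sketch of the pattern described above, in case it helps: each worker thread first binds to its GPU and only then allocates its own pinned buffer, so the allocation happens in the same thread (and context) that issues the asynchronous copy. The names `worker` and `run_on_all_devices` are made up for illustration, and the buffer size is arbitrary; this is not taken from anyone's actual program.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical worker: one thread per GPU. The pinned buffer is allocated
// AFTER cudaSetDevice, i.e. in the thread that is bound to the device.
static void worker(int device, size_t bytes, int *ok) {
    *ok = 0;
    if (cudaSetDevice(device) != cudaSuccess) return;

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return;

    void *host = nullptr;  // page-locked host buffer owned by this thread
    void *dev = nullptr;   // device buffer on this thread's GPU
    if (cudaMallocHost(&host, bytes) == cudaSuccess &&
        cudaMalloc(&dev, bytes) == cudaSuccess) {
        std::memset(host, 0, bytes);  // give the buffer defined contents
        // Queue the copy on the stream, then wait for it to finish.
        cudaError_t e = cudaMemcpyAsync(dev, host, bytes,
                                        cudaMemcpyHostToDevice, stream);
        if (e == cudaSuccess && cudaStreamSynchronize(stream) == cudaSuccess)
            *ok = 1;
    }
    if (dev) cudaFree(dev);
    if (host) cudaFreeHost(host);
    cudaStreamDestroy(stream);
}

// Starts one worker per visible GPU and returns how many of them succeeded.
int run_on_all_devices(size_t bytes) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count <= 0) return 0;

    std::vector<int> ok(count, 0);
    std::vector<std::thread> threads;
    for (int d = 0; d < count; ++d)
        threads.emplace_back(worker, d, bytes, &ok[d]);
    for (auto &t : threads) t.join();

    int good = 0;
    for (int v : ok) good += v;
    return good;
}
```

Calling `run_on_all_devices(1 << 20)` from `main` would run one 1 MiB copy per GPU. If the buffers really have to be allocated up front in the main thread, the runtime also has `cudaHostAlloc` with the `cudaHostAllocPortable` flag, which is documented as making the allocation count as pinned for all CUDA contexts rather than only the allocating one. Whether that flag is needed in a given setup depends on the CUDA version, so treat it as something to try rather than a confirmed fix.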
This is what I understood from reading similar posts as well. However, this behavior seems odd to me.
From what I have learned so far by programming in a single-GPU environment, the page-locked memory address range is tracked by the cudaMemcpy family of functions, so they can accelerate copies to and from the device and perform them asynchronously.
Why does the page-locked memory have to be allocated by each thread for the memory copies to be asynchronous, assuming that each copy is performed by a different thread?