Is memcpy from pinned host memory to GPU memory inside a kernel performed by a DMA engine?

Hi,

Is memcpy from pinned host memory to GPU memory inside a kernel performed by a DMA engine? If so, the corresponding SM would be free to switch to another warp and perform some arithmetic operations in the meantime.

No, it is not. (Just study the PTX or SASS to convince yourself.) An in-kernel memcpy compiles to LD and ST instructions, which are carried out by the SM, its load/store unit (LSU), and the memory controller, which basically issues PCIe cycles directly.
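If you want to look at the generated code yourself, here is a minimal sketch of a test case. It assumes a 64-bit system where mapped (zero-copy) pinned memory is available. The kernel name `memcpy_kernel`, the `CHUNK` size, and the `check` helper are made up for illustration and do not come from any library.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

constexpr int CHUNK = 4;  // ints copied per thread (arbitrary illustrative value)

// Each thread copies one small chunk with in-kernel memcpy().
// src points at pinned host memory, dst at ordinary device memory.
__global__ void memcpy_kernel(const int *src, int *dst, int n)
{
    int first = (blockIdx.x * blockDim.x + threadIdx.x) * CHUNK;
    if (first + CHUNK <= n)
        memcpy(dst + first, src + first, CHUNK * sizeof(int));
}

int main()
{
    const int n = 1024;  // multiple of CHUNK, so no tail handling is needed here
    const size_t bytes = n * sizeof(int);

    // Request host-memory mapping before any other CUDA call.
    check(cudaSetDeviceFlags(cudaDeviceMapHost), "cudaSetDeviceFlags");

    // Pinned host buffer that the GPU can address directly.
    int *h_src = nullptr;
    check(cudaHostAlloc((void **)&h_src, bytes, cudaHostAllocMapped), "cudaHostAlloc");
    for (int i = 0; i < n; ++i) h_src[i] = i;

    // Device-side alias of the pinned host buffer.
    int *d_src_alias = nullptr;
    check(cudaHostGetDevicePointer((void **)&d_src_alias, h_src, 0),
          "cudaHostGetDevicePointer");

    int *d_dst = nullptr;
    check(cudaMalloc((void **)&d_dst, bytes), "cudaMalloc");

    const int threads = 256;
    const int blocks = (n / CHUNK + threads - 1) / threads;
    memcpy_kernel<<<blocks, threads>>>(d_src_alias, d_dst, n);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");

    // Copy the result back and compare against the source.
    int *h_check = (int *)malloc(bytes);
    check(cudaMemcpy(h_check, d_dst, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy");
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_check[i] != h_src[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    free(h_check);
    check(cudaFree(d_dst), "cudaFree");
    check(cudaFreeHost(h_src), "cudaFreeHost");
    return mismatches == 0 ? 0 : 1;
}
```

Assuming the file is saved as `copy.cu`, `nvcc -ptx copy.cu` emits the PTX, and `cuobjdump -sass` on the compiled executable dumps the SASS. Look for the load and store instructions in the body of `memcpy_kernel`. The exact mnemonics depend on the GPU architecture and compiler version.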

Nearly all activities (i.e., machine instructions) within the SM are “fire and forget”. They are pipelined and delivered to a unit that handles them, which produces the result some time later, freeing the SM to issue some other instruction in the very next cycle.

If you have an instruction that involves a transaction to global memory (and pinned host memory is in the global space), there is no reason to think that the SM is not free to switch to another warp and perform some arithmetic operations in subsequent instruction cycles.

Also, I wouldn’t use memcpy() for bulk movement of data from global to global or from global to shared, except in certain narrow cases. A standard block-stride copy construct is the way to go for efficiency. Regardless of whether you write your own code or use in-kernel memcpy, it will not be handled by a DMA engine.
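For reference, here is a rough sketch of what a block-stride copy might look like. It is one possible reading of the construct, not a canonical implementation. The names `block_stride_copy` and `tile_copy` and the tile and block sizes are illustrative assumptions. Each block stages its tile in shared memory and writes it back out, so both the global-to-shared and shared-to-global directions are shown.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Block-stride copy: the threads of one block cooperatively copy n elements.
// Adjacent threads touch adjacent elements, so global accesses can coalesce.
__device__ void block_stride_copy(int *dst, const int *src, int n)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        dst[i] = src[i];
}

// Each block copies one tile: global -> shared, then shared -> global.
__global__ void tile_copy(const int *src, int *dst, int n, int tile)
{
    extern __shared__ int smem[];
    int base = blockIdx.x * tile;
    int count = min(tile, n - base);  // the last tile may be partial
    if (count <= 0) return;           // same decision for every thread in the block

    block_stride_copy(smem, src + base, count);
    __syncthreads();                  // make the staged tile visible to the whole block
    block_stride_copy(dst + base, smem, count);
}

int main()
{
    const int n = 1000;      // deliberately not a multiple of the tile size
    const int tile = 256;    // elements per block (illustrative)
    const int threads = 128; // fewer threads than elements, so the loop strides
    const int blocks = (n + tile - 1) / tile;
    const size_t bytes = n * sizeof(int);

    std::vector<int> h_src(n), h_dst(n, -1);
    for (int i = 0; i < n; ++i) h_src[i] = 3 * i + 1;

    int *d_src = nullptr, *d_dst = nullptr;
    check(cudaMalloc((void **)&d_src, bytes), "cudaMalloc src");
    check(cudaMalloc((void **)&d_dst, bytes), "cudaMalloc dst");
    check(cudaMemcpy(d_src, h_src.data(), bytes, cudaMemcpyHostToDevice), "cudaMemcpy H2D");

    tile_copy<<<blocks, threads, tile * sizeof(int)>>>(d_src, d_dst, n, tile);
    check(cudaGetLastError(), "kernel launch");
    check(cudaMemcpy(h_dst.data(), d_dst, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy D2H");

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_dst[i] != h_src[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    check(cudaFree(d_src), "cudaFree src");
    check(cudaFree(d_dst), "cudaFree dst");
    return mismatches == 0 ? 0 : 1;
}
```

The loads and stores in `block_stride_copy` are ordinary instructions issued by the SM, just as with in-kernel memcpy. The difference is only in how the work is spread across the threads of the block.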

Many thanks @Robert_Crovella !
