global-global memory transfers

hello,

if i need to transfer data from global memory to global memory, based on (block) results, as part of a gather operation (in turn part of consolidating results across multiple kernel blocks), which would be better:

to have the kernel that does the consolidation use dynamic parallelism to issue a number of memory copies (cudaMemcpyDeviceToDevice), or to have the kernel issue actual instructions: global[a] = global[b]?

i am not entirely sure what something like global[a] = global[b] would imply at the instruction level - would the data actually have to enter the SM, etc.?

i would rather have the device do the consolidation than the host, and a kernel would have to do it, as the transfers would be based on counts and offsets that must be read in

As far as I am aware cudaMemcpy(…, cudaMemcpyDeviceToDevice) is implemented as a kernel, or possibly a family of kernels. You can easily see from the disassembly of an executable (use cuobjdump --dump-sass) what code gets emitted for global[a] = global[b]: an LD or LDG instruction followed by a dependent ST instruction.
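For illustration, a trivial kernel along these lines (names made up) makes that visible:

__global__ void copy_one(float *global, int a, int b)
{
    // disassembles to an LDG from global[b] followed by a dependent ST to global[a]
    global[a] = global[b];
}

Compile it with nvcc and run cuobjdump --dump-sass on the resulting executable to see the emitted instructions.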

“As far as I am aware cudaMemcpy(…, cudaMemcpyDeviceToDevice) is implemented as a kernel, or possibly a family of kernels”

seems you are right: according to the programming guide:

“Note that this may also apply to cudaMemcpyAsync(), which might itself generate a kernel launch.”

but it seems to be conditional (may/might), and i do not really understand why, or when

this is the only clue i can get:

“Note that calls to cudaMemcpyAsync() or cudaMemsetAsync() may invoke new child kernels on the device in order to preserve stream semantics”

which is not very helpful - it doesn’t say much

I would always settle for instructions. Dynamic parallelism appears to be rather slow, while with your own move instructions you can easily reach device memory bandwidth without much optimization effort.
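For instance, a plain grid-stride copy kernel along the lines of the sketch below (untuned, names assumed) usually comes close to device memory bandwidth for large transfers:

__global__ void copy_kernel(float *dst, const float *src, size_t n)
{
    // each thread copies elements a full grid apart, so loads and stores stay coalesced
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}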

yes, it seems that dynamic parallelism is simultaneously permissive and restrictive, with far more fine print than i could have imagined

i still do not understand why device calls to cudaMemcpy need to invoke child kernels

i was much hoping that, through the mechanisms of dynamic parallelism, i could issue cudaMemcpyDeviceToDevice from the device in the same way as cudaMemcpyDeviceToHost or cudaMemcpyHostToDevice, thereby altogether circumventing the need to involve warps for something as trivial as moving data in global memory - asynchronously, i may add

i think i must reconsider using the host for the consolidation

Implementing memcpy() as a library function on the device needs to involve some form of reorganization of the work, as each thread copying its own memory would be extremely inefficient. Device side kernel launches provide for this kind of reassignment of work to threads, but are themselves inefficient.

Providing this kind of service in a sufficiently general manner is not easy, which is why writing your own specialized code usually beats library routines.

I am not sure why this info leads you to conclude you might need to use the host instead, which probably is less efficient.

i need to consolidate results periodically, but not necessarily synchronously - i just need to know that, at a certain point, it is done, in order to carry on

if device-side ‘blunt’ memory copies are that inefficient (the general picture i have now formed), i might as well copy the data to the host, have it issue other work to the device in the meantime whilst it references the copied data, and issue the corresponding memory transfers from the host accordingly
i would then maintain overall host/device synchronization with events
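something along these lines is what i have in mind (just a sketch - the names, pointer layout and element type are assumptions):

#include <vector>
#include <cuda_runtime.h>

// host-driven consolidation: read back the counts/offsets, issue the device-to-device
// copies from the host, and fence the whole batch with an event
void consolidate_on_host(float *d_perm, const float *d_temp,
                         const int *d_count, const int *d_srcOff, const int *d_dstOff,
                         int nBlocks, cudaStream_t stream, cudaEvent_t done)
{
    std::vector<int> count(nBlocks), srcOff(nBlocks), dstOff(nBlocks);
    // (readbacks to pageable vectors work, but would need pinned host memory to be truly asynchronous)
    cudaMemcpyAsync(count.data(),  d_count,  nBlocks * sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaMemcpyAsync(srcOff.data(), d_srcOff, nBlocks * sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaMemcpyAsync(dstOff.data(), d_dstOff, nBlocks * sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);               // the host needs the counts/offsets before issuing copies
    for (int i = 0; i < nBlocks; ++i)
        cudaMemcpyAsync(d_perm + dstOff[i], d_temp + srcOff[i],
                        count[i] * sizeof(float), cudaMemcpyDeviceToDevice, stream);
    cudaEventRecord(done, stream);               // other streams can cudaStreamWaitEvent() on this
}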

this should be more efficient than doing the transfers from the device, not so?

It very much depends. Device memory bandwidth is 10x to 100x larger than PCIe bandwidth. Latency also is significantly better.
If you can make use of resources that would otherwise go idle by scheduling a few short DMA transfers, then host-side result consolidation might be “almost free”. But if the amount of resources needed is large or contended for, device side might perform a lot better.

Host<->device transfers use DMA engines. To my recollection, while these DMA engines could be used for device<->device transfers, they cannot saturate device bandwidth, have fairly high setup overhead that affects small transfers, and have only limited flexibility when it comes to strided transfers and the like. Because PCIe links operate at much lower performance than device memory, this is not much of an issue there.

So early in the life of CUDA it turned out that using copy kernels to accomplish device-to-device copies is the way to go. This is really no different from CPUs where DMA engines exist, but system memory copies are accomplished through memcpy(), bcopy(), memmove() library or system functions that use code to move the data around. Usually those API calls map to multiple highly optimized kernels, depending on the specifics of the transfer (at least they did on the SPARC platform I worked on years ago).

For any custom copy operation involving scatter / gather in device memory [as I understand you are contemplating] it would be best to write a custom copy kernel for maximum flexibility and performance.
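As a sketch of what such a kernel could look like (names assumed; one block gathers one sub-array, with the counts and offsets read from global memory):

__global__ void gather_blocks(float *dst, const float *src,
                              const int *srcOff, const int *dstOff, const int *count)
{
    int n = count[blockIdx.x];                  // number of elements in this sub-array
    const float *s = src + srcOff[blockIdx.x];  // where the sub-array sits in the temporary array
    float       *d = dst + dstOff[blockIdx.x];  // where it belongs in the permanent array
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        d[i] = s[i];                            // the same LD/ST pair, spread across the block's threads
}

Launched with one block per sub-array, e.g. gather_blocks<<<numSubArrays, 256>>>(d_perm, d_temp, d_srcOff, d_dstOff, d_count).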

the data is too much for a single-array / single-iteration setup; hence, i use multiple iterations and a limited number of arrays

consolidation involves moving (gathering) sub-arrays (blocks of data) out of the temporary arrays into a more permanent array, after a predicate has been applied, so that the temporary array is cleared for reuse

there is some time span before the temporary array must be cleaned for reuse; i doubt whether i am already putting much pressure on the PCIe link, and i am rather confident that the majority of the work done by the device would be memory bound already, without it having to mind the consolidation transfers (via kernels) as well

hence, even though less ‘powerful’, the copy engines may be more fitting than copy kernels in this case, as they seem to imply potentially better resource balancing
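as a rough sketch of routing the consolidation traffic through the copy engines in a separate stream (all names are assumptions; h_staging would have to be pinned memory for the DMA engine to handle the transfer asynchronously):

#include <cuda_runtime.h>

__global__ void main_work(float *data, size_t n)   // stand-in for the memory-bound main work
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// consolidation traffic goes through its own stream, so a copy engine moves it
// while the compute stream keeps the SMs busy
void overlap_sketch(float *d_work, size_t nWork, const float *d_temp,
                    float *h_staging /* pinned */, size_t bytes)
{
    cudaStream_t compute, copies;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copies);
    cudaEvent_t consolidated;
    cudaEventCreateWithFlags(&consolidated, cudaEventDisableTiming);

    main_work<<<(unsigned)((nWork + 255) / 256), 256, 0, compute>>>(d_work, nWork);
    cudaMemcpyAsync(h_staging, d_temp, bytes, cudaMemcpyDeviceToHost, copies);  // DMA engine, overlaps the kernel
    cudaEventRecord(consolidated, copies);
    cudaStreamWaitEvent(compute, consolidated, 0);  // later work that depends on the copied data waits here

    cudaStreamSynchronize(compute);
    cudaEventDestroy(consolidated);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copies);
}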