CUDA Graphs: must a memcpy node be 1D to be updated?

Hey!

Stupid question, but for the CUDA Graphs API we have Memcpy3DAsync and MemcpyAsync nodes as options.

But the documentation for cudaGraphExecMemcpyNodeSetParams, the function that updates a 3D memcpy node, says:
“Both the instantiation-time memory operands and the memory operands in pNodeParams must be 1-dimensional”
where pNodeParams is the cudaMemcpy3DParms input.

So, does this mean we can use Memcpy3DAsync in our graph, but we are not allowed to update it afterwards to anything other than the equivalent of a 1D MemcpyAsync?

I mean, come on?!?

CUDA error name: cudaErrorInvalidValue
Call: cudaGraphExecMemcpyNodeSetParams
At: upload_rf: while updating memcpy operation


That text is followed immediately by: “Zero-length operations are not supported.” - so perhaps it’s supposed to say that the operands must be at least 1-dimensional? Just speculating, I’ve not yet done multi-dim copying in a CUDA execution graph.

I tried experimenting a bit, and the call to cudaGraphExecMemcpyNodeSetParams fails if the extent of the copy has height > 1 or depth > 1.

It really does appear only 1-D memory copies have graph update support.
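To make that observation concrete, here is a minimal host-only sketch of a check one could run before attempting an update. The Extent struct is a hypothetical stand-in that mirrors the width/height/depth fields of cudaExtent so the snippet compiles without the CUDA toolkit, and the rule it encodes is only what was observed in this thread plus the quoted "Zero-length operations are not supported" note, not a documented guarantee.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for cudaExtent (same three fields),
// used here so the sketch builds without CUDA headers.
struct Extent {
    std::size_t width;   // bytes per row for linear memory
    std::size_t height;  // rows per slice
    std::size_t depth;   // slices
};

// Assumption based on the experiment in this thread: an update is only
// accepted when the copy is a single, non-empty row.
bool isUpdatable1D(const Extent& e) {
    return e.width > 0 && e.height == 1 && e.depth == 1;
}

// Small usage example: a single row passes, anything taller or deeper does not.
void exampleChecks() {
    assert(isUpdatable1D(Extent{256, 1, 1}));
    assert(!isUpdatable1D(Extent{256, 4, 1}));
    assert(!isUpdatable1D(Extent{256, 1, 2}));
}
```

A caller could use such a predicate to decide up front whether to update the node in place or fall back to another path.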

Assuming both input and output are device accessible, you could write your own 3d copy kernel whose parameters you can then update in the graph.

I considered this solution as well, but my copies are from host to device.

The workaround I use is to manually split each copy into several 1D copy operations. However, I see reduced H2D throughput (around 5 GB/s vs 6 GB/s), and since updating nodes takes some time, this method is slower than a normal Memcpy3D once height*depth exceeds roughly 4.
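The splitting described here could look roughly like the following host-only sketch, which turns one pitched 3D region into a list of contiguous row copies. All type and function names (RowCopy, Copy3D, splitIntoRows, copyRows) are hypothetical, and std::memcpy stands in for what would be one 1D memcpy node per row in an actual graph; this is not the poster's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// One contiguous row: byte offsets into source and destination, plus length.
struct RowCopy {
    std::size_t srcOffset;
    std::size_t dstOffset;
    std::size_t bytes;
};

// Hypothetical description of a pitched 3D copy (illustrative, not a CUDA type).
struct Copy3D {
    std::size_t widthBytes;    // bytes copied per row
    std::size_t height;        // rows copied per slice
    std::size_t depth;         // slices copied
    std::size_t srcPitch;      // bytes between row starts in the source
    std::size_t srcSliceRows;  // allocated rows per slice in the source
    std::size_t dstPitch;      // bytes between row starts in the destination
    std::size_t dstSliceRows;  // allocated rows per slice in the destination
};

// Enumerate height*depth row copies, each of which is a plain 1D copy.
std::vector<RowCopy> splitIntoRows(const Copy3D& c) {
    assert(c.widthBytes <= c.srcPitch && c.widthBytes <= c.dstPitch);
    assert(c.height <= c.srcSliceRows && c.height <= c.dstSliceRows);
    std::vector<RowCopy> rows;
    rows.reserve(c.height * c.depth);
    for (std::size_t z = 0; z < c.depth; ++z) {
        for (std::size_t y = 0; y < c.height; ++y) {
            rows.push_back({(z * c.srcSliceRows + y) * c.srcPitch,
                            (z * c.dstSliceRows + y) * c.dstPitch,
                            c.widthBytes});
        }
    }
    return rows;
}

// Host-only stand-in for issuing the copies; in a graph, each entry would
// become (or update) its own 1D memcpy node.
void copyRows(const std::vector<RowCopy>& rows,
              const unsigned char* src, unsigned char* dst) {
    for (const RowCopy& r : rows) {
        std::memcpy(dst + r.dstOffset, src + r.srcOffset, r.bytes);
    }
}
```

The number of entries grows as height*depth, which is consistent with the per-node update overhead mentioned above becoming the dominant cost for larger regions.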

The best scenario would be to run a Memcpy3D outside the graph when height*depth>4 and synchronize with it using cudaEventWaitExternal, but I could not get that to work.