Ordering of cudaMemcpyAsync issued to separate streams on Jetson AGX

Hi, this is a copy of something I posted in the “CUDA Programming and Performance” forum; I was advised there that I might get better support here from people more familiar with Jetson-specific quirks.

I’m trying to find references and documentation about how asynchronous data transfers are scheduled and executed, in particular when they are issued to separate streams. I’ve found plenty of references to the basics: cudaDeviceProp.asyncEngineCount, how a value of 1 is required to overlap a kernel with a transfer, and how a value of 2 is required to overlap a kernel with both an upload and a download. However, I’ve not found any reference yet to what order data transfers are executed in when they are going in the same direction but belong to separate streams.
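(For anyone following along, that property comes from the standard device query; the snippet below is just a minimal sketch of how to read it, nothing specific to my setup:)

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0
    // asyncEngineCount == 1: a transfer in one direction can overlap kernel execution.
    // asyncEngineCount == 2: uploads and downloads can both overlap kernel execution.
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);
    return 0;
}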

The exact issue I’m up against is as follows. I’m using CUDA 10 with Ubuntu 18.04 on a Jetson AGX. From what I can tell this is the Volta architecture, with some odd quirks, like not supporting “concurrent managed access” when using Unified Memory or bi-directional data transfers overlapping with kernel execution. I’ve just recently diagnosed an issue in my code where cudaMemcpyAsync calls were always being satisfied in the order they were originally issued, despite being issued by separate host threads and being directed at separate streams. The way I originally had things written, cudaMemcpyAsync calls were made long in advance, and the result was a near-perfect failure to overlap data transfer and computation. Most of the time, if a given stream finished its kernel execution and was ready to download the results, it would instead sit idle despite there being no active data transfers. After carefully trawling through profiler timelines it became apparent that each instance of this occurred because the memcopy was waiting on a copy previously issued to another stream that had not completed (or even started) yet.

As an aside, yes, I’m aware that memory on the Jetson is shared between the host and GPU, and I could write code that doesn’t require memcpy at all. My team has not yet decided on our target GPU hardware, and while the Jetson is a candidate, we will soon evaluate other options. Some of the code I’m writing now consists merely of prototypes designed to eventually run on other types of systems. I don’t know whether this issue is particular to the Jetson or not.
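(To illustrate what I mean by “doesn’t require memcpy at all”: the rough sketch below uses mapped, pinned host memory so a kernel reads and writes the host allocation directly. The buffer size N, SomeKernel, and the launch configuration are placeholders, not my real code.)

    // Zero-copy alternative on shared-memory systems like the Jetson (rough sketch):
    // the kernel dereferences the host allocation directly, so no cudaMemcpyAsync is
    // needed and the copy-engine ordering question goes away entirely.
    cudaSetDeviceFlags(cudaDeviceMapHost);       // must be set before other CUDA work
    int* h_buf = nullptr;
    cudaHostAlloc(&h_buf, N * sizeof(int), cudaHostAllocMapped);
    int* d_buf = nullptr;
    cudaHostGetDevicePointer(&d_buf, h_buf, 0);  // device-visible alias of h_buf
    SomeKernel<<<blocks, threads>>>(d_buf);      // placeholder kernel and launch config
    cudaDeviceSynchronize();
    cudaFreeHost(h_buf);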

Anyway, I’ve recently made a reddit post detailing the issue, and someone else responded demonstrating that their system does not behave the way I observe. So I’m interested in learning what the actual constraints are and how to determine the relevant scheduling capabilities of different systems. The reddit post contains a minimal working code example as well as results from the Nsight profiler. I failed to see any forum rules that might forbid posting external links, so I’ll link it directly here. If that’s an issue, though, I can remove the link and paste the code here directly.

In my experience with all the Jetson models (TK1, TX1, TX2 and Xavier), because there is a single copy engine (asyncEngineCount = 1), the copies can only execute in the order that they were issued, regardless of which stream or process they are associated with. So, as much as possible, you will want to issue the copies, in either direction, in the order that you anticipate they will be needed, across all streams.

Thanks for the response TroyK. I was aware that I only have a single copy engine, and did originally suspect that was the underlying issue. However, two things:

  1. The only documentation I managed to find for that specific parameter concerned overlapping compute with transfer, or using bi-directional transfers. I’ve not seen anything that implies two asynchronous transfers on two separate streams would have a specific ordering, just that they cannot run simultaneously. If you’re aware of more detailed documentation I’d love to be pointed at it.

  2. In the reddit thread I mentioned, another user ran the same code, and the two downloads did execute in the order they were ready, not necessarily the order they were issued. They were running on their local laptop, which also has a single copy engine. It’s this data point that makes me most suspect this is a Jetson-specific quirk (which I’d love confirmation of), and not something tied directly to the number of copy engines.

Just so you don’t have to trawl through links to find what I’m talking about, here are the profiler results for the two different cards, along with the code being run (I should have included this originally instead of just linking, sorry):

Jetson AGX : (Imgur screenshot)
GeForce 940MX : https://i.imgur.com/Y3a3VSZ.png
(The relative width of the compute/transfer segments is very different in these two images because of the wildly different hardware, but it’s the order the operations occurred in that I’m concerned with.)

Note that in the following code the stream 14 DtoH transfer is requested before the stream 15 DtoH, despite the stream 15 kernel being designed to complete first. On the Jetson, stream 15 waits to grab its data until stream 14 finishes, despite sitting idle the whole time. On the GeForce, stream 15 instead manages to re-order the transfers, despite that card also having only a single copy engine.

#include <algorithm>
#include <cuda_runtime.h>

// Dummy workload: each block repeatedly sums a 64x64 tile of 'in'; 'reps' scales the runtime.
__global__ void DummyKernel(int* in, size_t reps, int* out)
{
    size_t offset = 64*64*blockIdx.x;
    size_t val = 0;
    for (size_t r = 0; r < reps; ++r)
    {
        for (size_t i = 0; i < 64; ++i)
        {
            val = val + in[offset + i * 64 + threadIdx.x];
        }
    }
    if (threadIdx.x == 0) out[blockIdx.x] = val;
}

int main()
{
    int* h_in1, *h_in2, *h_out1, *h_out2;
    int* d_in1, *d_in2, *d_out1, *d_out2;

    cudaMallocHost(&h_in1, 5000*64*64*sizeof(int));
    cudaMallocHost(&h_in2, 5000*64*64*sizeof(int));
    cudaMallocHost(&h_out1, 5000*64*64*sizeof(int));
    cudaMallocHost(&h_out2, 5000*64*64*sizeof(int));
    cudaMalloc(&d_in1, 5000*64*64*sizeof(int));
    cudaMalloc(&d_in2, 5000*64*64*sizeof(int));
    cudaMalloc(&d_out1, 5000*64*64*sizeof(int));
    cudaMalloc(&d_out2, 5000*64*64*sizeof(int));

    std::fill(h_in1, h_in1+5000*64*64, 1);
    std::fill(h_in2, h_in2+5000*64*64, 1);

    cudaStream_t s14, s15;
    cudaStreamCreate(&s14);
    cudaStreamCreate(&s15);

    // Schedule both transfers to the device.  There is only a single copy engine, so
    // these two commands will execute sequentially, but thanks to the separate streams
    // the s14 upload will end up overlapping with the s15 compute.
    cudaMemcpyAsync(d_in2, h_in2, 5000*64*64*sizeof(int), cudaMemcpyHostToDevice, s15);
    cudaMemcpyAsync(d_in1, h_in1, 5000*64*64*sizeof(int), cudaMemcpyHostToDevice, s14);

    // Launch both kernels.  These could technically overlap with each other, though
    // I've chosen enough blocks that either one can saturate the card.  They will
    // only overlap briefly while the first kernel is winding down and can no longer
    // saturate the card's compute capability.
    DummyKernel<<<5000, 64, 0, s15>>>(d_in2, 10, d_out2);
    DummyKernel<<<5000, 64, 0, s14>>>(d_in1, 100, d_out1);

    // Download the results to the host.  If you read the code above carefully you may
    // have noticed that we request the s14 transfer first, but the s15 compute is much
    // shorter and will finish first.  You might think the CUDA runtime could notice
    // this and do the s15 transfer first, since these are separate streams and there
    // is no dependence between them, but it won't!  Transfers seem to be fulfilled in
    // request order even if that results in sub-optimal scheduling.
    cudaMemcpyAsync(h_out1, d_out1, 5000*64*64*sizeof(int), cudaMemcpyDeviceToHost, s14);
    cudaMemcpyAsync(h_out2, d_out2, 5000*64*64*sizeof(int), cudaMemcpyDeviceToHost, s15);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s14);
    cudaStreamDestroy(s15);

    cudaFreeHost(h_in1);
    cudaFreeHost(h_in2);
    cudaFreeHost(h_out1);
    cudaFreeHost(h_out2);
    cudaFree(d_in1);
    cudaFree(d_in2);
    cudaFree(d_out1);
    cudaFree(d_out2);
}
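
For completeness: if issue order across streams really is what governs the single copy engine on the Jetson, then applying TroyK's advice to this toy example would just mean requesting the download for the shorter kernel's stream first, i.e. swapping the last two cudaMemcpyAsync calls (untested beyond this example):

    // Workaround sketch: issue the DtoH copies in the order the kernels are expected
    // to finish, not the order the streams happen to be listed in.
    cudaMemcpyAsync(h_out2, d_out2, 5000*64*64*sizeof(int), cudaMemcpyDeviceToHost, s15);
    cudaMemcpyAsync(h_out1, d_out1, 5000*64*64*sizeof(int), cudaMemcpyDeviceToHost, s14);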

Your experience on the Jetson matches mine. As far as I know, the sequential copies have been an undocumented “feature” of all the Jetson products. And, as you mentioned, the lack of the expected unified memory concurrent access on the Xavier also seems to be undocumented (https://devtalk.nvidia.com/default/topic/1044067/jetson-agx-xavier/unified-memory-concurrent-access/post/5296756/#5296756). You’re right, the Jetson has a few “quirks” that the user will encounter. But overall, I must say, the Xavier is an impressive piece of hardware.