Ordering of cudaMemcpyAsync issued to separate streams


I’m trying to find references and documentation about how asynchronous data transfers are scheduled and executed, in particular when they are issued to separate streams. I’ve found plenty of references to the basics, about cudaDeviceProp.asyncEngineCount and how a value of 1 is required to overlap a kernel with a transfer, and a value of 2 to overlap a kernel with both an upload and a download. However, I have not yet found any reference to the order in which data transfers are executed when they go in the same direction but belong to separate streams.
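For reference, the copy-engine count can be queried at runtime; a minimal sketch using the standard CUDA runtime API (device index 0 assumed):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // asyncEngineCount: 1 => kernel can overlap with a transfer in one
    // direction at a time; 2 => kernel can overlap with both an upload
    // and a download simultaneously.
    printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);
    return 0;
}
```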

The exact issue I’m up against is as follows. I’m using CUDA 10 with Ubuntu 18.04 on a Jetson AGX. From what I can tell this is Volta architecture, with some odd quirks, like not supporting “concurrent managed access” when using Unified Memory, or bi-directional data transfers overlapping with kernel execution. I’ve just recently diagnosed an issue in my code where cudaMemcpyAsync calls were always being satisfied in the order they were originally issued, despite being issued by separate host threads and directed at separate streams. The way I had things originally written, cudaMemcpyAsync calls were made long in advance, and the result was a near-perfect failure to overlap data transfer and computation. Most of the time, if a given stream finished its kernel execution and was ready to download the results, it would instead sit idle despite there being no active data transfers. After carefully trawling through profiler timelines, it became apparent that each instance of this happened because the memcpy was waiting to start until a copy previously issued to another stream had completed, even when that earlier copy had not itself started yet.
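The pattern in question looks roughly like this (a simplified sketch, not my actual code — the kernel, buffer sizes, and single-threaded issue loop are placeholders; my real code issues from separate host threads):

```cpp
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;  // placeholder computation
}

int main() {
    const int N = 1 << 22, NSTREAMS = 2;
    cudaStream_t s[NSTREAMS];
    float *h[NSTREAMS], *d[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamCreate(&s[i]);
        cudaMallocHost(&h[i], N * sizeof(float));  // pinned, needed for async overlap
        cudaMalloc(&d[i], N * sizeof(float));
    }
    // Upload, kernel, and download are all enqueued well in advance, per stream.
    // On my Jetson the copies nonetheless complete strictly in issue order
    // across streams, serializing behind one another.
    for (int i = 0; i < NSTREAMS; ++i) {
        cudaMemcpyAsync(d[i], h[i], N * sizeof(float), cudaMemcpyHostToDevice, s[i]);
        work<<<(N + 255) / 256, 256, 0, s[i]>>>(d[i], N);
        cudaMemcpyAsync(h[i], d[i], N * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();
    return 0;
}
```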

As an aside, yes I’m aware that memory on the Jetson is shared between the host and gpu, and I could write code that doesn’t require memcpy at all. My team has not yet decided on our target GPU hardware and while the Jetson is a candidate, we will soon evaluate other options. Some of the code I’m writing now is merely prototypes designed to eventually run on other types of systems. I don’t know if this issue is particular to the Jetson or not.

Anyway, I recently made a Reddit post detailing the issue, and someone responded demonstrating that their system does not behave the way mine does. So I’m interested in learning what the actual constraints are and how to determine the relevant scheduling capabilities of different systems. The Reddit post contains a minimal working code example as well as results from the Nsight profiler. I failed to see any forum rules that might forbid posting external links, so I’ll link it directly here. If that’s an issue, though, I can remove the link and paste the code directly instead.


In my experience, few people familiar with NVIDIA’s embedded platforms frequent the general CUDA forums. Asking in the forum dedicated to your platform might yield better and faster answers:


I have used Teslas and Quadros in workstation and server platforms, and what you describe does not match my experience with asynchronous copies on Linux platforms. It is possible that I misinterpreted your description.

What hardware/software platform did this person use to demonstrate the behavior?

Thanks for the suggestion, I’ll cross post in that forum as well. If this is just an issue specific to the Jetson I’ll be somewhat relieved, though it would still be nice if I could find official documentation covering these issues.

As for the other person’s results, they were using CUDA 10.0 on Ubuntu 18 with a GeForce 940MX card.

Two image links showing our respective profiles:

Mine: https://imgur.com/a/uKuwyU7
Theirs: https://i.imgur.com/Y3a3VSZ.png

Notice that in mine the stream 15 download happens at the very end, while in theirs the stream 15 download is capable of overlapping with the stream 14 kernel execution.

While it is not entirely clear what the overall situation is here, I would expect to see something more along the lines of “theirs”, not “yours”. Your observation may be based on something specific to your Jetson platform. Maybe just a configuration setting. Maybe a hardware difference between embedded and discrete GPU solutions.

Generally speaking: Everything within a stream happens in-order. The relative order of events between different non-default streams is undefined.
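In other words, if a particular cross-stream order is required, it has to be imposed explicitly, for example with events. A minimal sketch (stream and event names are illustrative):

```cpp
#include <cuda_runtime.h>

__global__ void work(float *p) { *p += 1.0f; }

int main() {
    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);

    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    float *d;
    cudaMalloc(&d, sizeof(float));

    work<<<1, 1, 0, streamA>>>(d);    // work enqueued in stream A
    cudaEventRecord(done, streamA);   // mark the point stream B must wait for
    cudaStreamWaitEvent(streamB, done, 0);
    work<<<1, 1, 0, streamB>>>(d);    // guaranteed to run after A's kernel

    cudaDeviceSynchronize();
    return 0;
}
```

Absent such explicit dependencies, the runtime is free to interleave or reorder work between the two streams however it likes.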

Note: Cases of false dependencies can occur, for example because of OS driver model restrictions (Windows WDDM) or on old hardware with a single command queue (prior to compute capability 3.5, I think).

Thank you again for your responses. Yes, something like a false dependency arising from the particulars of my system is what I’m hoping to discover the details of. The Jetson AGX I’m using is Volta architecture, but I’ve already found various quirks in that system that make it act like older architectures in some regards. I’ve taken your advice and made a similar posting in the Jetson forums.