Ordering of cudaMemcpyAsync issued to separate streams

Hi,

I’m trying to find references and documentation about how asynchronous data transfers are scheduled and executed, in particular when they are issued to separate streams. I’ve found plenty of material on the basics, such as cudaDeviceProp.asyncEngineCount and how a value of 1 is required to overlap a kernel with a single transfer, and a value of 2 to overlap a kernel with both an upload and a download. However, I have not yet found any reference describing the order in which data transfers are executed when they go in the same direction but belong to separate streams.
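
For reference, this is the sort of minimal query I’ve been using to check those properties (a rough sketch; the device index 0 is just an assumption):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: query the copy-engine and concurrency properties of device 0.
// asyncEngineCount == 1 means one copy can overlap kernel execution;
// asyncEngineCount == 2 means an upload and a download can both overlap a kernel.
int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Device:            %s\n", prop.name);
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    return 0;
}
```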

The exact issue I’m up against is as follows. I’m using CUDA 10 with Ubuntu 18.04 on a Jetson AGX. From what I can tell this is the Volta architecture, with some odd quirks, such as not supporting “concurrent managed access” when using Unified Memory or bi-directional data transfers overlapping with kernel execution. I’ve just recently diagnosed an issue in my code where cudaMemcpyAsync calls were always being satisfied in the order they were originally issued, despite being issued by separate host threads and being directed to separate streams. The way I had things originally written, the cudaMemcpyAsync calls were made long in advance, and the result was a near-perfect failure to overlap data transfer and computation. Most of the time, if a given stream finished its kernel execution and was ready to download the results, it would instead sit idle despite there being no active data transfers. After carefully trawling through profiler timelines, it became apparent that in each instance the memcpy was waiting to start on a copy previously issued to another stream that had not yet completed (or even started).
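
For concreteness, here is a rough sketch of the pattern I’m describing (this is not my actual reproducer, which is in the reddit post mentioned below; for brevity it issues everything from a single host thread, and the buffer sizes and kernel are placeholders):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel that just burns some time on the data.
__global__ void busyKernel(float* data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.000001f + 0.000001f;
        data[i] = v;
    }
}

int main() {
    const int n = 1 << 22;                  // arbitrary size for illustration
    const size_t bytes = n * sizeof(float);

    float *hA, *hB, *dA, *dB;
    cudaHostAlloc((void**)&hA, bytes, cudaHostAllocDefault);  // pinned, needed for truly async copies
    cudaHostAlloc((void**)&hB, bytes, cudaHostAllocDefault);
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Both pipelines are issued up front, as in my original code.
    // What I observe is that the D2H copy in one stream will not start
    // until copies previously issued to the other stream have completed.
    cudaMemcpyAsync(dA, hA, bytes, cudaMemcpyHostToDevice, s0);
    busyKernel<<<(n + 255) / 256, 256, 0, s0>>>(dA, n, 2000);
    cudaMemcpyAsync(hA, dA, bytes, cudaMemcpyDeviceToHost, s0);

    cudaMemcpyAsync(dB, hB, bytes, cudaMemcpyHostToDevice, s1);
    busyKernel<<<(n + 255) / 256, 256, 0, s1>>>(dB, n, 2000);
    cudaMemcpyAsync(hB, dB, bytes, cudaMemcpyDeviceToHost, s1);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(dA);
    cudaFree(dB);
    cudaFreeHost(hA);
    cudaFreeHost(hB);
    return 0;
}
```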

As an aside: yes, I’m aware that memory on the Jetson is shared between the host and the GPU, and I could write code that doesn’t require memcpy at all. My team has not yet decided on our target GPU hardware, and while the Jetson is a candidate, we will soon evaluate other options. Some of the code I’m writing now consists of prototypes intended to eventually run on other types of systems. I don’t know whether this issue is particular to the Jetson or not.
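
Just to illustrate what I mean by “no memcpy at all”, here is a hedged sketch of the zero-copy alternative using mapped pinned memory (the kernel and sizes are purely illustrative):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel operating in place on a buffer.
__global__ void scale(float* data, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= f;
}

int main() {
    const int n = 1 << 20;
    float* hData  = nullptr;
    float* dAlias = nullptr;

    // Enable mapped pinned allocations (already implicit on devices with unified addressing).
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Host allocation that is also mapped into the device address space,
    // so the kernel can read and write it directly with no cudaMemcpy.
    cudaHostAlloc((void**)&hData, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&dAlias, hData, 0);

    for (int i = 0; i < n; ++i) hData[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(dAlias, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFreeHost(hData);
    return 0;
}
```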

Anyway, I recently made a reddit post detailing the issue, and someone else responded demonstrating that their system does not behave the way mine does. So I’m interested in learning what the actual constraints are and how to determine the relevant scheduling capabilities of different systems. The reddit post contains a minimal working code example as well as results from the Nsight profiler. I didn’t see any forum rules that would forbid posting external links, so I’ll link it directly here. If that’s an issue, though, I can remove the link and paste the code directly here.

In my experience few people familiar with NVIDIA’s embedded platforms frequent the general CUDA forums. Asking in the forum dedicated to your platform might yield better / faster answers:

https://devtalk.nvidia.com/default/board/326/jetson-agx-xavier/

I have used Teslas and Quadros in workstation and server platforms, and what you describe does not match my experience with asynchronous copies on Linux platforms. It is possible that I misinterpreted your description.

What hardware/software platform did this person use to demonstrate the behavior?

Thanks for the suggestion; I’ll cross-post in that forum as well. If this is just an issue specific to the Jetson, I’ll be somewhat relieved, though it would still be nice to find official documentation covering these issues.

As for the other person’s results, they were using CUDA 10.0 on Ubuntu 18 with a GeForce 940MX card.

Two image links showing our respective profiles:

Mine: (Imgur link)
Theirs: https://i.imgur.com/Y3a3VSZ.png

Notice that in mine the stream 15 download happens at the very end, while in theirs the stream 15 download is capable of overlapping with the stream 14 kernel execution.

While it is not entirely clear what the overall situation is here, I would expect to see something more along the lines of “theirs”, not “yours”. Your observation may be down to something specific to your Jetson platform. Maybe just a configuration setting. Maybe a hardware difference between embedded and discrete GPU solutions.

Generally speaking: Everything within a stream happens in-order. The relative order of events between different non-default streams is undefined.

Note: Cases of false dependencies can occur, for example because of OS driver model restrictions (Windows WDDM) or on old hardware with a single command queue (prior to compute capability 3.5, I think).
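
To make that concrete, here is a minimal sketch (kernel names and sizes are placeholders) of what the in-order guarantee gives you within a stream, and how one would impose an explicit ordering across streams with an event when it is actually needed:

```cpp
#include <cuda_runtime.h>

// Placeholder kernels: one writes a buffer, the other reads it.
__global__ void producer(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = static_cast<float>(i);
}

__global__ void consumer(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *dA, *dB;
    cudaMalloc((void**)&dA, n * sizeof(float));
    cudaMalloc((void**)&dB, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    // Within s0, work executes in the order it was issued.
    producer<<<(n + 255) / 256, 256, 0, s0>>>(dA, n);
    cudaEventRecord(done, s0);

    // Across streams there is no implied ordering; without this wait,
    // the consumer in s1 could run before, during, or after the producer in s0.
    cudaStreamWaitEvent(s1, done, 0);
    consumer<<<(n + 255) / 256, 256, 0, s1>>>(dA, dB, n);

    cudaDeviceSynchronize();

    cudaEventDestroy(done);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(dA);
    cudaFree(dB);
    return 0;
}
```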

Thank you again for your responses. Yes, a false dependency arising from the particulars of my system is exactly the kind of thing I’m hoping to pin down. The Jetson AGX I’m using is the Volta architecture, but I’ve already found various quirks in that system that make it act like older architectures in some regards. I’ve taken your advice and made a similar post in the Jetson forums.