Hi, this is a copy of something I posted in the “CUDA Programming and Performance” forum. I was advised there that I might get better support here, from people more used to Jetson-specific quirks.
I’m trying to find references and documentation about how asynchronous data transfers are scheduled and executed, in particular when they are issued to separate streams. I’ve found tons of references to the basics, about cudaDeviceProp.asyncEngineCount and how a value of 1 is required to overlap a kernel with a transfer, and a value of 2 to overlap a kernel with both an upload and a download. However, I’ve not yet found any reference to the order in which data transfers are executed when they are going in the same direction but belong to separate streams.
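For anyone who wants to compare their own system, the device properties mentioned above can be printed with a short program along these lines. This is only a sketch: describeOverlap is a hypothetical helper I made up to label the values, and none of these properties says anything about the order in which same-direction copies from different streams are executed, which is the part I can't find documented.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): labels what a given
// asyncEngineCount permits, per the usual documentation.
const char* describeOverlap(int asyncEngineCount) {
    if (asyncEngineCount >= 2) return "kernel + upload + download";
    if (asyncEngineCount == 1) return "kernel + one transfer direction";
    return "no copy/kernel overlap";
}

int main() {
    int device = 0;  // assumption: the GPU of interest is device 0
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    std::printf("%s (compute capability %d.%d)\n",
                prop.name, prop.major, prop.minor);
    std::printf("asyncEngineCount:        %d (%s)\n",
                prop.asyncEngineCount, describeOverlap(prop.asyncEngineCount));
    std::printf("concurrentManagedAccess: %d\n", prop.concurrentManagedAccess);
    std::printf("integrated:              %d\n", prop.integrated);
    return 0;
}
```

Built with something like nvcc query.cu -o query, this makes it easy to post the relevant numbers alongside a profiler timeline.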
The exact issue I’m up against is as follows. I’m using CUDA 10 with Ubuntu 18.04 on a Jetson AGX. From what I can tell this is the Volta architecture, with some odd quirks, like not supporting “concurrent managed access” when using Unified Memory, and not supporting bi-directional data transfers overlapping with kernel execution. I’ve just recently diagnosed an issue in my code where cudaMemcpyAsync calls were always being satisfied in the order they were originally issued, despite being issued by separate host threads and being directed towards separate streams. The way I had things originally written, cudaMemcpyAsync calls were made long in advance, and the result was a near-perfect failure to overlap data transfer and computation. Most of the time, if a given stream finished its kernel execution and was ready to download the results, it would instead sit idle despite there being no active data transfers. After carefully trawling through profiler timelines, it became apparent that in each instance the memcpy could not start because a copy previously issued to another stream had not completed (or even started) yet.
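To make the pattern concrete, here is a minimal sketch of the kind of pipeline I mean. It is not the code from my project: it uses a single host thread rather than several, and the scale kernel, runPipeline function and buffer sizes are made up for illustration. Each stream does upload, kernel, download, and the breadthFirst flag only changes the order in which the host issues that work, so the two variants can be compared on a profiler timeline.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch-level error handling: bails out without freeing resources.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d %s\n", __FILE__, __LINE__,    \
                         cudaGetErrorString(e_));                     \
            return false;                                             \
        }                                                             \
    } while (0)

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// breadthFirst = false: issue upload, kernel, download per stream in turn.
// breadthFirst = true:  issue all uploads, then all kernels, then all downloads.
bool runPipeline(bool breadthFirst) {
    const int kStreams = 2;
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    const int threads = 256, blocks = (n + threads - 1) / threads;
    cudaStream_t st[kStreams];
    float* host[kStreams];
    float* dev[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaStreamCreate(&st[s]));
        CHECK(cudaMallocHost(&host[s], bytes));  // pinned host memory
        CHECK(cudaMalloc(&dev[s], bytes));
        for (int i = 0; i < n; ++i) host[s][i] = 1.0f;
    }

    if (breadthFirst) {
        for (int s = 0; s < kStreams; ++s)
            CHECK(cudaMemcpyAsync(dev[s], host[s], bytes,
                                  cudaMemcpyHostToDevice, st[s]));
        for (int s = 0; s < kStreams; ++s)
            scale<<<blocks, threads, 0, st[s]>>>(dev[s], n, 2.0f);
        CHECK(cudaGetLastError());
        for (int s = 0; s < kStreams; ++s)
            CHECK(cudaMemcpyAsync(host[s], dev[s], bytes,
                                  cudaMemcpyDeviceToHost, st[s]));
    } else {
        for (int s = 0; s < kStreams; ++s) {
            CHECK(cudaMemcpyAsync(dev[s], host[s], bytes,
                                  cudaMemcpyHostToDevice, st[s]));
            scale<<<blocks, threads, 0, st[s]>>>(dev[s], n, 2.0f);
            CHECK(cudaGetLastError());
            CHECK(cudaMemcpyAsync(host[s], dev[s], bytes,
                                  cudaMemcpyDeviceToHost, st[s]));
        }
    }
    CHECK(cudaDeviceSynchronize());

    // Every element started at 1.0f and was doubled once.
    bool ok = true;
    for (int s = 0; s < kStreams; ++s)
        ok = ok && host[s][0] == 2.0f && host[s][n - 1] == 2.0f;

    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaFree(dev[s]));
        CHECK(cudaFreeHost(host[s]));
        CHECK(cudaStreamDestroy(st[s]));
    }
    return ok;
}

int main() {
    bool depth = runPipeline(false);
    bool breadth = runPipeline(true);
    std::printf("per-stream issue order: %s\n", depth ? "ok" : "FAILED");
    std::printf("breadth-first issue order: %s\n", breadth ? "ok" : "FAILED");
    return (depth && breadth) ? 0 : 1;
}
```

The program only checks that the results are correct. Whether the copies actually overlap with the kernels, and in what order they run, has to be read off the profiler timeline, and that is exactly the behaviour I am asking about.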
As an aside, yes, I’m aware that memory on the Jetson is shared between the host and GPU, and that I could write code that doesn’t require memcpy at all. My team has not yet decided on our target GPU hardware, and while the Jetson is a candidate, we will soon evaluate other options. Some of the code I’m writing now is merely prototype code designed to eventually run on other types of systems. I don’t know if this issue is particular to the Jetson or not.
Anyway, I’ve recently made a Reddit post detailing the issue, and someone else responded demonstrating that their system does not behave the way I observe. So I’m interested in learning what the actual constraints are and how to determine the relevant scheduling capabilities of different systems. The Reddit post contains a minimal working code example as well as results from the Nsight profiler. I failed to see any forum rules that might forbid me from posting external links, so I’ll link it directly here. If that’s an issue, though, I can remove the link and paste the code directly here.