cudaMemcpyAsync, unexpected behaviour while using cudaStreamNonBlocking?

Usually I can find answers pretty quickly just by searching for them, but this time I was unable to find anything.

My understanding of how cudaMemcpyAsync works is that it allows for asynchronous transfers, at the cost of certain safeties that the synchronous behaviour can uphold, one of those safeties being that the synchronous behaviour guarantees that the correct data is delivered.

cudaMemcpyAsync usually enforces safety by requiring pinned memory, otherwise performing as a synchronous call instead. However, when using a stream created with the cudaStreamNonBlocking flag, it does not make this enforcement and allows asynchronous calls with non-pinned host memory.

I’m concerned, as would this not technically allow us to perform unsafe asynchronous calls? Am I missing something else? What exactly is different between a normal stream and a stream created with the cudaStreamNonBlocking flag set to true?
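Roughly, the pattern I’m asking about looks like the following sketch (buffer names and sizes here are just for illustration, not my actual code):

```
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t N = 1 << 20;
    float *h_data = (float*)malloc(N * sizeof(float));  // pageable, NOT pinned
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    // My expectation was that this would fall back to synchronous behaviour
    // because h_data is not pinned; instead it appears to run asynchronously.
    cudaMemcpyAsync(d_data, h_data, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```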

I wouldn’t describe it that way. Asynchronous data transfer still transfers the “correct” data. If you believe that the data is “incorrect”, it probably means you don’t understand stream semantics or don’t understand your application behavior.

Again I would not describe it that way, but your basic observation is correct. If you do a cudaMemcpyAsync on a data buffer that is not pinned, but set the cudaStreamNonBlocking flag when you create the stream used for the transfer, then you may observe overlap/concurrency/asynchronous behavior between e.g. that data transfer and a kernel call issued to another stream. This appears to be in conflict with certain statements made in the CUDA C programming guide.

In any event, cudaMemcpyAsync issued to a non-default stream should be assumed to have asynchronous characteristics. None of this means “incorrect” data will be transferred, but it does mean that you will need to understand stream semantics as well as the behavior of your application, in order to understand what data will be transferred. None of this should be conflated with “safety”, whatever that means in this context.

I also personally would not rely on this asynchronous-behavior-even-without-pinned-memory for my application design, as it is not clearly defined in the CUDA C programming guide, and I’m unaware of any design pattern that should require it or be unachievable in some other way.

I’m not claiming to be an expert on stream semantics or anything of the sort, but from what I could understand, unpinned memory has a chance of being page-swapped out of physical memory and transferring incorrect data, and is thus in some ways “unsafe”.

I did some tests both with and without cudaStreamNonBlocking and found that there was a chance that the cudaStreamNonBlocking streams would not pass the tests. The original idea was that with unpinned memory the data transfers would be safe, so I could safely assign cudaStreamNonBlocking to all my streams and unpinned memory transfers would automatically default to synchronous behaviour, but alas.

This error could be due either to incorrect data transfers, to dependencies not being registered, or to something else that I’m so far unaware of, which I’ll have to investigate further.

Lastly, what do you mean by stream semantics?

No, that isn’t how it works. It’s true that unpinned memory could be swapped out. However, just like any other request for unpinned/swapped out memory, a request to use that memory will force the (operating system) memory manager to bring that memory back into physical RAM again, before the request is allowed to complete. You don’t run the risk of getting invalid data because a page is swapped out.

Yes, I would agree that if you are depending on the conversion-to-synchronous behavior of a call to a function that is explicitly marked as Async, then your expectations may not be met if you choose cudaStreamNonBlocking, just as you have pointed out and discovered. However, I’ve already suggested I would not depend on that design pattern (instead: if you make an Async call, expect the possibility of async behavior), and furthermore you seem to be intentionally digging a hole and then wondering why you are in the bottom of a hole. You could rectify this in a number of ways that I would consider to be “typical” CUDA programming, for example, pin the memory and expect asynchronous behavior. Or don’t use cudaStreamNonBlocking.
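For illustration, a minimal sketch of the pin-the-memory approach suggested above (buffer names and the commented-out kernel are placeholders):

```
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;
    float *h_pinned, *d_data;
    cudaMallocHost(&h_pinned, N * sizeof(float));  // page-locked host memory
    cudaMalloc(&d_data, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);  // an ordinary non-default stream

    // With pinned memory, asynchronous behaviour is the documented case:
    // the copy may overlap with work in other streams, and anything that
    // depends on it is ordered by issuing it into the same stream.
    cudaMemcpyAsync(d_data, h_pinned, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    // myKernel<<<grid, block, 0, stream>>>(d_data);  // hypothetical follow-up

    cudaStreamSynchronize(stream);  // don't reuse h_pinned before this returns
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_pinned);
    return 0;
}
```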

I’m using the word “semantics” here as a substitute for the word “meaning”. Asking “What are the semantics of streams?” is like asking “What is the meaning of streams?”, i.e. what is the description, definition and behavior of streams.

Stream semantics are quite simple in my view, and describe nearly all stream behavior I am aware of:

  1. activities issued to the same stream will serialize. That is, if item B is issued to a stream after item A is issued to the same stream, then item B will not begin execution until item A has finished execution.
  2. activities issued to separate streams have no defined ordering relationship.

The main area of stream behavior that they don’t define is the CUDA default stream (NULL stream), which has modifiable behavior. For this, a very simple and safe design pattern is to never use the default stream for any timing or performance sensitive code. With that assumption, items 1 and 2 fully describe stream semantics, in my view/opinion/definition. Furthermore, with that design pattern, use of cudaStreamNonBlocking should not be needed, and therefore confusion around the behavior of cudaStreamNonBlocking is avoided, in my view.
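As a minimal sketch of those two rules in action (the kernels here are trivial placeholders, not from any particular application):

```
#include <cuda_runtime.h>

// Trivial placeholder kernels, just to illustrate stream ordering.
__global__ void itemA(float *d) { d[0] += 1.0f; }
__global__ void itemB(float *d) { d[0] *= 2.0f; }
__global__ void itemC(float *d) { d[0] -= 1.0f; }

int main() {
    float *d;
    cudaMalloc(&d, sizeof(float));
    cudaMemset(d, 0, sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    itemA<<<1, 1, 0, s1>>>(d);  // item 1: itemB will not begin until
    itemB<<<1, 1, 0, s1>>>(d);  //         itemA has finished (same stream)
    itemC<<<1, 1, 0, s2>>>(d);  // item 2: no defined ordering vs. s1 work

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d);
    return 0;
}
```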

In my view, a programming mindset that says I will issue operations into separate streams, and then expect that because I did not pin the memory I will get a conversion to synchronous behavior, and I’m actually depending on that synchronous behavior for correctness (of ordering of activity), is really not consistent with the CUDA stream programming model.

When you issue operations into separate streams, you should assume there is no defined ordering relationship between those two operations. That’s just my opinion, as are most of the comments I post here.

I’m sure there is an alternate viewpoint that says “I should be able to depend on published statements of behavior.” For that, I have no response. You’re welcome to file bugs at developer.nvidia.com to correct any behavioral issues you perceive.

Well, that’s one misassumption that I made about the function’s behaviour. My idea was that the data could possibly be misread, and therefore, in order to guarantee this not happening, the synchronous method would pin and release the memory before and after the transfer. This makes me wonder why the asynchronous behaviour does not work without pinned memory, as at this point it seems like an optimization issue more than anything else, which could be left to the programmer? Only a thought, of course.

For the time being, I’ve stopped using it. I’m simply trying to learn what’s going on here, since clearly there must be something I don’t understand. My idea assumed that it would default to the synchronous behaviour, not because of a strict need for synchronous behaviour, but because the synchronous behaviour would automatically pin memory before transfers and therefore be safe. Given that asynchronous behaviour is also safe, there is something else that I misunderstood, which seems to be the ordering of calls.

Another misassumption that I made, it seems. I assumed that CUDA might be able to detect data dependencies across streams.

I looked up why I assumed that cudaMemcpyAsync would perform as a synchronous method in certain conditions and it was actually another comment from you, years ago.
https://devtalk.nvidia.com/default/topic/899020/cuda-programming-and-performance/does-cudamemcpyasync-require-pinned-memory-/post/4737986/#4737986

The problem that I am currently working on is a system where there are already many different GPU kernel calls located in different parts of the code. I’m simply trying to find a way to optimize the behaviour so that kernels that can work together on the GPU actually do so, which requires creating streams and events and calling asynchronous functions.

I’m not here to complain about anything, I’m just here to try to understand what I don’t by asking people who I expect do. So far, this has already been enlightening.

My best guess, without a case to inspect, is that you have tripped over an ordering issue.

I fully agree that it’s reasonable to expect that cudaMemcpyAsync from a non-pinned host allocation would result in synchronizing behavior. Ignoring what I said or didn’t say, it seems quite explicit based on the wording in the programming guide:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-data-transfers
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#overlap-of-data-transfer-and-kernel-execution

“Some devices can perform an asynchronous memory copy to or from the GPU concurrently with kernel execution. Applications may query this capability by checking the asyncEngineCount device property (see Device Enumeration), which is greater than zero for devices that support it. If host memory is involved in the copy, it must be page-locked.”

(an approximately identical statement is repeated in sections 3.2.5.3 and 3.2.5.4 of the current CUDA 9.2 programming guide)

The previous thread you linked also excerpted a CUDA blog which said something similar. So I don’t think I’m manufacturing anything out of whole cloth here. The programming guide statement seems fairly clear. On the other hand, it’s easy to disprove this with a fairly simple test case using a stream created with the non-blocking flag. In that case, concurrent data transfer and kernel execution can be witnessed without the use of pinned/page-locked memory.
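A sketch of the kind of test case I have in mind (sizes and the busy-loop kernel are arbitrary; the overlap would be observed with a profiler such as nvprof or Nsight Systems):

```
#include <cuda_runtime.h>
#include <cstdlib>

// Long-running placeholder kernel, just to keep the GPU busy.
__global__ void busyKernel(float *d, int iters) {
    float v = d[threadIdx.x];
    for (int i = 0; i < iters; ++i) v = v * 1.000001f + 0.5f;
    d[threadIdx.x] = v;
}

int main() {
    const size_t N = 1 << 24;                      // large enough to take time
    float *h = (float*)malloc(N * sizeof(float));  // pageable, NOT pinned
    float *d_buf, *d_work;
    cudaMalloc(&d_buf, N * sizeof(float));
    cudaMalloc(&d_work, 256 * sizeof(float));

    cudaStream_t sKernel, sCopy;
    cudaStreamCreate(&sKernel);
    cudaStreamCreateWithFlags(&sCopy, cudaStreamNonBlocking);

    busyKernel<<<1, 256, 0, sKernel>>>(d_work, 1 << 22);
    cudaMemcpyAsync(d_buf, h, N * sizeof(float),
                    cudaMemcpyHostToDevice, sCopy);  // pageable source

    cudaDeviceSynchronize();
    cudaStreamDestroy(sKernel);
    cudaStreamDestroy(sCopy);
    cudaFree(d_buf);
    cudaFree(d_work);
    free(h);
    return 0;
}
```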

This seems to be a poorly documented case. Therefore I think the safe thing to say is that, based on observation, the non-blocking flag appears to abrogate the above quoted statement in the programming guide, at least under some test conditions.

Since I am beating a dead horse, I’ll repeat myself. I think there is a fairly straightforward usage of CUDA streams, while avoiding the usage of the default stream altogether, which allows the programmer to handle any desired and feasible scenario with CUDA streams, without running into these issues. When you need an ordered activity relationship, use item 1 of the previously recited stream semantics to enforce whatever ordering is needed. Do not (i.e. I personally would not) depend on any other mechanism to enforce needed ordering.
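For example, a minimal sketch of the item-1 pattern, with copy, kernel, and copy-back all issued into the same non-default stream (the kernel and sizes are placeholders):

```
#include <cuda_runtime.h>

__global__ void process(float *d, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const size_t N = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));  // pinned, so the copy is async
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Same stream => the kernel will not begin until the copy has finished,
    // and the copy back will not begin until the kernel has finished.
    cudaMemcpyAsync(d, h, N * sizeof(float), cudaMemcpyHostToDevice, stream);
    process<<<(N + 255) / 256, 256, 0, stream>>>(d, N);
    cudaMemcpyAsync(h, d, N * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```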

And, similarly, unless you have a solid understanding of what the non-blocking flag does, I don’t recommend using it.

Thanks for the help. For the time being, I have no choice but to use the default stream. I’m working hard on trying to get rid of it, but it’s not easy: not all kernel-creating locations have access to each other right now, and I don’t like the idea of each method creating and managing its own stream, as I would like to have some form of centralized control over the streams and events so that I can manage dependencies on data.

Because of this current dependency on the default stream, I will not see overlap between computation and data transfer unless I use the non-blocking flag.
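Once I do manage to move off the default stream, my rough idea for expressing a cross-stream data dependency looks something like this sketch (the kernels, streams and event here are all hypothetical placeholders):

```
#include <cuda_runtime.h>

__global__ void producer(float *d) { d[0] = 42.0f; }
__global__ void consumer(float *d) { d[0] += 1.0f; }

int main() {
    float *d;
    cudaMalloc(&d, sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    producer<<<1, 1, 0, s1>>>(d);
    cudaEventRecord(done, s1);         // mark completion of producer in s1
    cudaStreamWaitEvent(s2, done, 0);  // s2 work waits for that point in s1
    consumer<<<1, 1, 0, s2>>>(d);      // guaranteed to see producer's result

    cudaDeviceSynchronize();
    cudaEventDestroy(done);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d);
    return 0;
}
```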

I’ll ask for more help when I have more substantial questions to be answered, as most of my questions right now can probably be found in the manuals or from other posts.