cuMemAlloc blocks until cuMemcpyHtoDAsync completes

The above gist contains a simple program that launches a background thread to perform a cuMemcpyHtoDAsync while the main thread attempts to allocate a new device buffer with cuMemAlloc. For memory copies of around 4 MB, the cuMemAlloc call appears to block until the cuMemcpyHtoDAsync running in the background has completed.

The gist demonstrates this by blocking the stream with cuStreamWaitValue32 before enqueueing the memory copy. The result is that the cuMemAlloc call never completes: the driver seems to refuse to allocate new memory while there is an outstanding memory copy waiting in the stream.
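For context, here is a minimal sketch of the reproducer as I understand it from the description above (my reconstruction, not the actual gist; names and error handling are mine). It assumes a device that supports stream memory operations for cuStreamWaitValue32:

```cuda
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <thread>

#define CHECK(call) do { CUresult r_ = (call); if (r_ != CUDA_SUCCESS) { \
    const char *s_ = nullptr; cuGetErrorString(r_, &s_); \
    std::fprintf(stderr, "%s failed: %s\n", #call, s_ ? s_ : "?"); std::exit(1); \
} } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev;  CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    const size_t bytes = 4u << 20;  // ~4 MB, the size where blocking was observed
    CUdeviceptr dst;  CHECK(cuMemAlloc(&dst, bytes));
    void *src;        CHECK(cuMemAllocHost(&src, bytes));  // pinned host buffer

    CUstream stream;  CHECK(cuStreamCreate(&stream, 0));
    CUdeviceptr flag; CHECK(cuMemAlloc(&flag, sizeof(unsigned)));
    CHECK(cuMemsetD32(flag, 0, 1));  // flag starts at 0, so the wait below stalls

    std::thread bg([&] {
        CHECK(cuCtxSetCurrent(ctx));
        // Stall the stream until *flag == 1, then enqueue the copy behind it.
        CHECK(cuStreamWaitValue32(stream, flag, 1, CU_STREAM_WAIT_VALUE_EQ));
        CHECK(cuMemcpyHtoDAsync(dst, src, bytes, stream));
    });
    bg.join();  // the enqueue calls return; only the stream itself is stalled

    // Observed behaviour: with the copy pending behind the stalled wait,
    // this allocation never returns.
    CUdeviceptr extra;
    CHECK(cuMemAlloc(&extra, bytes));
    std::puts("cuMemAlloc returned");  // never reached in the failing case
    return 0;
}
```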

Is this expected behaviour? I couldn’t see anything in the documentation that suggests a cuMemAlloc call cannot proceed while there are outstanding memory copies.

I haven’t studied your gist closely; I can take a closer look if you wish. But in general, I expect device memory allocation requests to be blocking and synchronizing. These requests adjust the memory map of the GPU, and may have other ramifications in a UVA regime, and so that is how I rationalize or justify the behavior.

There are plenty of what I would call similar inquiries littered about the forums like here and on SO.

My general understanding is that a device memory allocation request forces all previous CUDA activity to complete before it will proceed.
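As an aside, if the implicit synchronization is the problem, the stream-ordered allocator may be worth a look: cuMemAllocAsync (CUDA 11.2+, on devices that support memory pools) is ordered only with respect to its stream rather than against all prior context activity. A sketch, not from the gist, with error handling omitted:

```cuda
#include <cuda.h>

// Stream-ordered allocation plus copy (CUDA 11.2+ driver API). The
// allocation is enqueued on `stream` and does not synchronize with
// unrelated prior work the way cuMemAlloc can.
CUdeviceptr alloc_and_copy(CUstream stream, const void *host, size_t bytes) {
    CUdeviceptr p;
    cuMemAllocAsync(&p, bytes, stream);        // enqueued, returns immediately
    cuMemcpyHtoDAsync(p, host, bytes, stream); // ordered after the allocation
    return p;  // usable by later work enqueued on the same stream
}
```

The matching cuMemFreeAsync is likewise stream-ordered, so the whole allocate/use/free lifetime can stay inside one stream.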

Some indication of this is in the programming guide:

and (for the runtime API, but driver API should be similar AFAIK):

Some possibly relevant info may also be in the discussion of UVA regime:

If this is not what you had in mind, please clarify. I may have misunderstood your question.

Thanks for the reply - that certainly makes sense, and indeed would rationalise the behaviour that I’m seeing.