The above gist contains a simple program that launches a background thread to perform a cuMemcpyHtoDAsync, while the main thread attempts to allocate a new device buffer with cuMemAlloc. For memory copies of around 4 MB, it seems that the cuMemAlloc call blocks until the cuMemcpyHtoDAsync running in the background has completed.
The gist demonstrates this by blocking the stream with cuStreamWaitValue32 before enqueueing the memory copy. The result is that the cuMemAlloc call never completes: the driver seems to refuse to allocate new memory while there is an outstanding memory copy waiting in the stream.
Is this expected behaviour? I couldn’t see anything in the documentation that suggests a cuMemAlloc call cannot proceed while there are outstanding memory copies.