I was not getting the expected concurrency when using multiple streams, and realized the issue comes from a restriction detailed in section 3.2.5.5.4. Implicit Synchronization of the Programming Guide:
"Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:
- a page-locked host memory allocation,
- a device memory allocation,
- a device memory set,
- a memory copy between two addresses to the same device memory,
- any CUDA command to the NULL stream,
- a switch between the L1/shared memory configurations described in Compute Capability 2.x and Compute Capability 3.x."
The issue is that I allocate pinned host memory and device memory between two sets of operations issued in different streams. The workaround is simple enough: allocate all the memory beforehand. But that assumes you know in advance exactly how much memory you will need, or that you build some sort of complex memory management mechanism on top.
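For reference, here is a minimal sketch of the workaround. The two-stream setup, the `process` kernel, and the buffer sizes are placeholders I made up for illustration (error checking omitted for brevity), not my actual code:

```
#include <cuda_runtime.h>

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void)
{
    const int nStreams = 2;
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *hostBuf[nStreams], *devBuf[nStreams];
    cudaStream_t stream[nStreams];

    // Allocate ALL pinned host and device memory up front: a cudaMallocHost
    // or cudaMalloc issued between operations in different streams would
    // implicitly synchronize them.
    for (int s = 0; s < nStreams; ++s) {
        cudaMallocHost((void **)&hostBuf[s], bytes);
        cudaMalloc((void **)&devBuf[s], bytes);
        cudaStreamCreate(&stream[s]);
    }

    // With no allocations in between, the copies and kernels issued to the
    // different streams are free to overlap.
    for (int s = 0; s < nStreams; ++s) {
        cudaMemcpyAsync(devBuf[s], hostBuf[s], bytes,
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(n + 255) / 256, 256, 0, stream[s]>>>(devBuf[s], n);
        cudaMemcpyAsync(hostBuf[s], devBuf[s], bytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(devBuf[s]);
        cudaFreeHost(hostBuf[s]);
    }
    return 0;
}
```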
My question, therefore, is: is this restriction likely to be removed in a future release?
If so, would anyone else like to see functionality where memory allocations (and deallocations) could be enqueued in a stream just like transfers and kernels? That way, memory would be in use only where and when it is needed. A rough sketch of what I mean follows.
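To make the idea concrete: the function names and signatures below are entirely hypothetical (they are not part of the current runtime API), so this is an illustration of the proposal, not compilable code. It reuses the placeholder `process` kernel from the earlier sketch:

```
// HYPOTHETICAL API -- these declarations do not exist in the CUDA runtime;
// they only illustrate the stream-ordered allocation being proposed.
cudaError_t cudaMallocAsync(void **devPtr, size_t size, cudaStream_t stream);
cudaError_t cudaFreeAsync(void *devPtr, cudaStream_t stream);

// With such an API, a scratch buffer would live only for the stream
// operations that need it, and the host thread would never create an
// implicit synchronization point between streams.
void stage(cudaStream_t stream, const float *hostIn, float *hostOut,
           size_t bytes, int n)
{
    float *scratch;
    cudaMallocAsync((void **)&scratch, bytes, stream);          // enqueued
    cudaMemcpyAsync(scratch, hostIn, bytes,
                    cudaMemcpyHostToDevice, stream);            // enqueued
    process<<<(n + 255) / 256, 256, 0, stream>>>(scratch, n);   // enqueued
    cudaMemcpyAsync(hostOut, scratch, bytes,
                    cudaMemcpyDeviceToHost, stream);            // enqueued
    cudaFreeAsync(scratch, stream);                             // enqueued
}
```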