Enabling Concurrent Managed Memory access on GPU

Hello,

I was wondering if it is possible to set the CUDA device attribute “concurrentManagedAccess” to 1? If concurrentManagedAccess is 0 by default, does this mean the GPU is not capable of concurrent access?

Or is this just the default device setting, which can be changed?

If the GPU is not capable of concurrent managed access, what would be an efficient way to allocate unified memory concurrently while other GPU kernels may be running?

Thanks!

It’s not a setting. It cannot be changed by application software alone.

concurrent managed access (if 1) means effectively that UM demand-paging is in effect. This is both a GPU and a platform issue.

  • it will be 0 on Windows, including WSL.
  • it will be 1 on Pascal or newer GPUs on Linux.
  • There may be other variations; for example, Jetson platforms may be special cases.
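For reference, a minimal sketch of how to query the attribute at runtime (this only reads the value; it cannot be written):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0;
    int cma = 0;
    // Read-only attribute; the value reflects the GPU + OS combination, not a user setting.
    cudaDeviceGetAttribute(&cma, cudaDevAttrConcurrentManagedAccess, dev);
    std::printf("concurrentManagedAccess = %d\n", cma);
    return 0;
}
```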

it doesn’t have anything to do with allocation while kernels are running.

pinned memory is another way to get concurrent access to an allocation.

If you had a Pascal GPU, for example, you could “change” it from 0 to 1 by switching your OS from Windows to Linux (again, not WSL).

@Robert_Crovella Thank you for your quick response. Is pinned memory also a construct which is accessible from both GPU and CPU without memcpy operations?

What is the main difference between pinned memory and unified memory?

Yes, pinned memory is accessible from both CPU and GPU without memcpy operations.
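For illustration, a minimal sketch (the kernel and sizes are just examples) of a pinned, mapped allocation that both host and device touch without any memcpy:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Example kernel that reads and writes the pinned buffer directly over PCIe
// (zero-copy); no cudaMemcpy is involved anywhere.
__global__ void incrementAll(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int* h = nullptr;
    // Pinned host allocation; cudaHostAllocMapped makes it addressable from the device.
    cudaHostAlloc((void**)&h, n * sizeof(int), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h[i] = i;

    int* d = nullptr;
    cudaHostGetDevicePointer((void**)&d, h, 0);  // device-side alias of the same memory
    incrementAll<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    std::printf("h[0] = %d\n", h[0]);            // host sees the kernel's writes directly
    cudaFreeHost(h);
    return 0;
}
```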

unified memory and pinned memory behave in subtly different ways. I don’t know that I will be able to summarize everything.

  • pinned memory is host memory that is accessible to device code. It will “never” be moved to device memory but is accessible from both host and device. On the device side, accesses to it will trigger PCIe activity, which means best-case bandwidth will be on the order of 10-60 GB/s (the PCIe bus bandwidth).
  • unified memory in the best case refers to migratable memory. When a UM allocation is migrated to the device, subsequent device accesses will proceed at “full speed”, which could be 300 GB/s or even much higher, depending on the GPU.

You can get a more orderly treatment of these topics in unit 6 of this online tutorial series. (pinned memory receives some treatment in unit 7).

From what I have seen, for sufficiently large transfers across PCIe, the practically achievable bandwidth is 13 GB/sec (PCIe3), 25 GB/sec (PCIe4), and 50 GB/sec (PCIe5). As PCIe is a full-duplex interconnect, this is per direction, and both directions may be operating concurrently if the GPU has a sufficient number of DMA engines (at least 2).

To create a balanced system, one needs to pay attention to system memory throughput when there are multiple GPUs in the same system that all use a large part of their respective bi-directional PCIe bandwidth.

Try to avoid allocating or pinning any memory while kernels are running and your system is in a time-critical state. Do everything beforehand.

@Curefab I have a use case where images are being streamed at a rapid rate (~60 per second).

Every time a new image arrives, I allocate new CUDA unified memory for GPU processing of that image. Once the image is allocated, a separate CPU thread is signaled to perform the GPU processing of that image. Therefore, the GPU processing of the current streamed image happens concurrently (in a separate CPU thread) with the allocation of newly streamed images, which are again allocated in CUDA unified memory.

I can’t just pre-allocate a single image, as the program needs to handle unified memory allocation of many new images while others are being processed.

@Curefab @Robert_Crovella Do you have a suggestion of an optimal design for such a use case?

If it were my code, I would figure out the maximum number of image allocations needed, allocate those ahead of time, and create a pool of pointers or handles to those images. When a new image is needed, grab one from the pool. When an image is finished, return it to the pool. And if concurrentManagedAccess is zero (such as on Windows), I probably wouldn’t use UM for allocations.
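As a rough sketch of the pool idea (the struct and names below are just illustrative, not any particular library API):

```cpp
#include <mutex>
#include <vector>
#include <cuda_runtime.h>

// Allocate the maximum number of image buffers once, then hand them out and
// return them as frames arrive and finish. (Freeing is omitted for brevity.)
struct ImagePool {
    std::vector<void*> free_;   // buffers currently available
    std::mutex m_;

    ImagePool(size_t imageBytes, int count) {
        for (int i = 0; i < count; ++i) {
            void* p = nullptr;
            cudaMalloc(&p, imageBytes);  // plain device memory; could be pinned host memory instead
            free_.push_back(p);
        }
    }
    void* acquire() {                    // grab a buffer for a newly arrived frame
        std::lock_guard<std::mutex> lk(m_);
        if (free_.empty()) return nullptr;  // caller decides how to handle overrun
        void* p = free_.back();
        free_.pop_back();
        return p;
    }
    void release(void* p) {              // return it once processing is done
        std::lock_guard<std::mutex> lk(m_);
        free_.push_back(p);
    }
};

int main() {
    ImagePool pool(18 * 1024 * 1024, 8);  // e.g. 8 buffers of 18 MB each
    void* img = pool.acquire();
    // ... copy the incoming frame into img and launch processing in a worker thread ...
    pool.release(img);
    return 0;
}
```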

The statement about UM doesn’t, in my view, have much to do with the allocation steps (I’ve already suggested what I would do about those), but rather with usage. When concurrentManagedAccess is zero, it can be more difficult to use UM in a multithreaded, asynchronous scenario. If you want to do it anyway, you could implement something like this if you can arrange for thread “independence” of allocations (that is, an allocation in use by a thread is not in use or touched by any other thread).

On the positive side, UM use with concurrentManagedAccess of zero means that the usual caveats around UM slowing down kernel access (due to demand-paging) and the suggestions around prefetching don’t apply. UM is actually “fairly efficient” from that perspective, when concurrentManagedAccess is zero.

Personally, I would avoid managed memory for performance-critical programs where it is known at which times large blocks of memory are needed on the GPU or CPU, and instead transfer the memory manually with asynchronous copies.

For such cases it is best to use a pinned ring buffer with a capacity of 10 to 20 frames.
It is allocated at the beginning of the program and stays allocated throughout.
The ring buffer keeps track of which images are being processed and where the next free frame is.
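A rough sketch of what I mean (sizes, names, and the single-producer/single-consumer assumption are illustrative only):

```cpp
#include <cuda_runtime.h>

// Pinned ring buffer holding a fixed number of frames, allocated once at startup.
// Not thread-safe as written; add synchronization between the camera and worker threads.
struct PinnedRing {
    unsigned char* base = nullptr;
    size_t frameBytes;
    int capacity;
    int head = 0;   // next free slot for an incoming frame
    int tail = 0;   // oldest frame still being processed

    PinnedRing(size_t bytesPerFrame, int frames)
        : frameBytes(bytesPerFrame), capacity(frames) {
        // One pinned allocation for the whole ring, done once and kept for the program's lifetime.
        cudaHostAlloc((void**)&base, frameBytes * capacity, cudaHostAllocDefault);
    }
    ~PinnedRing() { cudaFreeHost(base); }

    unsigned char* nextFreeFrame() {     // where the camera thread writes the next frame
        if ((head + 1) % capacity == tail) return nullptr;  // ring is full
        unsigned char* slot = base + size_t(head) * frameBytes;
        head = (head + 1) % capacity;
        return slot;
    }
    void frameDone() {                   // called when processing of the oldest frame finishes
        tail = (tail + 1) % capacity;
    }
};
```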

Also: find out early on how the newly arriving images are stored by their source (e.g. a camera driver). Perhaps even that buffer can be pinned and used directly, to avoid CPU-to-CPU memory copies into your second ring buffer (or alternatively, the camera API might let you provide the (pinned) memory storage, which the driver would then use).
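If the camera API hands you a buffer it has already allocated, something along these lines may be able to pin it in place (whether this works depends on how the driver allocated the memory; the helper name is made up):

```cpp
#include <cuda_runtime.h>

// Hypothetical helper: pin a buffer that was allocated by the camera API so that
// cudaMemcpyAsync from it is a true asynchronous DMA and can overlap with running kernels.
bool pinCameraBuffer(void* camBuf, size_t camBufBytes) {
    // Only works if the driver gave us ordinary pageable host memory.
    return cudaHostRegister(camBuf, camBufBytes, cudaHostRegisterDefault) == cudaSuccess;
    // Remember to call cudaHostUnregister(camBuf) before the camera API frees the buffer.
}
```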

@Curefab Thanks for the feedback. I had considered this option also. Would you consider 18 MB of GPU memory a “large” block? That’s the amount of memory needed per image.

Transferring 18 MB between CPU and GPU takes around 1 ms (depending on PCIe generation). With managed memory and several transfers it could take several ms. At 60 fps you get a new frame every ~16.7 ms, so a few ms could worsen your envisioned performance.
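As a rough check against the per-direction PCIe figures quoted earlier: 18 MB ÷ 13 GB/s ≈ 1.4 ms (PCIe3), 18 MB ÷ 25 GB/s ≈ 0.7 ms (PCIe4), 18 MB ÷ 50 GB/s ≈ 0.4 ms (PCIe5), against the ~16.7 ms frame budget at 60 fps.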

One can optimize managed memory with calls like cudaMemPrefetchAsync to get nearly the performance of normal global device memory. But then the effort to manage it manually is at least as high as if you just managed your device memory yourself.
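For example, a sketch of that prefetching pattern (the kernel and sizes are placeholders; prefetching applies where concurrentManagedAccess is 1, i.e. the Linux case discussed earlier):

```cpp
#include <cuda_runtime.h>

// Stand-in for the real image-processing kernel.
__global__ void scaleAll(float* img, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) img[i] *= 2.0f;
}

// Prefetch a managed allocation onto the GPU before the kernel needs it,
// so the kernel does not pay for demand-paging faults.
void processManaged(float* managedImg, size_t n, int device, cudaStream_t stream) {
    cudaMemPrefetchAsync(managedImg, n * sizeof(float), device, stream);
    unsigned int blocks = (unsigned int)((n + 255) / 256);
    scaleAll<<<blocks, 256, 0, stream>>>(managedImg, n);
    // Optionally prefetch back to the host when the CPU needs the result:
    cudaMemPrefetchAsync(managedImg, n * sizeof(float), cudaCpuDeviceId, stream);
}
```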

Managed memory has use cases for

  • beginner’s code,
  • quick-and-dirty solutions,
  • debugging,
  • for memory structures where only part of the memory has to be copied back and forth, and the parts are large enough for page-wise copies (otherwise use zero-copy memory),
  • and when it is not known beforehand whether the memory will be accessed or not (and you want to save unnecessary copies).
  • It could also be helpful for multi-GPU setups.

Unoptimized managed memory can be really slow.