Sharing GPU global memory with multiple CPU threads

I know that sharing global GPU memory with multiple processes is possible with CUDA IPC API. How about sharing it among multiple threads of the same process?

Use UMA (cudaMallocManaged) and pass the pointer to whatever host function needs to read the GPU-processed data, respecting the synchronization rules so the program doesn’t crash by accessing, on the CPU, a pointer that is still owned by the GPU.
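A minimal sketch of that pattern, assuming a hypothetical `fill` kernel: the managed pointer is usable from any host thread, but only after synchronizing with the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: fills a buffer on the device.
__global__ void fill(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * i;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;

    // One allocation, visible to both host and device.
    cudaMallocManaged(&data, n * sizeof(float));

    fill<<<(n + 255) / 256, 256>>>(data, n);

    // Synchronize BEFORE touching the pointer on the host;
    // otherwise the CPU may fault on pages still owned by the GPU.
    cudaDeviceSynchronize();

    printf("data[42] = %f\n", data[42]);  // safe: the kernel has finished
    cudaFree(data);
    return 0;
}
```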

If it is all in the same process (and CUDA context), anything allocated with cudaMallocXyz should be freely shareable across threads.

Of course, all the usual caveats about sharing memory across CPU threads and/or GPU kernels apply.

I’ve done this with both cudaMalloc and cudaMallocManaged.

Also, something I’m curious about, since one of the memory-usage patterns in the application I work on seems similar to yours. Memory is allocated with cudaMallocManaged, populated, then used (read-only) simultaneously in multiple kernels launched from multiple threads. If the data isn’t prefetched (cudaMemPrefetchAsync), there’s a huge performance hit. The profiler indicated a massive number of page faults, even when the same kernels are re-run on the same memory.
I’d be curious to know if you experience the same behavior. (This was on linux, CUDA 9.2 and 10, with a 1080 Ti)
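For concreteness, here is a sketch of the prefetch step I mean, with a hypothetical `readOnlyKernel` standing in for the real kernels; each thread would call something like `launch_on_stream` with its own stream.

```cuda
#include <cuda_runtime.h>

// Hypothetical read-only kernel standing in for the real workload.
__global__ void readOnlyKernel(const float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        volatile float v = data[i];  // touch the page, discard the value
        (void)v;
    }
}

// Migrate the managed pages to the GPU up front so kernels launched
// from multiple threads don't all fault the same pages in.
void launch_on_stream(float *data, size_t bytes, int n,
                      cudaStream_t stream, int device) {
    cudaMemPrefetchAsync(data, bytes, device, stream);
    readOnlyKernel<<<(n + 255) / 256, 256, 0, stream>>>(data, n);
}
```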

Thanks for your helpful answers!

This post https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/ explains in depth the mechanisms behind managed memory and why prefetching increases performance. I hope this helps :)

Yes, I’m familiar with that article. It’s very good as far as it goes, but it deals with only a couple of streams.

The application I work on is a somewhat atypical use-case for GPU acceleration. It can launch kernels on a dozen streams (or several dozen, in extreme cases), depending on user load. The page-faulting only appeared with active kernels on several streams (around 8+).

That is generally expected behavior.

Try using a memory hint (cudaMemAdvise) to inform the runtime where the data should stay. Of course, if you access it somewhere else, e.g. from the host during that process, then all bets are off.
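A sketch of what that hint might look like for the read-only-from-many-streams pattern described above (the function name is made up; the advice flags are the standard cudaMemAdvise ones):

```cuda
#include <cuda_runtime.h>

// Hint that a managed buffer is read-mostly so the runtime can keep
// read-only copies resident instead of migrating pages back and forth.
void advise_read_mostly(float *data, size_t bytes, int device) {
    // Pages are duplicated on read access rather than migrated.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, device);
    // Optionally pin the authoritative copy to one device.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);
}
```

Note that a host write to a buffer advised as read-mostly invalidates the duplicated copies, which is one way "accessing it somewhere else" defeats the hint.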