Sharing GPU global memory with multiple CPU threads

I know that sharing global GPU memory with multiple processes is possible with CUDA IPC API. How about sharing it among multiple threads of the same process?

Use UMA (cudaMallocManaged) and pass the pointer to whatever host function needs to read the GPU-processed data, respecting the synchronization rules so the program doesn’t crash by accessing, on the CPU, a pointer that is still owned by the GPU.
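A minimal sketch of that pattern, assuming a hypothetical `fill` kernel: the managed pointer is usable from any host thread, but only after synchronizing with the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: fills a buffer on the device.
__global__ void fill(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * i;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;

    // One allocation, visible to both host and device.
    cudaMallocManaged(&data, n * sizeof(float));

    fill<<<(n + 255) / 256, 256>>>(data, n);

    // Synchronize BEFORE touching the pointer on the host;
    // otherwise the CPU may fault on pages still owned by the GPU.
    cudaDeviceSynchronize();

    printf("data[42] = %f\n", data[42]);  // safe: the kernel has finished
    cudaFree(data);
    return 0;
}
```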

If it is all in the same process (and CUDA context), anything allocated with cudaMallocXyz should be freely shareable across threads.

Of course, all the usual caveats about sharing memory across CPU threads and/or GPU kernels apply.

I’ve done this with both cudaMalloc and cudaMallocManaged.

Also, something I’m curious about, since one of the memory-usage patterns in the application I work on seems similar to yours. Memory is allocated with cudaMallocManaged, populated, then used (read-only) simultaneously in multiple kernels launched from multiple threads. If the data isn’t prefetched (cudaMemPrefetchAsync), there’s a huge performance hit. The profiler indicated a massive number of page faults, even when the same kernels are re-run on the same memory.
I’d be curious to know if you experience the same behavior. (This was on linux, CUDA 9.2 and 10, with a 1080 Ti)
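For concreteness, here is a sketch of the prefetch step I mean, with a hypothetical `readOnlyKernel` standing in for the real kernels; each thread would call something like `launch_on_stream` with its own stream.

```cuda
#include <cuda_runtime.h>

// Hypothetical read-only kernel standing in for the real workload.
__global__ void readOnlyKernel(const float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        volatile float v = data[i];  // touch the page, discard the value
        (void)v;
    }
}

// Migrate the managed pages to the GPU up front so kernels launched
// from multiple threads don't all fault the same pages in.
void launch_on_stream(float *data, size_t bytes, int n,
                      cudaStream_t stream, int device) {
    cudaMemPrefetchAsync(data, bytes, device, stream);
    readOnlyKernel<<<(n + 255) / 256, 256, 0, stream>>>(data, n);
}
```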

Thanks for your helpful answers!

This post https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/ explains in depth the mechanisms behind managed memory and why prefetching increases performance. I hope this helps :)

Yes, I’m familiar with that article. It’s very good as far as it goes, but it deals with only a couple of streams.

The application I work on is a somewhat atypical use-case for GPU acceleration. It can launch kernels on a dozen streams (or several dozen, in extreme cases), depending on user load. The page-faulting only appeared with active kernels on several streams (around 8+).

That is generally expected behavior.

Try using a memory hint (cudaMemAdvise) to inform the runtime where the data should stay. Of course, if you access it somewhere else, e.g. from the host during that process, then all bets are off.
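A sketch of what that hint might look like for the read-only-from-many-streams pattern described above (the function name is made up; the advice flags are the standard cudaMemAdvise ones):

```cuda
#include <cuda_runtime.h>

// Hint that a managed buffer is read-mostly so the runtime can keep
// read-only copies resident instead of migrating pages back and forth.
void advise_read_mostly(float *data, size_t bytes, int device) {
    // Pages are duplicated on read access rather than migrated.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, device);
    // Optionally pin the authoritative copy to one device.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);
}
```

Note that a host write to a buffer advised as read-mostly invalidates the duplicated copies, which is one way "accessing it somewhere else" defeats the hint.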