Sharing CUDA Host Memory Between Processes

I am currently using cuMemHostAlloc for the performance advantages of pinned, write-combined memory.
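For reference, the allocation is basically this (a sketch with an example size, error handling mostly omitted):

#include <cuda.h>
#include <stddef.h>

/* Sketch: allocate pinned, write-combined host memory with the driver API.
   Assumes cuInit() and a current context already exist; the size is just an example. */
static void *alloc_wc_buffer(size_t nbytes)
{
    void *host_buf = NULL;
    if (cuMemHostAlloc(&host_buf, nbytes, CU_MEMHOSTALLOC_WRITECOMBINED) != CUDA_SUCCESS)
        return NULL;
    /* CPU writes and host-to-device copies are fast; CPU reads from
       write-combined memory are very slow, so it's effectively upload-only. */
    return host_buf;
}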

I would like to add the ability for a separate process to place data directly into that memory. As far as I know, there is no way to share cuMemHostAlloc’d memory with another process.

The other option, instead of sharing CUDA malloc’d memory, is to pin/write-combine a buffer allocated with shmget(), but I doubt CUDA would treat the memory the same way.

Obviously I could just allocate interprocess memory with shmget() and memcpy to the pinned/WC memory before upload, but the extra memcpy will pretty well negate the benefits of using the CUDA malloc in the first place.

Any ideas?

I think you can actually call cudaHostRegister on an appropriately allocated piece of memory and it will work. No promises, though; I’ve never tried it…

Thanks tmurray, that looks promising. I’m using the driver API and I totally ignored the runtime functions. I’ll give it a try and post my results.

Oh, it’s cuMemHostRegister in the driver API.
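Untested, but the call should look roughly like this (the buffer name is a placeholder; the pointer and size want to be page-aligned, which shmat()/mmap() results are):

#include <cuda.h>
#include <stddef.h>

/* Untested sketch: page-lock an already-allocated host buffer so CUDA
   treats it like pinned memory. buf and nbytes are placeholders. */
static int pin_existing_buffer(void *buf, size_t nbytes)
{
    /* 0 = default flags; CU_MEMHOSTREGISTER_PORTABLE would pin it for all contexts. */
    if (cuMemHostRegister(buf, nbytes, 0) != CUDA_SUCCESS)
        return -1;
    /* Call cuMemHostUnregister(buf) before freeing or detaching the buffer. */
    return 0;
}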

I’m also trying to pin memory that is shared between processes. I’m using boost::interprocess to create the shared memory, instead of shmget, but I’m sure it’s the same thing. However, when I pin the memory using cudaHostRegister(), the memory no longer functions as shared memory between the processes!

Looking at the memory usage, it appears that it increases with the number of processes that pin the memory, even though it is supposedly shared. In other words, suppose I create 1 GB of shared memory. If I pin it using cudaHostRegister() in three processes, the total memory usage increases by 3 GB, which is strange considering that the same RAM should be shared by the three processes. If I pin the memory in just one of the processes, the memory usage is 1 GB, but only that one process has the speed benefit of pinned memory.

So, my question is the same as the original one: does anyone know how to share pinned memory between heavyweight (forked) processes?

I just started doing actual bandwidth tests, and transfers from shmget() memory that is page-locked by cuMemHostRegister() are identical (+/- 10 us for 3 MB transfers) to transfers from cuMemHostAlloc() memory, which satisfies my original concern.

Your comment about the sharing capability disappearing is a little troubling. I haven’t tried actually using the shared memory between processes yet, but I’ll post with results. Luckily, I’m only using the GPU and pinning memory in a single process that acts like an aggregation point for all my other processes.

Edit: I can confirm that the memory allocated with shmget() and registered with cuMemHostRegister() is still acting as “regular” shared memory between processes, and is also seeing the speedup from being page-locked.

Process A has CUDA context and shmget() memory. Process B does nothing but open the shared memory, write an image, post semaphore, and repeat. Process A, waiting for the semaphore, uploads the image to the GPU at full speed.
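Stripped down, process A’s side looks something like this (just a sketch: the key, image size, and semaphore name are placeholders, and error handling/cleanup are left out):

#include <cuda.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <semaphore.h>
#include <fcntl.h>

#define IMG_BYTES (3 * 1024 * 1024)   /* placeholder image size */
#define SHM_KEY   0x1234              /* placeholder System V key */

/* Sketch of process A: create/attach the shared segment, page-lock it once,
   then upload an image to the GPU each time process B posts the semaphore. */
static void consumer_loop(CUdeviceptr d_img)
{
    int   shmid = shmget(SHM_KEY, IMG_BYTES, IPC_CREAT | 0666);
    void *img   = shmat(shmid, NULL, 0);

    cuMemHostRegister(img, IMG_BYTES, 0);        /* page-lock the mapping */

    sem_t *ready = sem_open("/img_ready", O_CREAT, 0666, 0);

    for (;;) {
        sem_wait(ready);                         /* B wrote a new image  */
        cuMemcpyHtoD(d_img, img, IMG_BYTES);     /* full-speed upload    */
        /* ... launch kernels on d_img ... */
    }
}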

Okay, I figured it out. The problem was boost::interprocess, which doesn’t use the same mechanism for shared memory as shmget (boost uses a ramdisk in /dev/shm). Once I used shmget, I could pin the shared memory and see the increase in data transfer speed to the GPU. Also, the memory usage made sense with shmget: I could pin the same shared memory in multiple processes, and the memory usage didn’t increase with the number of processes where I pinned the memory.

This even worked with the large amount of memory I needed to pin (3 GB). I did need to increase the maximum size of the shared memory for my needs, using:

sysctl -w kernel.shmmax=3000000000

My system functions like this: process A creates the shared memory. Processes B and C (which control one GPU each) attach the shared memory, and pin it. Process D attaches the shared memory as well. Process D writes images to the shared memory, and signals process A using a message queue. Process A determines which GPU to send the job to and signals either process B or C. Finally, process B or C does the fast transfer to the GPU.
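For what it’s worth, the attach-and-pin step in processes B and C is roughly this shape (a sketch with placeholder key and size, no error handling):

#include <cuda.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stddef.h>

/* Sketch of what each GPU-owning process (B or C) does: attach the segment
   that A already created, then page-lock its own mapping of it. */
static void *attach_and_pin(key_t key, size_t nbytes)
{
    int   shmid = shmget(key, nbytes, 0666);   /* no IPC_CREAT: A made it */
    void *buf   = shmat(shmid, NULL, 0);

    /* Each process registers its own mapping; with shmget-backed memory the
       physical memory use did not multiply with the number of processes. */
    cuMemHostRegister(buf, nbytes, 0);
    return buf;
}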

Thanks c-bro for reporting on your results, which were very helpful.

Hi David. About sharing CUDA buffers between two processes: your information in this thread was very helpful. I am still struggling with sharing a CUDA host-pinned memory buffer between two Linux processes. Would you consider sharing a small working code example with me? My email is gonen.raveh@orbotech.com

Regards

Gonen

Hi DavidS – I am working on a similar problem and I was hoping you would not mind sharing a snippet of your code. You could post it directly to the forums, or I could send you my personal email address. If I understand right, you were able to allocate a pinned buffer in host memory using cudaMallocHost in one process and then share that memory with other processes? This is exactly what I am trying to do.

Thanks in advance…

Hi DavidS,

I have the same problem: I am trying to pin shared host memory with cuMemHostRegister on macOS, but it doesn’t work. Could you share more information about how you managed that, maybe some code?