How to share the same device memory between 2 processes

I’m writing 2 programs that work like this:
app 1: does some computation and saves a huge amount of data into memory (preferably device memory).
app 2: reads from that memory and renders it.

I’m currently passing the data through host memory, and it is too slow (app1 device memory -> app1 host memory -> app2 host memory -> app2 device memory).
It would be more efficient if both apps could share the same device memory.
Is there any way I can do it?

I don’t think you can “share” the memory, but you could just pass along the device-memory pointer from app1 to app2, and then app2 to kernel2.

Do you understand what I mean? :)

(Note, I haven’t tried this myself, but I think it will work)

This will not work at all. There is no way to do it.

Yes. I’ve tried passing the device memory pointer to app2 and I got the “INVALID DEVICE MEMORY POINTER” error. So I think the device memory pointer is tied to the memory context of the thread that allocated it.

But I do need an efficient way to SHARE or COPY device memory between 2 processes.

If I can’t do it, CUDA will be of little use to me.

I suppose nothing prevents this except that the driver/SDK does not implement it.

Use two threads instead of two processes, and pass a context around between them. CUDA is a user-mode driver, and being able to pass contexts between processes however you want is a security hole waiting to happen.

I have 2 very big applications and I can’t merge them into one single application.

Being able to pass contexts between processes may be a security hole, but I can accept that because the applications are only used on our local computer.

Is there any way or trick to make it work?

The only other way would be to move all the CUDA tasks into a 3rd process, and have your other two big applications communicate with it using some kind of interprocess communication (shared memory, pipes, whatever works).
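A minimal sketch of that third-process layout in plain C, using pipes as the IPC channel. It makes no CUDA calls: worker_compute is a hypothetical stand-in for whatever the CUDA-owning process would do with the device, and ask_worker is an illustrative name, not part of any library. Here the worker is forked on demand and serves a single request; a real worker would stay alive, keep its device allocations, and serve both applications.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the GPU work; a real worker would launch a kernel here. */
static int32_t worker_compute(int32_t x)
{
    return x * 2;
}

/* Worker side: read one request, compute, write one reply, then exit. */
static void worker_once(int in_fd, int out_fd)
{
    int32_t req;
    if (read(in_fd, &req, sizeof req) != (ssize_t)sizeof req)
        _exit(1);
    int32_t rep = worker_compute(req);
    if (write(out_fd, &rep, sizeof rep) != (ssize_t)sizeof rep)
        _exit(1);
    _exit(0);
}

/* Client side: start the worker, send one request, wait for the reply.
 * Returns 0 on success and -1 on any failure. */
int ask_worker(int32_t request, int32_t *reply)
{
    int to_worker[2], from_worker[2];
    if (pipe(to_worker) != 0)
        return -1;
    if (pipe(from_worker) != 0) {
        close(to_worker[0]);
        close(to_worker[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(to_worker[0]);
        close(to_worker[1]);
        close(from_worker[0]);
        close(from_worker[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: keep only the ends the worker needs. */
        close(to_worker[1]);
        close(from_worker[0]);
        worker_once(to_worker[0], from_worker[1]);
    }

    /* Parent: keep only the ends the client needs. */
    close(to_worker[0]);
    close(from_worker[1]);
    int ok = write(to_worker[1], &request, sizeof request) == (ssize_t)sizeof request
          && read(from_worker[0], reply, sizeof *reply) == (ssize_t)sizeof *reply;
    close(to_worker[1]);
    close(from_worker[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (ok && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Only small requests and replies cross the pipe here. The point of the layout is that the large buffers never leave the worker, so the two big applications would send commands rather than data.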

Thanks.

I understand now that it is impossible to copy or share device memory between processes.

But it would make CUDA more usable if NVIDIA put some effort into getting this done.

Maybe you can try:

Make a shared memory segment (shmget, shmat) and use a process lock to synchronize both processes.

Use CUDA mapped memory with this segment (CUDA manual, section 3.2.5.3). Normally this maps host memory to device memory (zero copy).
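A minimal sketch of the host-side half of this suggestion, assuming Linux and System V shared memory. It covers only shmget/shmat and the synchronization; mapping the segment into CUDA is not shown and has not been tried here. Two assumptions keep it short: the two "apps" are a parent and a forked child sharing an IPC_PRIVATE segment (independent programs would instead agree on a key), and a C11 atomic ready flag stands in for the process lock (a real program would more likely use a process-shared mutex or semaphore). The names shared_block and exchange_through_shm are illustrative.

```c
#include <sched.h>
#include <stdatomic.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define N_ITEMS 256

/* Layout of the shared segment: a ready flag followed by the data. */
struct shared_block {
    atomic_int ready;        /* 0 = not written yet, 1 = payload is complete */
    int payload[N_ITEMS];
};

/* The producer (child) fills the segment; the consumer (parent) waits for
 * the flag and sums the payload. Returns the sum, or -1 on failure. */
long exchange_through_shm(void)
{
    int id = shmget(IPC_PRIVATE, sizeof(struct shared_block), IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    struct shared_block *blk = shmat(id, NULL, 0);
    /* Mark the segment for removal right away; it is actually freed once
     * the last attached process detaches or exits. */
    shmctl(id, IPC_RMID, NULL);
    if (blk == (void *)-1)
        return -1;
    atomic_init(&blk->ready, 0);

    pid_t pid = fork();
    if (pid < 0) {
        shmdt(blk);
        return -1;
    }
    if (pid == 0) {
        /* "app 1": write the data, then publish it with a release store. */
        for (int i = 0; i < N_ITEMS; i++)
            blk->payload[i] = i;
        atomic_store_explicit(&blk->ready, 1, memory_order_release);
        _exit(0);
    }

    /* "app 2": wait until the data is published, then read it.
     * This sketch would spin forever if the producer died early. */
    while (atomic_load_explicit(&blk->ready, memory_order_acquire) == 0)
        sched_yield();
    long sum = 0;
    for (int i = 0; i < N_ITEMS; i++)
        sum += blk->payload[i];

    waitpid(pid, NULL, 0);
    shmdt(blk);
    return sum;
}
```

With the payload filled with 0 through 255, exchange_through_shm() returns 32640. Both processes see the same physical host pages, so no host-to-host copy takes place; the device transfers discussed in this thread are a separate matter.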

But there is still a data transfer route, “device -> host -> device processor”, which is a performance bottleneck.

I don’t think that zero copy can solve this problem.

BTW, I don’t know how to map a block of shmget memory to a CUDA device. Is that possible?

I have not tried combining shared memory with CUDA, but from the application’s point of view it looks like regular memory (void *).

The only other alternative I can see is to look into the clone system call. Clone starts separate processes, but they can share memory segments, file descriptor tables, I/O contexts, etc.

man clone

Take a look at all the flags.

You can write a small starter process that launches the big applications using clone. The two big processes can then hopefully share resources such as CUDA device memory (this won’t work if CUDA relies on something that isn’t shared). You would have to add synchronization as well, and maybe initialize CUDA only once. Experiment.

Edit: after clone is called with a function pointer in the helper process, exec* should be called. But I am unsure whether this will clobber the shared tables and create new ones.
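A minimal sketch of what the clone flags change, assuming Linux with glibc. It involves no CUDA at all and says nothing about whether a CUDA context survives this; it only shows that a child created with CLONE_VM writes into the parent's memory, while a child created without it writes into a private copy. The names child_main and run_clone_demo are illustrative.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (1024 * 1024)

/* Runs in the child: write through the pointer handed over by the parent. */
static int child_main(void *arg)
{
    int *value = arg;
    *value = 42;
    return 0;
}

/* Start a child with the given clone flags, wait for it, and return the
 * value the parent observes afterwards (-1 on failure). */
int run_clone_demo(int flags)
{
    int value = 0;
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downwards on common architectures, so pass its top.
     * SIGCHLD as the exit signal lets waitpid() treat this like a normal child. */
    pid_t pid = clone(child_main, stack + CHILD_STACK_SIZE, flags | SIGCHLD, &value);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    waitpid(pid, NULL, 0);
    free(stack);
    return value;
}
```

run_clone_demo(CLONE_VM) returns 42 because parent and child share one address space, while run_clone_demo(0) returns 0 because the child only modified its own copy. The sketch stops short of calling exec* in the child, which is the step the edit above is unsure about.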

Long ago, back in one of the first CUDA releases, I had a program that accidentally set up a device context before calling fork() (which is implemented with clone() on Linux), and both processes were able to use the CUDA device I had configured. I was told by an NVIDIA employee in the forum that such usage was not intended and that I should not rely on it in the future. So clone() might still work, but this is definitely unsupported CUDA usage and could break in any release.

True, although if CUDA supports threading, and clone can share almost every OS construct, I don’t see much of a difference. If it’s not for mass-production use, try it out.