I’m writing two programs that work like this:
app 1: do some computation and save the huge amount of data into memory (preferably device memory).
app 2: read the data from memory and render it.
I’m now using host memory to transfer the data between them, and it is too slow (app1 device memory -> app1 host memory -> app2 host memory -> app2 device memory).
It would be more efficient if they could share the same device memory.
Is there any way I can do it?
Use two threads instead of two processes, and share a single context between them. CUDA is a user-mode driver, and letting processes pass contexts around however they want would be a security hole waiting to happen.
The only other way would be to move all the CUDA work into a third process, and have your two big applications communicate with it using some form of interprocess communication (shared memory, pipes, whatever works).
I have not tried combining shared memory with CUDA, but from the application's point of view it looks like regular memory (a void *).
The only other alternative I can see is to look into the clone system call. clone() starts a separate process, but the two processes can share memory segments, file descriptor tables, I/O contexts, etc.
Take a look at all the flags.
You could write a small starter process that launches the two big applications using clone. The two big processes could then hopefully share resources such as CUDA device memory (it won't work if CUDA relies on something that isn't shared). You would also need synchronization between them, and perhaps to call cuInit() only once. Experiment.
edit: after clone is called with a function pointer in the helper process, exec* would then be called. But I am unsure whether exec will clobber the shared tables and create new ones.
Long ago, back in one of the first CUDA releases, I had a program that accidentally set up a device context before calling fork() (which on Linux is just implemented with clone()), and both processes were able to use the CUDA device I had configured. I was told by an NVIDIA employee on the forum that such usage was not intended, and that I should not rely on it in the future. So clone() might still work, but this is definitely unsupported CUDA usage and could break in any release.