How to keep host-pinned shared memory across fork(2)?

I’d like to know whether it is possible to implement the following scenario using the CUDA driver API.

A parent process allocates a host shared-memory segment up front, then registers it as a host-pinned memory region (probably using cuMemHostRegister). This process also listens for TCP/IP connections and fork(2)s a child process for each connection.
The child process inherits the memory map of the parent process, so the shared-memory segment should be visible to the child as well. The child then launches a GPU kernel to process data in the shared-memory segment.

From my investigation, cuMemHostGetFlags() can be used to check whether a particular host address lies in a host-pinned memory region. When I called this function on the shared-memory segment in the parent process, it returned CUDA_SUCCESS.
However, it returned CUDA_ERROR_NOT_INITIALIZED in the child process, and it still failed, this time with CUDA_ERROR_INVALID_VALUE, even after I added a cuInit() call just after the fork(2).

Is there a good way to keep the host-pinned state of shared memory across fork(2)?
My application manages 100GB-200GB of data in shared memory owned by the parent process, and each child process references several GB of it at random, unpredictable locations, so it is not an option to call cuMemHostRegister for each process fork(2).

“My application manages 100GB-200GB of data in shared memory owned by the parent process, and each child process references several GB of it at random”

how much memory do you pin at a time then - how much pinned memory do you allocate?

All of it. I’d like to pin the whole 100GB-200GB shared memory buffer at once, and I
also want child processes to reference this area without preparing it every time.

as far as i understand, pinned memory implies page locked memory, which in turn implies ‘significant drainage on system resources’

how much memory do you have in your host?

384GB (24x16GB DIMMs) with 2 Xeon processors.
Our application server tries to allocate 67% of system RAM for the shared buffer to preload the data.
The remaining 128GB is enough for everything else on the system, including the OS and other applications.

how did you allocate the pinned memory - which api did you use?

Depending on the system configuration, either SysV shared memory (shmget(2)) or POSIX shared memory (shm_open(3)) can be used.
Once the application has allocated the shared memory segment, I call cuMemHostRegister() on it. That is all I did to pin the host shared memory segment.

why must it be shared memory?

and how many times do you call cuMemHostRegister() on the ‘shared memory’?

personally, i am not sure whether i would pin shared memory; and secondly, i would prefer cudaHostAlloc() over cudaHostRegister()

it is rather a sound question - can/ should shared memory be pinned…?

Ages back there was a considerable advantage in using pinned host memory, i.e. a user buffer
(host virtual memory) locked into (host) physical memory by the host operating system.
It avoided the NVIDIA driver copying on the host PC from user memory to system memory.
With a large data transfer this could make the transfer twice as fast. I suspect with
more recent NV drivers, this is no longer true.
Also I guess you are allowing your GPU to access the host server’s memory directly
(at one time this was called zero copy), so I guess the driver is not involved at all.
However I’m sure you do not want the host operating system to page out to disk
host buffers which the GPU is about to use.
BTW “shared memory” is a highly confusing term to use in a CUDA discussion.
Bill

“why must it be shared memory?”

Because the fork(2)'d child processes may update the contents of the host shared memory
simultaneously.

“and how many times do you call cuMemHostRegister() on the ‘shared memory’?”

I intend to call cuMemHostRegister() on the host shared memory once at system
start-up time, then I’d like to know the way to hand over the memory pinning
status to the child process that is fork(2)'d from the above parent.

“personally, i am not sure whether i would pin shared memory; and secondly,
i would prefer cudaHostAlloc() over cudaHostRegister()”

In my case, pinned private memory does not make sense…

“it is rather a sound question - can/ should shared memory be pinned…?”

I’d also like to know. NV’s driver is not open source.

“Ages back there was a considerable advantage in using pinned host memory, i.e. a user buffer
(host virtual memory) locked into (host) physical memory by the host operating system.
It avoided the NVIDIA driver copying on the host PC from user memory to system memory.
With a large data transfer this could make the transfer twice as fast. I suspect with
more recent NV drivers, this is no longer true.”

That’s new to me. Did NVIDIA announce anything about this somewhere?
If you can point me to articles or slides, that would be helpful.

“Also I guess you are allowing your GPU to access the host server’s memory directly
(at one time this was called zero copy), so I guess the driver is not involved at all.
However I’m sure you do not want the host operating system to page out to disk
host buffers which the GPU is about to use.”

Not correct. The host address space I want to map is much larger than the GPU’s
address space, so I guess the zero-copy feature would not work well (though I have
not tested it yet).

“BTW ‘shared memory’ is a highly confusing term to use in a CUDA discussion.”

Yep. I like OpenCL’s naming convention more; private, local and global.

“Ages back there was a considerable advantage in using pinned host memory”

this is not disputed, and i doubt whether anything has changed

“so I guess the driver is not involved at all”

on the contrary, it seems the driver is much involved, if i read the api documentation on pinned memory and programming guide correctly

“I intend to call cuMemHostRegister() on the host shared memory once at system
start-up time”

once is fine, more than once may cause problems though, if i understand the api correctly

“I’d like to know the way to hand over the memory pinning
status to the child process that is fork(2)'d from the above parent”

i would simply have the parent create or register pinned memory, and then just pass the appropriate, resultant pointer to the children

“Because the fork(2)'d child processes may update the contents of the host shared memory
simultaneously”

i still do not see why this makes shared memory compulsory; it seems you wish to ensure visibility/ synchronization of data via shared memory, and i am not convinced that it would be the optimal approach
shared memory may be great for inter-process, but you do not require inter-process; and compared to other approaches, i would deem shared memory more expensive
you mentioned shmget(2), so by shared memory, i take it that you imply: shared memory

this is my objection to pinning shared memory: on the device, shared memory is a hardware phenomenon; on the host, shared memory is more of a software phenomenon

your case is rather unique; i suggest writing a small program to validate the concept
in the program, have the parent create and pin memory, rather small in size, write known values to it, pass the pointer to the children, and see if the children can read back the same data
if it works for a small/ short memory array, it should work for a large array, more or less
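
A minimal host-side sketch of the small validation program suggested above, assuming Linux with POSIX mmap/fork. It covers only the host half: the parent creates a shared mapping, writes known values, forks, and the child reads them back through the same pointer and writes one byte back. The function name shared_buffer_survives_fork is made up for illustration, mlock() is only a stand-in for page-locking, and the cuMemHostRegister() call is left as a comment so the sketch runs without a GPU. It says nothing about whether the CUDA registration itself carries over to the child.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Host-only check: a MAP_SHARED buffer created before fork(2) is visible,
 * at the same address, in the child. Returns 0 on success, -1 on failure.
 * Assumption: in the real test, cuMemHostRegister(buf, len, ...) would be
 * called in the parent right after the mapping is created. */
int shared_buffer_survives_fork(size_t len)
{
    assert(len >= 2);

    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    /* Best-effort page locking; may fail under a low RLIMIT_MEMLOCK. */
    if (mlock(buf, len) != 0)
        perror("mlock (continuing without page locking)");

    memset(buf, 0xAB, len);          /* known values written by the parent */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(buf, len);
        return -1;
    }

    if (pid == 0) {
        /* Child: same pointer value, same underlying pages. */
        int ok = 1;
        for (size_t i = 0; i < len; i++) {
            if (buf[i] != 0xAB) {
                ok = 0;
                break;
            }
        }
        buf[0] = 0xCD;               /* write back so the parent can see it */
        _exit(ok ? 0 : 1);
    }

    /* Parent: the child must have seen our data, and we must see its write. */
    int status = 0;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
        buf[0] == 0xCD && buf[1] == 0xAB)
        result = 0;

    munmap(buf, len);
    return result;
}
```

A return value of 0 from shared_buffer_survives_fork(4096) means the child saw the parent's data and the parent saw the child's write, i.e. the plain host pointer is usable on both sides of the fork.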

In short, I think the exact scenario you describe will not be possible.

First of all, I would recommend that you familiarize yourself with the CUDA simpleIPC sample application:

http://docs.nvidia.com/cuda/cuda-samples/index.html#simpleipc

Although it’s not a direct implementation of your scenario, it illustrates some important concepts. One important concept is that a CUDA context should not be established in a parent process if the GPU(s) are intended to be used in a child process. Fork the process first, then establish the CUDA context. CUDA contexts are unique to each process, and in general are not shareable amongst processes (although contexts from separate child processes can coexist, and “share” devices).

The above info should shed some light on why you are seeing errors in the child process when you have performed a CUDA operation in the parent process before the fork.
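
The ordering described above (fork first, then set up per-process GPU state in the child) can be sketched without any CUDA calls. This is only an illustration under stated assumptions: struct gpu_state, gpu_state_init, serve_connection and fork_then_init are hypothetical names, and the stand-in state just records which process initialized it. In a real application the body of gpu_state_init is where cuInit(), context creation and the per-process registration would go.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for per-process GPU state (a CUDA context in the
 * real application). It only records which process initialized it. */
struct gpu_state { pid_t owner; };

static struct gpu_state g_state = { 0 };

/* Assumption: real code would call cuInit() and create a context here. */
static void gpu_state_init(void)
{
    g_state.owner = getpid();
}

/* Child body: initialize GPU state only after the fork, then do the work. */
static int serve_connection(void)
{
    gpu_state_init();
    return g_state.owner == getpid() ? 0 : 1;
}

/* Fork one child per "connection"; the parent never initializes GPU state.
 * Returns the number of children that exited cleanly. */
int fork_then_init(int nchildren)
{
    int ok = 0;

    assert(g_state.owner == 0);      /* nothing initialized before fork */
    for (int i = 0; i < nchildren; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0)
            _exit(serve_connection());

        int status = 0;
        if (waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

After fork_then_init(3) returns, g_state.owner in the parent is still 0: each child set up its own copy of the state, and nothing the children initialized flows back to the parent or to sibling processes.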

Regarding your objective, in a nutshell, I don’t believe you will be able to share a mapped pointer in any way. The process of host allocation, and pinning of the memory, is something that can be done once. But the mapping registration process is something that will have to be replicated in each cuda context, where you want to use the host pointer as a mapped device pointer.

To be sure, I’ve tested a few other scenarios, such as attempting to share the mapped pointer via CUDA IPC, and also attempting to directly extract a device pointer from a shmget/shmat pointer that has been registered/pinned/mapped in another process. Neither idea works. (And, pondering it, I’m not surprised.)

I presume your motivation is that you don’t want to pay the overhead of mapping/pinning, and that is the reason for your statement:

“it is not an option to call cuMemHostRegister for each process fork(2).”

Some of the overhead is mitigated by (allocating and) pinning in one process, leaving only the mapping operation to be performed in the other processes. In a quick test, however, the cost of the mapping operation alone was still significant. The map/pin process required 0.7s for a 1GB allocation. The map-only process required 0.4s per process (CUDA 6.5/RHEL 6.2).

If the large overhead is what you are trying to avoid, the best suggestion I can offer is to consider if you can run your application as a multi-threaded one rather than a multi-process one. This will result in a significant reduction in code complexity, and also avoid the multiple-context/multiple-process overhead issues I’ve mentioned here.
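
A rough host-only sketch of the multi-threaded layout, assuming POSIX threads. All workers live in one process, so they share one address space and one buffer pointer; in the real application that buffer would be registered once and used by every worker. The names process_with_threads, struct slice and MAX_WORKERS are made up for illustration, and the "work" here is just filling bytes, not a GPU kernel launch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 16

/* One contiguous slice [begin, end) of the shared buffer per worker. */
struct slice {
    unsigned char *buf;
    size_t begin;
    size_t end;
};

/* Worker: writes directly into the buffer owned by the process. */
static void *worker(void *p)
{
    struct slice *s = p;
    for (size_t i = s->begin; i < s->end; i++)
        s->buf[i] = (unsigned char)(i & 0xFF);
    return NULL;
}

/* Split buf into nthreads slices and process them concurrently.
 * Returns 0 on success, -1 on a bad argument or thread creation failure. */
int process_with_threads(unsigned char *buf, size_t len, int nthreads)
{
    pthread_t tid[MAX_WORKERS];
    struct slice sl[MAX_WORKERS];
    int started = 0;
    int rc = 0;

    assert(buf != NULL);
    if (nthreads < 1 || nthreads > MAX_WORKERS)
        return -1;

    size_t chunk = (len + (size_t)nthreads - 1) / (size_t)nthreads;
    for (int t = 0; t < nthreads; t++) {
        size_t begin = (size_t)t * chunk;
        if (begin > len)
            begin = len;
        size_t end = begin + chunk;
        if (end > len)
            end = len;
        sl[t] = (struct slice){ buf, begin, end };
        if (pthread_create(&tid[t], NULL, worker, &sl[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }

    /* Join only the threads that were actually started. */
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    return rc;
}
```

Unlike the fork(2) design, nothing has to be handed over here: every worker receives the same pointer into the same process, so any one-time setup done on that buffer before the threads start stays in effect for all of them.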

I can share some code that I used to evaluate some of this, if you’re interested. However it doesn’t use the driver API nor does it demonstrate a way to achieve exactly what you want, so I’ve omitted it for now. It’s just a hacked up version of the simpleIPC app.

“One important concept is that a CUDA context should not be established in a parent process if the GPU(s) are intended to be used in a child process. Fork the process first, then establish the CUDA context. CUDA contexts are unique to each process, and in general are not shareable amongst processes (although contexts from separate child processes can coexist, and “share” devices)”

txbob, are you thus saying that 2 child cuda contexts, of the same process, running on the same host and utilizing the same device, can not share a common (device) pointer, created beforehand?
thus, it is not possible for a (parent) process to allocate device memory, instigate a number of child threads, and pass the pointer of the allocated device memory to the children?
if so, could you explain this, please
i do not see why 2 cuda contexts can not share the same device memory, just as 2 host threads can/ would share the same host memory

never mind, i was confusing pthread with fork