GPU Inter-Process Communications(IPC) question

I am trying to write a program that writes a few values to an array on the GPU and then have another completely different program go into GPU global memory and retrieve the values written by program no. 1. On host memory this is fairly straight forward by using shmget() and similar functions and creating a shared memory segment. Does anyone know if this is possible with GPU memory? So far I have tried using some of the Runtime API IPC calls. I have been able to write one program that uses cudaMalloc() to allocate the array and then uses cudaMemcpy to write the array values to the GPU. After this I use cudaIpcGetMemHandle() with the pointer returned by cudaMalloc to get a cudaIpcMemHandle_t type. How do I then uses this type to pass to program no. 2 to read the values written by program 1? Does anyone have any bright ideas? I have looked at the IPC example in the sample codes, but this doesn’t answer my question on how to pass the cudaIpcMemHandle_t type between programs. Thanks.

unix domain sockets

pipes/fifos
http://linux.die.net/man/3/mkfifo

mmap()
http://linux.die.net/man/2/mmap

And if you create one process as a child of the other process, I think that presents other possibilities as well.

This seems to be a good resource:

http://www.advancedlinuxprogramming.com/alp-folder/alp-ch05-ipc.pdf

The CUDA IPC sample code may also be of interest:

http://docs.nvidia.com/cuda/cuda-samples/index.html#simpleipc

And the CUDA runtime IPC API:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE

without subtracting anything from what cuda maestro txbob noted;

i would prefer something ‘childish’ as it would naturally yield the level of organization/ ‘loose synchronization’ you likely would require:

a) 2 generally needs to know that 1 has already allocated the array, and has completed writing to it
b) something needs to clean up - it would likely be 1. 1 then needs to know 2 has completed reading

if child threads share the same process memory space, i do not see why you can not simply pass a pointer to the array (to children), or (a pointer to) a structure containing such data; the same would apply to shared memory (between the processes); if 1 receives a pointer on allocation, and is able to access the array via the pointer, then 2 should be able to access the same array, when passed (a copy of) the pointer

to date, the assumption seems to be that 1 and 2 are on the same machine/ node

The cuda IPC sample code demonstrates the use of mmap() to pass IPC handles between processes.

For amusement purposes, I tried implementing the fifo method (named pipe) that I referenced above, to demonstrate a different approach. The following is the code of the two independent applications:

app1.cu:

// app 1, part of a 2-part IPC example
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#define DSIZE 1

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

int main(){
  system("rm -f testfifo");  // remove any debris
  int ret = mkfifo("testfifo", 0600); // create fifo
  if (ret != 0) {printf("mkfifo error: %d\n",ret); return 1;}
  int *data;
  cudaMalloc(&data, DSIZE*sizeof(int));
  cudaCheckErrors("malloc fail");
  cudaMemset(data, 0, DSIZE*sizeof(int));
  cudaCheckErrors("memset fail");
  cudaIpcMemHandle_t my_handle;
  cudaIpcGetMemHandle(&my_handle, data);
  unsigned char handle_buffer[sizeof(my_handle)+1];
  memset(handle_buffer, 0, sizeof(my_handle)+1);
  memcpy(handle_buffer, (unsigned char *)(&my_handle), sizeof(my_handle));
  cudaCheckErrors("get IPC handle fail");
  FILE *fp;
  printf("waiting for app2\n");
  fp = fopen("testfifo", "w");
  if (fp == NULL) {printf("fifo open fail \n"); return 1;}
  for (int i=0; i < sizeof(my_handle); i++){
    ret = fprintf(fp,"%c", handle_buffer[i]);
    if (ret != 1) printf("ret = %d\n", ret);}
  fclose(fp);
  sleep(2);  // wait for app 2 to modify data
  int *result = (int *)malloc(DSIZE*sizeof(int));
  cudaMemcpy(result, data, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
  if (!(*result)) printf("Fail!\n");
  else printf("Success!\n");
  system("rm testfifo");
  return 0;
}

app2.cu:

// app 2, part of a 2-part IPC example
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#define DSIZE 1

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__global__ void set_kernel(volatile int *d, int val){
  *d = val;
}

int main(){
  int *data;
  cudaIpcMemHandle_t my_handle;
  unsigned char handle_buffer[sizeof(my_handle)+1];
  memset(handle_buffer, 0, sizeof(my_handle)+1);
  FILE *fp;
  fp = fopen("testfifo", "r");
  if (fp == NULL) {printf("fifo open fail \n"); return 1;}
  int ret;
  for (int i = 0; i < sizeof(my_handle); i++){
    ret = fscanf(fp,"%c", handle_buffer+i);
    if (ret == EOF) printf("received EOF\n");
    else if (ret != 1) printf("fscanf returned %d\n", ret);}
  memcpy((unsigned char *)(&my_handle), handle_buffer, sizeof(my_handle));
  cudaIpcOpenMemHandle((void **)&data, my_handle, cudaIpcMemLazyEnablePeerAccess);
  cudaCheckErrors("IPC handle fail");
  set_kernel<<<1,1>>>(data, 1);
  cudaDeviceSynchronize();
  cudaCheckErrors("memset fail");
  return 0;
}

If you run app1, it will start up and then wait for app2 to start. Then, when you start app2, app1 will send the IPC handle to app2. app2 will use that handle to modify some memory allocated by app1, and then exit. app1 waits (sleeps) for a short period, then checks to see if the modification was made by app2.

Just a proof of concept.

@txbob, thank you so much for this code, you are indeed a CUDA meastro. I took the liberty of modifying your code so that app1 writes three floats to GPU memory:

//app 1, part of a 2-part IPC example                                                                                                                                                                                                                                                                                    
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#define DSIZE 3

#define cudaCheckErrors(msg) \
  do { \
  cudaError_t __err = cudaGetLastError(); \
  if (__err != cudaSuccess) { \
  fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
    msg, cudaGetErrorString(__err), \
          __FILE__, __LINE__); \
  fprintf(stderr, "*** FAILED - ABORTING\n"); \
  exit(1); \
  } \
  } while (0)

int main(){
  system("rm -f testfifo"); // remove any debris                                                                                                                                                                                                                                                                         
  int ret = mkfifo("testfifo", 0600); // create fifo                                                                                                                                                                                                                                                                     
  if (ret != 0) {printf("mkfifo error: %d\n",ret); return 1;}

  float h_nums[] = {1.1111, 2.2222, 3.141592654};
  float *data;
  cudaIpcMemHandle_t my_handle;
  cudaMalloc(&data, DSIZE*sizeof(float));
  cudaCheckErrors("malloc fail");
  //cudaMemset(data, 0, DSIZE*sizeof(int));                                                                                                                                                                                                                                                                              
  //cudaCheckErrors("memset fail");                                                                                                                                                                                                                                                                                      
  cudaMemcpy(data, h_nums, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  cudaCheckErrors("memcoy fail");
  cudaIpcGetMemHandle(&my_handle, data);
  unsigned char handle_buffer;
  memset(handle_buffer, 0, sizeof(my_handle)+1);
  memcpy(handle_buffer, (unsigned char *)(&my_handle), sizeof(my_handle));
  cudaCheckErrors("get IPC handle fail");
  FILE *fp;
  printf("waiting for app2\n");
  fp = fopen("testfifo", "w");
  if (fp == NULL) {printf("fifo open fail \n"); return 1;}
  for (int i=0; i < sizeof(my_handle); i++){
    ret = fprintf(fp,"%c", handle_buffer[i]);
    if (ret != 1) printf("ret = %d\n", ret);}
  fclose(fp);
  sleep(2); // wait for app 2 to modify data                                                                                                                                                                                                                                                                             
  float *result = (float *)malloc(DSIZE*sizeof(float));
  cudaMemcpy(result, data, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  if (!(*result)) printf("Fail!\n");
  else printf("Success!\n");
  system("rm testfifo");
  return 0;
}

and app2.cu:

//app 2, part of a 2-part IPC example                                                                                                          
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#define DSIZE 3

#define cudaCheckErrors(msg) \
  do { \
  cudaError_t __err = cudaGetLastError(); \
  if (__err != cudaSuccess) { \
  fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
    msg, cudaGetErrorString(__err), \
          __FILE__, __LINE__); \
  fprintf(stderr, "*** FAILED - ABORTING\n"); \
  exit(1); \
  } \
  } while (0)

int main(){
  float *data;
  float h_nums[DSIZE];

  cudaIpcMemHandle_t my_handle;
  unsigned char handle_buffer;
  memset(handle_buffer, 0, sizeof(my_handle)+1);
  FILE *fp;
  fp = fopen("testfifo", "r");
  if (fp == NULL) {printf("fifo open fail \n"); return 1;}
  int ret;
  for (int i = 0; i < sizeof(my_handle); i++){
    ret = fscanf(fp,"%c", handle_buffer+i);
    if (ret == EOF) printf("received EOF\n");
    else if (ret != 1) printf("fscanf returned %d\n", ret);}
  memcpy((unsigned char *)(&my_handle), handle_buffer, sizeof(my_handle));
  cudaIpcOpenMemHandle((void **)&data, my_handle, cudaIpcMemLazyEnablePeerAccess);
  cudaCheckErrors("IPC handle fail");
  //cudaMemset(data, 1, sizeof(float));                                                                                                        
  cudaMemcpy(h_nums, data, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  //cudaCheckErrors("memset fail");                                                                                                            
  cudaCheckErrors("memcopy fail");
  printf("values read from GPU memory : %f %f %f\n", h_nums[0], h_nums[1], h_nums[2]);
  cudaIpcCloseMemHandle(&data);
  cudaFree(&data);
  return 0;
}

At the risk of sounding greedy could you please help me adapt this code to use mmap()? Ideally I would like app1 to terminate instead of sleep and then have app2 get the floats from GPU memory. I am new to CUDA and C and could use as much help as possible. Thanks.

When app1 terminates, its GPU context is destroyed, including any allocations. I’m not sure your request makes sense (at least, I don’t understand it.)

The cuda simpleIPC example demonstrates how to use mmap() to share handles between processes:

http://docs.nvidia.com/cuda/cuda-samples/index.html#simpleipc

I have attached a file which contains two programs that use a shared memory segment to share a memory location. Since I have not found analogous functions for shmget(), shmat(), and shmctl() in CUDA (I have tried using Gdev, with no success), I have been trying find a way to emulate what these two programs do but with GPU memory. In the attached code, program 1 completely terminates before program 2 runs. But I see what you are saying about GPU contexts being destroyed when the spawning process terminates. I was hoping there was a way to preserve the cudaIpcMemHandle_t created in your app1 and somehow pass it to app2 even when app1 had terminated, but I guess this is impossible using a fifo file. Hope this elucidates what I am trying to do.
sharedmem.tar (20 KB)

The handle and the pointer both become invalid, when the owning process terminates. This has nothing to do with whether you use mmap(), shm, fifo, or some other mechanism for linux IPC. This is a CUDA statement.

Is still not possible to keep the shared memory available after the first program terminate?

Thanks.

You definitely need the first program to be a daemon or service process that keeps the GPU context and the mmapp’ed memory alive.

Note that I have updated my previously posted app1/app2 code in this thread to address a few errors that were brought to my attention.

One way to allow the first program to terminate using the trivial example above would be to have app2 “own” the necessary handles and allocations, and provide these handles to app1. app1 can then use those handles, and when it is finished, terminate. I’m not suggesting this covers all possible use cases, but it may be another way to think about the problem.