GPU Inter-Process Communications(IPC) question

gpu_control · December 9, 2014, 8:04pm

I am trying to write a program that writes a few values to an array on the GPU and then have another completely different program go into GPU global memory and retrieve the values written by program no. 1. On host memory this is fairly straight forward by using shmget() and similar functions and creating a shared memory segment. Does anyone know if this is possible with GPU memory? So far I have tried using some of the Runtime API IPC calls. I have been able to write one program that uses cudaMalloc() to allocate the array and then uses cudaMemcpy to write the array values to the GPU. After this I use cudaIpcGetMemHandle() with the pointer returned by cudaMalloc to get a cudaIpcMemHandle_t type. How do I then uses this type to pass to program no. 2 to read the values written by program 1? Does anyone have any bright ideas? I have looked at the IPC example in the sample codes, but this doesn’t answer my question on how to pass the cudaIpcMemHandle_t type between programs. Thanks.

Robert_Crovella · December 9, 2014, 8:19pm

unix domain sockets

pipes/fifos

mmap()

And if you create one process as a child of the other process, I think that presents other possibilities as well.

This seems to be a good resource:

[url]Advancedlinuxprogramming.com

The CUDA IPC sample code may also be of interest:

[url]http://docs.nvidia.com/cuda/cuda-samples/index.html#simpleipc[/url]

And the CUDA runtime IPC API:

[url]CUDA Runtime API :: CUDA Toolkit Documentation

little_jimmy · December 10, 2014, 8:31am

without subtracting anything from what cuda maestro txbob noted;

i would prefer something ‘childish’ as it would naturally yield the level of organization/ ‘loose synchronization’ you likely would require:

a) 2 generally needs to know that 1 has already allocated the array, and has completed writing to it
b) something needs to clean up - it would likely be 1. 1 then needs to know 2 has completed reading

if child threads share the same process memory space, i do not see why you can not simply pass a pointer to the array (to children), or (a pointer to) a structure containing such data; the same would apply to shared memory (between the processes); if 1 receives a pointer on allocation, and is able to access the array via the pointer, then 2 should be able to access the same array, when passed (a copy of) the pointer

to date, the assumption seems to be that 1 and 2 are on the same machine/ node

Robert_Crovella · December 11, 2014, 1:21am

The cuda IPC sample code demonstrates the use of mmap() to pass IPC handles between processes.

For amusement purposes, I tried implementing the fifo method (named pipe) that I referenced above, to demonstrate a different approach. The following is the code of the two independent applications:

app1.cu:

// app 1, part of a 2-part IPC example
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#define DSIZE 1

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

int main(){
  system("rm -f testfifo");  // remove any debris
  int ret = mkfifo("testfifo", 0600); // create fifo
  if (ret != 0) {printf("mkfifo error: %d\n",ret); return 1;}
  int *data;
  cudaMalloc(&data, DSIZE*sizeof(int));
  cudaCheckErrors("malloc fail");
  cudaMemset(data, 0, DSIZE*sizeof(int));
  cudaCheckErrors("memset fail");
  cudaIpcMemHandle_t my_handle;
  cudaIpcGetMemHandle(&my_handle, data);
  unsigned char handle_buffer[sizeof(my_handle)+1];
  memset(handle_buffer, 0, sizeof(my_handle)+1);
  memcpy(handle_buffer, (unsigned char *)(&my_handle), sizeof(my_handle));
  cudaCheckErrors("get IPC handle fail");
  FILE *fp;
  printf("waiting for app2\n");
  fp = fopen("testfifo", "w");
  if (fp == NULL) {printf("fifo open fail \n"); return 1;}
  for (int i=0; i < sizeof(my_handle); i++){
    ret = fprintf(fp,"%c", handle_buffer[i]);
    if (ret != 1) printf("ret = %d\n", ret);}
  fclose(fp);
  sleep(2);  // wait for app 2 to modify data
  int *result = (int *)malloc(DSIZE*sizeof(int));
  cudaMemcpy(result, data, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
  if (!(*result)) printf("Fail!\n");
  else printf("Success!\n");
  system("rm testfifo");
  return 0;
}

app2.cu:

// app 2, part of a 2-part IPC example
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#define DSIZE 1

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__global__ void set_kernel(volatile int *d, int val){
  *d = val;
}

int main(){
  int *data;
  cudaIpcMemHandle_t my_handle;
  unsigned char handle_buffer[sizeof(my_handle)+1];
  memset(handle_buffer, 0, sizeof(my_handle)+1);
  FILE *fp;
  fp = fopen("testfifo", "r");
  if (fp == NULL) {printf("fifo open fail \n"); return 1;}
  int ret;
  for (int i = 0; i < sizeof(my_handle); i++){
    ret = fscanf(fp,"%c", handle_buffer+i);
    if (ret == EOF) printf("received EOF\n");
    else if (ret != 1) printf("fscanf returned %d\n", ret);}
  memcpy((unsigned char *)(&my_handle), handle_buffer, sizeof(my_handle));
  cudaIpcOpenMemHandle((void **)&data, my_handle, cudaIpcMemLazyEnablePeerAccess);
  cudaCheckErrors("IPC handle fail");
  set_kernel<<<1,1>>>(data, 1);
  cudaDeviceSynchronize();
  cudaCheckErrors("memset fail");
  return 0;
}

If you run app1, it will start up and then wait for app2 to start. Then, when you start app2, app1 will send the IPC handle to app2. app2 will use that handle to modify some memory allocated by app1, and then exit. app1 waits (sleeps) for a short period, then checks to see if the modification was made by app2.

Just a proof of concept.

gpu_control · December 11, 2014, 6:19pm

@txbob, thank you so much for this code, you are indeed a CUDA meastro. I took the liberty of modifying your code so that app1 writes three floats to GPU memory:

//app 1, part of a 2-part IPC example                                                                                                                                                                                                                                                                                    
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#define DSIZE 3

#define cudaCheckErrors(msg) \
  do { \
  cudaError_t __err = cudaGetLastError(); \
  if (__err != cudaSuccess) { \
  fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
    msg, cudaGetErrorString(__err), \
          __FILE__, __LINE__); \
  fprintf(stderr, "*** FAILED - ABORTING\n"); \
  exit(1); \
  } \
  } while (0)

int main(){
  system("rm -f testfifo"); // remove any debris                                                                                                                                                                                                                                                                         
  int ret = mkfifo("testfifo", 0600); // create fifo                                                                                                                                                                                                                                                                     
  if (ret != 0) {printf("mkfifo error: %d\n",ret); return 1;}

  float h_nums[] = {1.1111, 2.2222, 3.141592654};
  float *data;
  cudaIpcMemHandle_t my_handle;
  cudaMalloc(&data, DSIZE*sizeof(float));
  cudaCheckErrors("malloc fail");
  //cudaMemset(data, 0, DSIZE*sizeof(int));                                                                                                                                                                                                                                                                              
  //cudaCheckErrors("memset fail");                                                                                                                                                                                                                                                                                      
  cudaMemcpy(data, h_nums, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  cudaCheckErrors("memcoy fail");
  cudaIpcGetMemHandle(&my_handle, data);
  unsigned char handle_buffer;
  memset(handle_buffer, 0, sizeof(my_handle)+1);
  memcpy(handle_buffer, (unsigned char *)(&my_handle), sizeof(my_handle));
  cudaCheckErrors("get IPC handle fail");
  FILE *fp;
  printf("waiting for app2\n");
  fp = fopen("testfifo", "w");
  if (fp == NULL) {printf("fifo open fail \n"); return 1;}
  for (int i=0; i < sizeof(my_handle); i++){
    ret = fprintf(fp,"%c", handle_buffer[i]);
    if (ret != 1) printf("ret = %d\n", ret);}
  fclose(fp);
  sleep(2); // wait for app 2 to modify data                                                                                                                                                                                                                                                                             
  float *result = (float *)malloc(DSIZE*sizeof(float));
  cudaMemcpy(result, data, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  if (!(*result)) printf("Fail!\n");
  else printf("Success!\n");
  system("rm testfifo");
  return 0;
}

and app2.cu:

//app 2, part of a 2-part IPC example                                                                                                          
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#define DSIZE 3

#define cudaCheckErrors(msg) \
  do { \
  cudaError_t __err = cudaGetLastError(); \
  if (__err != cudaSuccess) { \
  fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
    msg, cudaGetErrorString(__err), \
          __FILE__, __LINE__); \
  fprintf(stderr, "*** FAILED - ABORTING\n"); \
  exit(1); \
  } \
  } while (0)

int main(){
  float *data;
  float h_nums[DSIZE];

  cudaIpcMemHandle_t my_handle;
  unsigned char handle_buffer;
  memset(handle_buffer, 0, sizeof(my_handle)+1);
  FILE *fp;
  fp = fopen("testfifo", "r");
  if (fp == NULL) {printf("fifo open fail \n"); return 1;}
  int ret;
  for (int i = 0; i < sizeof(my_handle); i++){
    ret = fscanf(fp,"%c", handle_buffer+i);
    if (ret == EOF) printf("received EOF\n");
    else if (ret != 1) printf("fscanf returned %d\n", ret);}
  memcpy((unsigned char *)(&my_handle), handle_buffer, sizeof(my_handle));
  cudaIpcOpenMemHandle((void **)&data, my_handle, cudaIpcMemLazyEnablePeerAccess);
  cudaCheckErrors("IPC handle fail");
  //cudaMemset(data, 1, sizeof(float));                                                                                                        
  cudaMemcpy(h_nums, data, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  //cudaCheckErrors("memset fail");                                                                                                            
  cudaCheckErrors("memcopy fail");
  printf("values read from GPU memory : %f %f %f\n", h_nums[0], h_nums[1], h_nums[2]);
  cudaIpcCloseMemHandle(&data);
  cudaFree(&data);
  return 0;
}

At the risk of sounding greedy could you please help me adapt this code to use mmap()? Ideally I would like app1 to terminate instead of sleep and then have app2 get the floats from GPU memory. I am new to CUDA and C and could use as much help as possible. Thanks.

Robert_Crovella · December 11, 2014, 6:38pm

When app1 terminates, its GPU context is destroyed, including any allocations. I’m not sure your request makes sense (at least, I don’t understand it.)

The cuda simpleIPC example demonstrates how to use mmap() to share handles between processes:

[url]http://docs.nvidia.com/cuda/cuda-samples/index.html#simpleipc[/url]

gpu_control · December 11, 2014, 8:57pm

I have attached a file which contains two programs that use a shared memory segment to share a memory location. Since I have not found analogous functions for shmget(), shmat(), and shmctl() in CUDA (I have tried using Gdev, with no success), I have been trying find a way to emulate what these two programs do but with GPU memory. In the attached code, program 1 completely terminates before program 2 runs. But I see what you are saying about GPU contexts being destroyed when the spawning process terminates. I was hoping there was a way to preserve the cudaIpcMemHandle_t created in your app1 and somehow pass it to app2 even when app1 had terminated, but I guess this is impossible using a fifo file. Hope this elucidates what I am trying to do.
sharedmem.tar (20 KB)

Robert_Crovella · December 11, 2014, 11:56pm

The handle and the pointer both become invalid, when the owning process terminates. This has nothing to do with whether you use mmap(), shm, fifo, or some other mechanism for linux IPC. This is a CUDA statement.

gpu_question · October 2, 2018, 3:40pm

Is still not possible to keep the shared memory available after the first program terminate?

Thanks.

cbuchner1 · October 4, 2018, 8:29am

You definitely need the first program to be a daemon or service process that keeps the GPU context and the mmapp’ed memory alive.

Robert_Crovella · June 4, 2020, 5:27pm

Note that I have updated my previously posted app1/app2 code in this thread to address a few errors that were brought to my attention.

Robert_Crovella · August 25, 2020, 4:38pm

One way to allow the first program to terminate using the trivial example above would be to have app2 “own” the necessary handles and allocations, and provide these handles to app1. app1 can then use those handles, and when it is finished, terminate. I’m not suggesting this covers all possible use cases, but it may be another way to think about the problem.

pborhanuddin · January 4, 2023, 6:36pm

Hello Robert, sorry for commenting such an old post. I wanted to try the example code you posted avobe. I always get “Operation not supported” for the API cudaIpcGetMemHandle! My nvcc version as follows:

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_28_22:34:44_PST_2021
Cuda compilation tools, release 10.2, V10.2.300
Build cuda_10.2_r440.TC440_70.29663091_0

Any thoughts what I can check/do to this supported? It’s a NVidia Jetson Xabier.

Thank you.

Robert_Crovella · January 4, 2023, 6:50pm

CUDA (memory-handle) IPC is not supported on Jetson. Also see here.

Topic		Replies	Views
How to access gpu memory between processes CUDA Programming and Performance	10	2656	August 4, 2023
How to improve the performance of using CUDA IPC shared memory? CUDA Programming and Performance cuda	5	209	October 23, 2024
Problem with IPC CUDA Programming and Performance	10	3432	May 27, 2020
Real-time GPU processing Peer 2 peer data copy, Linux kernel memory, kernels in kernel, CUDA Programming and Performance	35	8103	June 30, 2010
'invalid device ordinal' (cudaErrorInvalidDevice) CUDA Programming and Performance	6	5612	August 25, 2015
Got out of memory from cudaMemcpy CUDA Programming and Performance	13	4011	January 28, 2022
CUDA shared memory CNN (convolutional neural network) CUDA Programming and Performance	7	2415	July 21, 2017
Wishlist Place your considered suggestions here CUDA Programming and Performance	201	204317	April 13, 2009
cudaMalloc performance issue after p2p access is enabled CUDA Programming and Performance	10	365	June 5, 2024
Multiple GPUs Devise a synchro mechanism for host threads CUDA Programming and Performance	7	4199	May 13, 2010

GPU Inter-Process Communications(IPC) question

nvcc --version

Related topics