Help: issue sharing a data array stored in GPU memory between two mex functions

Hi,

I want to perform the computations below:

  1. Sum of the elements of an array → through mexfunction #1
  2. Square each element of the array, then compute the sum of the squares → through mexfunction #2

To implement this logic, the matlab script below is used:

dataA = 0.5:1:9.5;   % 10 elements, matching NUM in the MEX sources
full_sum = sum(dataA);

% Mexfunction # 1 
mexcuda -v mexfn_sumdata.cu sum_data.cu 
mexfn_sumdata(dataA);

% Mexfunction # 2 
mexcuda -v mexfn_squaresumdata.cu 
mexfn_squaresumdata;

Related header file “sum_data.h”

void compute_add(double * const d_sum,                       
                 double const * const d_subimageF);

__global__ void Compute_Sum(double * const out,
                            double const * const in);
static double *Arg;

Mex function # 1: (mexfn_sumdata.cu)

#include "mex.h"
#include "cuda.h"
#include "sum_data.h"

const int NUM = 10;

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, mxArray const *prhs[])
{ 
   double *val;
   cudaMalloc ((void **)&Arg, NUM * sizeof(Arg[0]));

   val =  (double*) mxGetData(prhs[0]);
   cudaMemcpy(Arg, val, NUM*sizeof(double), cudaMemcpyHostToDevice);

   printf("\n Inside first mex function \n");

   for (int jj=0;jj<NUM;jj++){
      printf("\n test[%d]=%f\n",jj,val[jj]);
   }

   double *h_sum_new = (double*) malloc(1*sizeof(double));
   h_sum_new[0] = 0.0;

   double *d_sum;
   cudaMalloc ((void **)&d_sum, 1 * sizeof(double));

   cudaMemcpy(d_sum, h_sum_new, 1*sizeof(double), cudaMemcpyHostToDevice);  /* h_sum_new, not &h_sum_new */
   Compute_Sum <<<1,1>>> (d_sum,Arg);
   cudaDeviceSynchronize();

   cudaMemcpy(h_sum_new, d_sum, 1*sizeof(double), cudaMemcpyDeviceToHost);
   printf("\n Sum of sub-image[0] = %f \n",h_sum_new[0]);

   free(h_sum_new);
   cudaFree(d_sum);   /* Arg is left allocated on purpose */

}

CUDA source code: “sum_data.cu”

const int NUM = 10;

void __global__ Compute_Sum(double * const out1,   
                             double const * const in){
   for (int ii=0;ii<NUM;ii++){
      out1[0] = out1[0] + in[ii];
      printf("\n threadid = %d \t out1 after adding in[%d] = %f \n", threadIdx.x, ii, out1[0]);
   }
}

void compute_add(double * const d_sum,                       
                 double const * const d_subimageF)
{

   double *h_sum_new = (double*) malloc(1*sizeof(double));

   h_sum_new[0] = 0.0;

   cudaMemcpy(d_sum, h_sum_new, 1*sizeof(double), cudaMemcpyHostToDevice);  /* h_sum_new, not &h_sum_new */
   Compute_Sum <<<1,1>>> (d_sum,d_subimageF);
   cudaDeviceSynchronize();

   cudaMemcpy(h_sum_new, d_sum, 1*sizeof(double), cudaMemcpyDeviceToHost);
   printf("\n Sum = %f \n\n\n",h_sum_new[0]);
   free(h_sum_new);
}

Mex function # 2: (mexfn_squaresumdata.cu)

#include "mex.h"
#include "cuda.h"
#include "sum_data.h"


const int NUM = 10;

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, mxArray const *prhs[])
{
   double *val_new = (double*)malloc(NUM*sizeof(double));
   cudaError_t err = cudaMemcpy(val_new, Arg, NUM*sizeof(double), cudaMemcpyDeviceToHost);
   printf("\n cudaMemcpy status: %s \n", cudaGetErrorString(err));

   printf("\n Inside second mex function \n");
   for (int jj=0;jj<NUM;jj++){
      printf("\n test_new[%d]=%f\n",jj,val_new[jj]);
   }
   free(val_new);
}

Problem faced: the data stored in the pointer ‘Arg’ is not accessible from mexfunction #2. When printed, every element is zero:

Inside second mex function

test_new[0]=0.000000
...
test_new[8]=0.000000
test_new[9]=0.000000

Can anyone please suggest why the data is not accessed correctly?

Thanks.

The use of a static pointer won’t allow the data (or the allocation) to survive from one mexfunction to another: each MEX file is built as its own shared library, so each one gets its own copy of `Arg`, and the copy in mexfunction #2 is never initialized. Copy your data from mexfunction #1 back to a matlab variable, then pass that variable to the 2nd mexfunction.
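For illustration, that round trip might look like the fragments below. These are illustrative snippets, not complete MEX files, and assume `NUM` and `Arg` as defined in the code above:

```cuda
/* In mexfn_sumdata.cu: after the kernel runs, return the data to
   MATLAB through plhs instead of relying on the static pointer. */
plhs[0] = mxCreateDoubleMatrix(1, NUM, mxREAL);
cudaMemcpy(mxGetPr(plhs[0]), Arg, NUM * sizeof(double),
           cudaMemcpyDeviceToHost);

/* In mexfn_squaresumdata.cu: accept the array as an input (prhs[0])
   and upload it to the device again. */
double *d_data;
cudaMalloc((void **)&d_data, NUM * sizeof(double));
cudaMemcpy(d_data, mxGetPr(prhs[0]), NUM * sizeof(double),
           cudaMemcpyHostToDevice);
```

On the MATLAB side the script would then read something like `dataA_out = mexfn_sumdata(dataA); mexfn_squaresumdata(dataA_out);`.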

Hi txbob,

Thanks a lot for your response. I do not want to copy the data from mexfunction #1 back to matlab. Is there any other way for two mex functions, called at different times, to access a data array stored in GPU memory?

As you mentioned, a ‘static pointer’ does not work; can I use any other approach?

In ‘C’ programming, the ‘extern’ storage class enables global access across multiple files. Similarly, is it feasible to declare the variable ‘Arg’ as a global variable? Would such a declaration let the data array survive so that mexfunction #2 can access it? If yes, how should I declare it?

Looking forward to your suggestions.

Thanks.

Hi,

The reason why I prefer not to copy the data from mexfunction #1 back to matlab is that the example discussed above is a simplified version of the logic I would like to realize.

  1. In the actual scenario, I have 10 arguments, each with more than 10,000 elements, that need to be transferred to GPU memory through mexfunction #1.

  2. mexfunction #2 will be invoked multiple times within a loop, so transferring the data from the GPU back to matlab and then to the GPU again is not a preferable choice.

I have been facing this difficulty for quite some time. It would be helpful if anyone could share their ideas.

Thanks.

What about launching both kernels from a master kernel?

https://devblogs.nvidia.com/cuda-dynamic-parallelism-api-principles/
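For example, with dynamic parallelism both computations could be launched as children of one parent kernel, so the device array never has to outlive a single MEX call. A rough sketch (illustrative only; the kernel names other than Compute_Sum are made up, and this requires compute capability 3.5+ plus relocatable device code, e.g. nvcc -rdc=true -lcudadevrt):

```cuda
// Child kernel #1: sum of the elements (single thread, as in the thread above).
__global__ void Compute_Sum(double *out, const double *in, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += in[i];
    out[0] = s;
}

// Child kernel #2: sum of the squares of the elements.
__global__ void Compute_SquareSum(double *out, const double *in, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += in[i] * in[i];
    out[0] = s;
}

// "Master" kernel: launches both children from the device, so the
// device pointer d_data stays valid across both computations.
__global__ void Master(double *d_sum, double *d_sqsum,
                       const double *d_data, int n) {
    Compute_Sum<<<1, 1>>>(d_sum, d_data, n);
    Compute_SquareSum<<<1, 1>>>(d_sqsum, d_data, n);
    cudaDeviceSynchronize();  // wait for the child kernels to finish
}
```

A single MEX function would then upload the data once, launch `Master<<<1,1>>>`, and copy the two scalar results back, avoiding the GPU-to-matlab-to-GPU round trip entirely.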