How to pass out the pointer which is the address of data on GPU remain the data on GPU, using the p

Hi, everyone.

I met a problem in my project. There are some large volume data that will be used as input for several GPU programs. Because the transferring data from host to device is time consuming. I am thinking whether we can load the data to GPU one time and remain them on the GPU and pass out the pointer which is the address of the data out. When we want to reuse those data, we can just pass this pointer to the new GPU program and find them on GPU. Is this reasonable?

I wrote a test program about this thought. Because I use matlab on mac laptop. So I wrote a mex function. The GPU program is just to double the density of each pixel of an image. I copied the address of input data and the result data to a long type variable on CPU and pass out to matlab. But sometimes matlab will quit unexpectedly, sometimes the mexPrintf will print out: "the address of d_img_float on GPU is 0x4836866000921821184

the address of d_outimg_float on GPU is 0x4872894797949173760"

which looks not right.

The test program is attached here. I am not good at using pointer in C. So maybe there are some silly errors in it.

#include <string.h>

#include <math.h>

#include "mex.h"

#include "cutil.h"

void gpuAllocMemory(float** ptr_dev, int mem_size)

{

    cudaMalloc((void**)ptr_dev, mem_size);

}

void gpuTransferToDevice(float* ptr_dev, float* ptr_host,int mem_size)

{

    cudaMemcpy(ptr_dev, ptr_host, mem_size,cudaMemcpyHostToDevice);  // memory copy

}

void gpuTransferAddressToHost(long* ptr_host, long* ptr_dev,int mem_size)

{

    cudaMemcpy(ptr_host, ptr_dev, mem_size,cudaMemcpyDeviceToHost);  // memory copy

}

void gpuFreeMemory(float *ptr_dev)

{

    cudaFree(ptr_dev);

}

void copyDoubleToFloat(float *f, double *d, int size)

{

    int i;

    for (i=0;i<size;i++) *(f+i)=(float) *(d+i);

}

__global__ void doubleX_kernel( float *in_image , int cols, int rows, float* out_image )

{

    int x = blockIdx.x*blockDim.x+threadIdx.x;	 

    int y = blockIdx.y*blockDim.y+threadIdx.y;	

    out_image[y*cols+x] = 2*in_image[y*cols+x];

 }

void gpuDoubleX(float* d_img, int Cols, int Rows, float* d_outimg )

{

    dim3 dimBlock(8, 8, 1);

    dim3 dimGrid((Cols+ dimBlock.x -1) / dimBlock.x, (Rows+ dimBlock.y -1) / dimBlock.y, 1);	

	

	

	doubleX_kernel<<< dimGrid, dimBlock,0 >>>( d_img, Cols, Rows,d_outimg);

}

void mexFunction(int nlhs, mxArray *plhs[],int nrhs,const mxArray *prhs[])

{    

    double* h_img_double;

//    double* d_outimg_double;

    float* h_img_float;

    float* d_img_float;

    float* d_outimg_float;

    mwSize *dimsizes;  /* number of elements in each dimension*/

long* h_pointer;

    int Cols, Rows;

    h_img_double =  mxGetPr(prhs[0]);

    dimsizes=(mwSize *)mxGetDimensions(prhs[0]);

    Cols = dimsizes[0];    

    Rows = dimsizes[1];     

h_img_float  = (float*) mxCalloc(Cols*Rows,sizeof(float));

    copyDoubleToFloat(h_img_float,h_img_double,Cols*Rows);

gpuAllocMemory(&d_img_float, Cols*Rows*sizeof(float));

gpuAllocMemory(&d_outimg_float, Cols*Rows*sizeof(float));     

gpuTransferToDevice(d_img_float, h_img_float, Cols*Rows*sizeof(float));

gpuDoubleX(d_img_float,Cols,Rows,d_outimg_float);

h_pointer=(long*) mxCalloc(2,sizeof(long));

    plhs[0] = mxCreateDoubleMatrix(2, 1, mxREAL); //create an mxArray     

h_pointer =(long*) mxGetPr(plhs[0]);  

gpuTransferAddressToHost(h_pointer,(long*) &d_img_float[0],  sizeof(long));  // copy the address of input data to h_pointer

    gpuTransferAddressToHost(h_pointer+1,(long*) &d_outimg_float[0],  sizeof(long));  // copy the address of output data to h_pointer+1

mexPrintf("the address of d_img_float on GPU is 0x%ld\n",*h_pointer); 

    mexPrintf("the address of d_outimg_float on GPU is 0x%ld\n",*(h_pointer+1)); 

mxFree(h_img_float);

    gpuFreeMemory(d_img_float);

    gpuFreeMemory(d_outimg_float);

    mxFree(h_pointer);

}

Could you please help me to figure it out? Thank you so much.

Hello

First a few comments about your pointers:

On the lines

gpuTransferAddressToHost(h_pointer,(long*) &d_img_float[0],  sizeof(long));  // copy the address of input data to h_pointer

gpuTransferAddressToHost(h_pointer+1,(long*) &d_outimg_float[0],  sizeof(long));  // copy the address of output data to h_pointer+1

you are not actually transferring the addresses of the pointers, but the values they are pointing to. You already have the addresses to the GPU variables in d_img_float and d_outimg_float, and there is no need to make any data transfers between GPU and CPU to obtain these.

Second, on the lines

mexPrintf("the address of d_img_float on GPU is 0x%ld\n",*h_pointer); 

mexPrintf("the address of d_outimg_float on GPU is 0x%ld\n",*(h_pointer+1));

you are not printing the pointer addresses, but what they point to, which, since they contains the values of d_img_float[0] and d_outimg_float[0], will not be valid addresses in the CPU, which likely causes segmentation fault. (If you want to print a pointer address, you can use %p in printf)

Third, h_pointer is allocated as the output of the function, but you are calling mxFree on it, basically making the output of the mex-function invalid. Hence, mxFree should be removed.

When it comes to the possibility to data on the GPU, I did try it with matlab and it did work for me as long as I used the same mex-function both times (but returning to matlab between). If you do not call cudaFree on an array, it will not automatically be deallocated when the mex-function returns. I can not guarantee that this is a reliable behavior though. Note that just sending the address to matlab will not work between different mex-functions, as the function tries to access data that does not belong to it. I get INVALID_ARGUMENT in this case.

I would also recommend you to check for errors after each cuda call.

Regards

Thank you so much for your timely reply.

Yes, I am indeed confused about how to transfer an address itself but not the data the pointer pointing to.

In the command:

cudaMemcpy(ptr_host,ptr_dev,mem_size,cudaMemcpyDeviceToHost);

it will transfer data which is pointed by the pointer ptr_dev to CPU and put them in memory with the address ptr_host.

But if I just want to transfer the value of this pointer ptr_dev, how can I do it?

You mean that I can put several GPU programs in the same mex-functions, so that I can directly use the pointer d_img_float on GPU? But the whole process cannot be put in the one mex-function, in this case ,we have to load the data again and again?

Thank you again for your reply.

You just use the pointer itself. For example

h_pointer[0]=(long)d_img_float

copies the address of the pointer to the output vector h_pointer. Remember that even though the data is on the GPU, the pointer to it is obtained from cudaMalloc, and is therefore stored in host memory.

For the second point, the best is if all data can be handled inside one mex-file, without even returning to Matlab. Even though it could work to save the pointer address in a double in matlab, it is not designed for this, and it is therefore hard to determine how reliable it is. Sending the address to a different mex function will not be as easy to get working.

You could take a look at the different GPU-support projects for matlab, which introduces GPU variables into the Matlab workspace. My guess is that these would be a better solution.

Regards

I see. So, we can use the pointer d_img_float on CPU directly but cannot get the data it points to on CPU.

Thank you for your answer and suggestion.

Could you do me a favor again to tell me where I can find some sample codes about GPU programming for matlab ?

When it comes to GPU support for Matlab, you could look at e.g. Jacket (proprietary) or GPUmat (free)

I do not know any good source for cuda code for Matlab, but if you write your own cuda code, the differences between traditional applications and mex-functions are quite small. Once you have handled the interface with Matlab, you can basically write code the normal way. As an example, I am adding a code that performs vector addition, just to show how it can be done with a mex file. (Note that in this case, it would be much faster to make the sum directly on the CPU instead)

Once you get this working, look at any traditional source of information about writing cuda code.

#include "cuda.h"

#include "mex.h"

#include "matrix.h"

#define BLOCKCOUNT 500

#define THREADCOUNT 256

/*---------------------------------------------------------*/

void cudasafe(cudaError_t error) //handles errors

{

  if (error != cudaSuccess) {

    mexErrMsgTxt(cudaGetErrorString(error)); 

  }

}

/*---------------------------------------------------------*/

__global__ void vectoraddcuda(float* dest,float* source1,float* source2,int count)

{

  int i=blockDim.x * blockIdx.x + threadIdx.x;

  while(i<count) {

    dest[i]=source1[i]+source2[i];

    i+=blockDim.x*gridDim.x;

  }

}

/*---------------------------------------------------------*/

int imin(int x, int y)

{

  return x<y?x:y;

}

/*---------------------------------------------------------*/

void mexFunction(int nlhs,mxArray *plhs[],int nrhs,const mxArray *prhs[])

{

  int count,i,bc;

  float *cudadest, *cudasource1, *cudasource2, *dest, *source1, *source2;

  double* ddest, *dsource1, *dsource2;

//check the validity of the inputs

  if(nrhs<2)

    mexErrMsgTxt("Not enough input arguments");

  if(!mxIsDouble(prhs[0])||

     mxIsSparse(prhs[0])||

     mxGetNumberOfDimensions(prhs[0])!=2||

     !mxIsDouble(prhs[1])||

     mxIsSparse(prhs[1])||

     mxGetNumberOfDimensions(prhs[1])!=2)

    mexErrMsgTxt("Both input arguments must be two dimensional double matrices");

  count=mxGetNumberOfElements(prhs[0]);

  if(count!=mxGetNumberOfElements(prhs[1]))

    mexErrMsgTxt("Both input arguments must have same size");

//create output

  plhs[0]=mxCreateDoubleMatrix(mxGetM(prhs[0]), mxGetN(prhs[0]), mxREAL);

//get output and input pointers

  ddest=mxGetPr(plhs[0]);

  dsource1=mxGetPr(prhs[0]);

  dsource2=mxGetPr(prhs[1]);

//convert input to float

  source1=(float*)mxMalloc(count*sizeof(float));

  source2=(float*)mxMalloc(count*sizeof(float));

  dest=(float*)mxMalloc(count*sizeof(float));

  for(i=0;i<count;i++) {

    source1[i]=(float)dsource1[i];

    source2[i]=(float)dsource2[i];

  }

//allocate 

  cudasafe(cudaMalloc((void**)&cudasource1, count*sizeof(float)));

  cudasafe(cudaMalloc((void**)&cudasource2, count*sizeof(float)));

  cudasafe(cudaMalloc((void**)&cudadest, count*sizeof(float)));

//copy data to GPU

  cudasafe( cudaMemcpy(cudasource1, source1, count*sizeof(float), cudaMemcpyHostToDevice));

  cudasafe( cudaMemcpy(cudasource2, source2, count*sizeof(float), cudaMemcpyHostToDevice));

//launch kernel

  bc=imin(BLOCKCOUNT, (count+THREADCOUNT-1)/THREADCOUNT); //necessary number of blocks

  vectoraddcuda<<<bc, THREADCOUNT>>>(cudadest, cudasource1, cudasource2, count);

//copy data back to host

  cudasafe( cudaMemcpy(dest, cudadest, count*sizeof(float), cudaMemcpyDeviceToHost));

//convert back to double for matlab

  for(i=0;i<count;i++) {

    ddest[i]=(double)dest[i];

  }

//host cleanup

  mxFree(source1);

  mxFree(source2);

  mxFree(dest);

//gpu cleanup

  cudaFree(cudasource1);

  cudaFree(cudasource2);

  cudaFree(cudadest);

}

Regards