How to pass out the pointer which is the address of data on GPU remain the data on GPU, using the p

soyvialyh · December 22, 2011, 4:27pm

Hi, everyone.

I met a problem in my project. There are some large volume data that will be used as input for several GPU programs. Because the transferring data from host to device is time consuming. I am thinking whether we can load the data to GPU one time and remain them on the GPU and pass out the pointer which is the address of the data out. When we want to reuse those data, we can just pass this pointer to the new GPU program and find them on GPU. Is this reasonable?

I wrote a test program about this thought. Because I use matlab on mac laptop. So I wrote a mex function. The GPU program is just to double the density of each pixel of an image. I copied the address of input data and the result data to a long type variable on CPU and pass out to matlab. But sometimes matlab will quit unexpectedly, sometimes the mexPrintf will print out: "the address of d_img_float on GPU is 0x4836866000921821184

the address of d_outimg_float on GPU is 0x4872894797949173760"

which looks not right.

The test program is attached here. I am not good at using pointer in C. So maybe there are some silly errors in it.

#include <string.h>

#include <math.h>

#include "mex.h"

#include "cutil.h"

void gpuAllocMemory(float** ptr_dev, int mem_size)

{

    cudaMalloc((void**)ptr_dev, mem_size);

}

void gpuTransferToDevice(float* ptr_dev, float* ptr_host,int mem_size)

{

    cudaMemcpy(ptr_dev, ptr_host, mem_size,cudaMemcpyHostToDevice);  // memory copy

}

void gpuTransferAddressToHost(long* ptr_host, long* ptr_dev,int mem_size)

{

    cudaMemcpy(ptr_host, ptr_dev, mem_size,cudaMemcpyDeviceToHost);  // memory copy

}

void gpuFreeMemory(float *ptr_dev)

{

    cudaFree(ptr_dev);

}

void copyDoubleToFloat(float *f, double *d, int size)

{

    int i;

    for (i=0;i<size;i++) *(f+i)=(float) *(d+i);

}

__global__ void doubleX_kernel( float *in_image , int cols, int rows, float* out_image )

{

    int x = blockIdx.x*blockDim.x+threadIdx.x;	 

    int y = blockIdx.y*blockDim.y+threadIdx.y;	

    out_image[y*cols+x] = 2*in_image[y*cols+x];

 }

void gpuDoubleX(float* d_img, int Cols, int Rows, float* d_outimg )

{

    dim3 dimBlock(8, 8, 1);

    dim3 dimGrid((Cols+ dimBlock.x -1) / dimBlock.x, (Rows+ dimBlock.y -1) / dimBlock.y, 1);	

	

	

	doubleX_kernel<<< dimGrid, dimBlock,0 >>>( d_img, Cols, Rows,d_outimg);

}

void mexFunction(int nlhs, mxArray *plhs[],int nrhs,const mxArray *prhs[])

{    

    double* h_img_double;

//    double* d_outimg_double;

    float* h_img_float;

    float* d_img_float;

    float* d_outimg_float;

    mwSize *dimsizes;  /* number of elements in each dimension*/

long* h_pointer;

    int Cols, Rows;

    h_img_double =  mxGetPr(prhs[0]);

    dimsizes=(mwSize *)mxGetDimensions(prhs[0]);

    Cols = dimsizes[0];    

    Rows = dimsizes[1];     

h_img_float  = (float*) mxCalloc(Cols*Rows,sizeof(float));

    copyDoubleToFloat(h_img_float,h_img_double,Cols*Rows);

gpuAllocMemory(&d_img_float, Cols*Rows*sizeof(float));

gpuAllocMemory(&d_outimg_float, Cols*Rows*sizeof(float));     

gpuTransferToDevice(d_img_float, h_img_float, Cols*Rows*sizeof(float));

gpuDoubleX(d_img_float,Cols,Rows,d_outimg_float);

h_pointer=(long*) mxCalloc(2,sizeof(long));

    plhs[0] = mxCreateDoubleMatrix(2, 1, mxREAL); //create an mxArray     

h_pointer =(long*) mxGetPr(plhs[0]);  

gpuTransferAddressToHost(h_pointer,(long*) &d_img_float[0],  sizeof(long));  // copy the address of input data to h_pointer

    gpuTransferAddressToHost(h_pointer+1,(long*) &d_outimg_float[0],  sizeof(long));  // copy the address of output data to h_pointer+1

mexPrintf("the address of d_img_float on GPU is 0x%ld\n",*h_pointer); 

    mexPrintf("the address of d_outimg_float on GPU is 0x%ld\n",*(h_pointer+1)); 

mxFree(h_img_float);

    gpuFreeMemory(d_img_float);

    gpuFreeMemory(d_outimg_float);

    mxFree(h_pointer);

}

Could you please help me to figure it out? Thank you so much.

A.Goude · December 22, 2011, 5:55pm

Hello

First a few comments about your pointers:

On the lines

gpuTransferAddressToHost(h_pointer,(long*) &d_img_float[0],  sizeof(long));  // copy the address of input data to h_pointer

gpuTransferAddressToHost(h_pointer+1,(long*) &d_outimg_float[0],  sizeof(long));  // copy the address of output data to h_pointer+1

you are not actually transferring the addresses of the pointers, but the values they are pointing to. You already have the addresses to the GPU variables in d_img_float and d_outimg_float, and there is no need to make any data transfers between GPU and CPU to obtain these.

Second, on the lines

mexPrintf("the address of d_img_float on GPU is 0x%ld\n",*h_pointer); 

mexPrintf("the address of d_outimg_float on GPU is 0x%ld\n",*(h_pointer+1));

you are not printing the pointer addresses, but what they point to, which, since they contains the values of d_img_float[0] and d_outimg_float[0], will not be valid addresses in the CPU, which likely causes segmentation fault. (If you want to print a pointer address, you can use %p in printf)

Third, h_pointer is allocated as the output of the function, but you are calling mxFree on it, basically making the output of the mex-function invalid. Hence, mxFree should be removed.

When it comes to the possibility to data on the GPU, I did try it with matlab and it did work for me as long as I used the same mex-function both times (but returning to matlab between). If you do not call cudaFree on an array, it will not automatically be deallocated when the mex-function returns. I can not guarantee that this is a reliable behavior though. Note that just sending the address to matlab will not work between different mex-functions, as the function tries to access data that does not belong to it. I get INVALID_ARGUMENT in this case.

I would also recommend you to check for errors after each cuda call.

Regards

soyvialyh · December 22, 2011, 6:23pm

Thank you so much for your timely reply.

Yes, I am indeed confused about how to transfer an address itself but not the data the pointer pointing to.

In the command:

cudaMemcpy(ptr_host,ptr_dev,mem_size,cudaMemcpyDeviceToHost);

it will transfer data which is pointed by the pointer ptr_dev to CPU and put them in memory with the address ptr_host.

But if I just want to transfer the value of this pointer ptr_dev, how can I do it?

You mean that I can put several GPU programs in the same mex-functions, so that I can directly use the pointer d_img_float on GPU? But the whole process cannot be put in the one mex-function, in this case ,we have to load the data again and again?

Thank you again for your reply.

Hello

First a few comments about your pointers:

On the lines
gpuTransferAddressToHost(h_pointer,(long*) &d_img_float[0],  sizeof(long));  // copy the address of input data to h_pointer

gpuTransferAddressToHost(h_pointer+1,(long*) &d_outimg_float[0],  sizeof(long));  // copy the address of output data to h_pointer+1
you are not actually transferring the addresses of the pointers, but the values they are pointing to. You already have the addresses to the GPU variables in d_img_float and d_outimg_float, and there is no need to make any data transfers between GPU and CPU to obtain these.

Second, on the lines
mexPrintf("the address of d_img_float on GPU is 0x%ld\n",*h_pointer); 

mexPrintf("the address of d_outimg_float on GPU is 0x%ld\n",*(h_pointer+1));
you are not printing the pointer addresses, but what they point to, which, since they contains the values of d_img_float[0] and d_outimg_float[0], will not be valid addresses in the CPU, which likely causes segmentation fault. (If you want to print a pointer address, you can use %p in printf)

Third, h_pointer is allocated as the output of the function, but you are calling mxFree on it, basically making the output of the mex-function invalid. Hence, mxFree should be removed.

When it comes to the possibility to data on the GPU, I did try it with matlab and it did work for me as long as I used the same mex-function both times (but returning to matlab between). If you do not call cudaFree on an array, it will not automatically be deallocated when the mex-function returns. I can not guarantee that this is a reliable behavior though. Note that just sending the address to matlab will not work between different mex-functions, as the function tries to access data that does not belong to it. I get INVALID_ARGUMENT in this case.

I would also recommend you to check for errors after each cuda call.

Regards

A.Goude · December 22, 2011, 6:40pm

You just use the pointer itself. For example

h_pointer[0]=(long)d_img_float

copies the address of the pointer to the output vector h_pointer. Remember that even though the data is on the GPU, the pointer to it is obtained from cudaMalloc, and is therefore stored in host memory.

For the second point, the best is if all data can be handled inside one mex-file, without even returning to Matlab. Even though it could work to save the pointer address in a double in matlab, it is not designed for this, and it is therefore hard to determine how reliable it is. Sending the address to a different mex function will not be as easy to get working.

You could take a look at the different GPU-support projects for matlab, which introduces GPU variables into the Matlab workspace. My guess is that these would be a better solution.

Regards

soyvialyh · December 22, 2011, 6:49pm

I see. So, we can use the pointer d_img_float on CPU directly but cannot get the data it points to on CPU.

Thank you for your answer and suggestion.

Could you do me a favor again to tell me where I can find some sample codes about GPU programming for matlab ?

You just use the pointer itself. For example
h_pointer[0]=(long)d_img_float
copies the address of the pointer to the output vector h_pointer. Remember that even though the data is on the GPU, the pointer to it is obtained from cudaMalloc, and is therefore stored in host memory.

For the second point, the best is if all data can be handled inside one mex-file, without even returning to Matlab. Even though it could work to save the pointer address in a double in matlab, it is not designed for this, and it is therefore hard to determine how reliable it is. Sending the address to a different mex function will not be as easy to get working.

You could take a look at the different GPU-support projects for matlab, which introduces GPU variables into the Matlab workspace. My guess is that these would be a better solution.

Regards

A.Goude · December 23, 2011, 10:38am

When it comes to GPU support for Matlab, you could look at e.g. Jacket (proprietary) or GPUmat (free)

I do not know any good source for cuda code for Matlab, but if you write your own cuda code, the differences between traditional applications and mex-functions are quite small. Once you have handled the interface with Matlab, you can basically write code the normal way. As an example, I am adding a code that performs vector addition, just to show how it can be done with a mex file. (Note that in this case, it would be much faster to make the sum directly on the CPU instead)

Once you get this working, look at any traditional source of information about writing cuda code.

#include "cuda.h"

#include "mex.h"

#include "matrix.h"

#define BLOCKCOUNT 500

#define THREADCOUNT 256

/*---------------------------------------------------------*/

void cudasafe(cudaError_t error) //handles errors

{

  if (error != cudaSuccess) {

    mexErrMsgTxt(cudaGetErrorString(error)); 

  }

}

/*---------------------------------------------------------*/

__global__ void vectoraddcuda(float* dest,float* source1,float* source2,int count)

{

  int i=blockDim.x * blockIdx.x + threadIdx.x;

  while(i<count) {

    dest[i]=source1[i]+source2[i];

    i+=blockDim.x*gridDim.x;

  }

}

/*---------------------------------------------------------*/

int imin(int x, int y)

{

  return x<y?x:y;

}

/*---------------------------------------------------------*/

void mexFunction(int nlhs,mxArray *plhs[],int nrhs,const mxArray *prhs[])

{

  int count,i,bc;

  float *cudadest, *cudasource1, *cudasource2, *dest, *source1, *source2;

  double* ddest, *dsource1, *dsource2;

//check the validity of the inputs

  if(nrhs<2)

    mexErrMsgTxt("Not enough input arguments");

  if(!mxIsDouble(prhs[0])||

     mxIsSparse(prhs[0])||

     mxGetNumberOfDimensions(prhs[0])!=2||

     !mxIsDouble(prhs[1])||

     mxIsSparse(prhs[1])||

     mxGetNumberOfDimensions(prhs[1])!=2)

    mexErrMsgTxt("Both input arguments must be two dimensional double matrices");

  count=mxGetNumberOfElements(prhs[0]);

  if(count!=mxGetNumberOfElements(prhs[1]))

    mexErrMsgTxt("Both input arguments must have same size");

//create output

  plhs[0]=mxCreateDoubleMatrix(mxGetM(prhs[0]), mxGetN(prhs[0]), mxREAL);

//get output and input pointers

  ddest=mxGetPr(plhs[0]);

  dsource1=mxGetPr(prhs[0]);

  dsource2=mxGetPr(prhs[1]);

//convert input to float

  source1=(float*)mxMalloc(count*sizeof(float));

  source2=(float*)mxMalloc(count*sizeof(float));

  dest=(float*)mxMalloc(count*sizeof(float));

  for(i=0;i<count;i++) {

    source1[i]=(float)dsource1[i];

    source2[i]=(float)dsource2[i];

  }

//allocate 

  cudasafe(cudaMalloc((void**)&cudasource1, count*sizeof(float)));

  cudasafe(cudaMalloc((void**)&cudasource2, count*sizeof(float)));

  cudasafe(cudaMalloc((void**)&cudadest, count*sizeof(float)));

//copy data to GPU

  cudasafe( cudaMemcpy(cudasource1, source1, count*sizeof(float), cudaMemcpyHostToDevice));

  cudasafe( cudaMemcpy(cudasource2, source2, count*sizeof(float), cudaMemcpyHostToDevice));

//launch kernel

  bc=imin(BLOCKCOUNT, (count+THREADCOUNT-1)/THREADCOUNT); //necessary number of blocks

  vectoraddcuda<<<bc, THREADCOUNT>>>(cudadest, cudasource1, cudasource2, count);

//copy data back to host

  cudasafe( cudaMemcpy(dest, cudadest, count*sizeof(float), cudaMemcpyDeviceToHost));

//convert back to double for matlab

  for(i=0;i<count;i++) {

    ddest[i]=(double)dest[i];

  }

//host cleanup

  mxFree(source1);

  mxFree(source2);

  mxFree(dest);

//gpu cleanup

  cudaFree(cudasource1);

  cudaFree(cudasource2);

  cudaFree(cudadest);

}

Regards

Topic		Replies	Views
matlab, mex files and loops CUDA Programming and Performance	12	11640	March 9, 2012
Keeping objects/pointers in graphic card memory CUDA Programming and Performance	12	8100	September 26, 2008
How do I pass a double pointers array to the device? I'm getting cudaErrorIllegalAddress CUDA Programming and Performance	12	3559	January 17, 2024
Passing gpu variables between mex functions CUDA Programming and Performance	7	7061	June 20, 2010
GPU Transfer problems GPU won't correctly read data out from Device to Host CUDA Programming and Performance	15	2634	August 2, 2010
pointer in global device memory CUDA Programming and Performance	9	11600	November 23, 2011
Calling CUDA functions from cpp files - scope of variables CUDA Programming and Performance	7	1249	July 18, 2014
Real-time GPU processing Peer 2 peer data copy, Linux kernel memory, kernels in kernel, CUDA Programming and Performance	35	8105	June 30, 2010
concurrent D2H+H2D transfers? CUDA Programming and Performance	5	2478	May 10, 2016
most general form for thread access? CUDA Programming and Performance	10	2531	February 21, 2010

How to pass out the pointer which is the address of data on GPU remain the data on GPU, using the p

Related topics