How can I keep a device memory pointer for locating data in device memory in the future?

uboat · July 30, 2010, 3:21am

In my program, a lot of data is ready up front. To reduce data transfer cost, I want to transfer all the required data to GPU memory at the very beginning. In later computations, I want to just locate that data and compute on it directly. Below is the pseudocode (the real code is written in C, not C++).

The CUDA code has been compiled into a lib file; I just need to call its functions.

I want to compute a matrix-vector multiplication. Since the matrix is fixed and very large while the vector is random and small, I want to store the matrix on the device at the very beginning, so that for each new input vector I only need to transfer the vector and do not have to resend the large matrix.

======================================
main.c

extern void gpustore(float *mat, float *loc);
extern void compute(float *loc, float *vec);

main(){
  float *mat = (float *)malloc(size…);
  float *vec = (float *)malloc(size…);
  float *loc;

  initial(mat); // initialize all the elements
  initial(vec);

  gpustore(mat, loc);

  compute(loc, vec);

}

============================================
gpucompute.lib <==== gpu.cu
gpu.cu

extern "C" void gpustore(float *, float *);
extern "C" void compute(float *, float *);

void gpustore(float *mat, float *loc){
  cudaMalloc((void **) &loc, size_mat);
  cudaMemcpy(loc, mat, size_mat, cudaMemcpyHostToDevice);
}

void compute(float *loc, float *vec){
  float *d_vec;
  cudaMalloc((void **) &d_vec, size_vec);
  cudaMemcpy(d_vec, vec, size_vec, cudaMemcpyHostToDevice);

  kernel<<<>>>(loc, d_vec); // since I have stored the pointer to the matrix in device mem, I'm trying to use loc to locate the matrix data.

}

===========================================
With the above code, will I be able to locate the matrix data that was transferred to GPU memory earlier, rather than at the time of the computation?

gthazmatt · July 30, 2010, 4:12am

That looks like it should be fine (I do something similar in my code), though you should use cudaMallocHost instead of malloc for mat and vec, as page-locked memory is much faster to transfer. Have you tried to run it yet?
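
One detail that may be worth checking, assuming the real code has the same signatures as the pseudocode: gpustore receives loc as a plain float *, so the address written by cudaMalloc would land in a local copy and main's loc would stay unset. A common way around this is to pass a pointer to the pointer. The following is only a host-side sketch of that pattern in plain C: fake_device_alloc, store_by_value, store_by_pointer and demo are hypothetical names, and malloc stands in for cudaMalloc so that it runs without a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for cudaMalloc so the sketch runs on the host:
   writes the new address through an out-parameter, returns 0 on success. */
static int fake_device_alloc(void **out, size_t bytes) {
    *out = malloc(bytes);
    return *out != NULL ? 0 : 1;
}

/* Shape of gpustore as posted: loc is a local copy of the caller's pointer. */
static void store_by_value(const float *mat, float *loc, size_t bytes) {
    void *p = NULL;
    if (fake_device_alloc(&p, bytes) != 0) return;
    loc = p;                  /* updates only the local copy */
    memcpy(loc, mat, bytes);
    free(loc);                /* freed here because the caller can never reach it */
}

/* Variant taking float **: the address is written back to the caller. */
static int store_by_pointer(const float *mat, float **loc, size_t bytes) {
    void *p = NULL;
    if (fake_device_alloc(&p, bytes) != 0) return 1;
    memcpy(p, mat, bytes);
    *loc = p;                 /* caller's pointer now holds the address */
    return 0;
}

/* Returns 1 if the caller ends up with a usable handle, 0 otherwise. */
static int demo(void) {
    float mat[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float *loc = NULL;

    store_by_value(mat, loc, sizeof mat);
    assert(loc == NULL);      /* caller's pointer is unchanged */

    if (store_by_pointer(mat, &loc, sizeof mat) != 0) return 0;
    int ok = (loc != NULL && loc[3] == 4.0f);
    free(loc);
    return ok;
}
```

In the CUDA version the equivalent change would presumably be gpustore(float *mat, float **loc) called as gpustore(mat, &loc), with cudaMalloc((void **) loc, size_mat) inside. The loc kept in main can then be handed to compute for every new vector, for as long as the allocation is not freed.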