In my program, there are lots of data ready. To reduce data transfer cost, I want to transfer all required data to GPU mem at the very beginning. In the furture computation, I want to just find those data and do the computation directly. In the following, I will show the pseudo code (the real code is written in C not C++ )
the cuda code has been compiled into a lib file. I just need to call the function.
In the following, I want to compute the matrix-vector multipilication. Since the matrix is fixed and very large and the vector is random and small, I want to store the matrix at the very beginning such that in the furture, for different vector input, I only need to transfer the vector but not need to send the large matrix to device at each time.
======================================
main.c
extern void gpustore(float *mat, float *loc);
extern void compute(float *loc, float *vec);
main(){
float mat = (float)malloc(size…);
float vec = (float)malloc(size…);
float *loc;
initial(mat);//initial all the elements
initial(vec);
gpustore(mat, loc);
compute(loc, vec);
}
============================================
gpucompute.lib <==== gpu.cu
gpu.cu
extern “C” void gpustore(float *, float *);
extern “C” void compute(float *, float *);
void gpustore(float mat, float loc){
cudaMalloc((void) &loc, size_mat);
cudaMemcpy(loc, mat, size_mat, cudaMemcpyHostToDevice);
}
void compute(float *loc, float vec){
float d_vec;
cudaMalloc((void) &d_vec, size_vec);
cudaMemcpy(d_vec, vec, size_vec, cudaMemcpyHostToDevice);
kernel<<<>>>(loc, d_vec); // since I have stored the pointer to the matrix in device mem, I'm trying to use the loc to locate the matrix data.
}
===========================================
For the above codes, can I succeed locating the matrix data whic has been transfered to gpu mem not currently?