How can I keep a device memory pointer so I can locate data in device memory later?

My program has a lot of data prepared up front. To reduce data transfer cost, I want to transfer all the required data to GPU memory at the very beginning. In later computations, I want to simply locate that data and compute on it directly. Below is the pseudo code (the real code is written in C, not C++).

The CUDA code has been compiled into a lib file; I just need to call its functions.

Below, I want to compute a matrix-vector multiplication. Since the matrix is fixed and very large while the vector is random and small, I want to store the matrix on the device once at the very beginning, so that for each new input vector I only need to transfer the vector and do not have to resend the large matrix.

======================================
main.c

extern void gpustore(float *mat, float **loc);
extern void compute(float *loc, float *vec);

int main(void){
float *mat = (float *)malloc(size…);
float *vec = (float *)malloc(size…);
float *loc; // will receive the device pointer to the matrix

 initial(mat); // initialize all the elements
 initial(vec);

 gpustore(mat, &loc); // pass the address of loc so gpustore can fill it in

 compute(loc, vec);

}

============================================
gpucompute.lib <==== gpu.cu
gpu.cu

extern "C" void gpustore(float *, float **);
extern "C" void compute(float *, float *);

void gpustore(float *mat, float **loc){
cudaMalloc((void **) loc, size_mat); // writes the device address into the caller's pointer
cudaMemcpy(*loc, mat, size_mat, cudaMemcpyHostToDevice);
}

void compute(float *loc, float *vec){
float *d_vec;
cudaMalloc((void **) &d_vec, size_vec);
cudaMemcpy(d_vec, vec, size_vec, cudaMemcpyHostToDevice);

   kernel<<<>>>(loc, d_vec); // loc is the device pointer saved by gpustore, so the kernel can read the matrix directly

cudaFree(d_vec); // release the per-call vector buffer; the matrix stays on the device
}

===========================================
With the code above, will I be able to locate the matrix data that was transferred to GPU memory earlier?

That looks like it should be fine (I do something similar in my code), though you should use cudaMallocHost instead of malloc for mat and vec, since page-locked memory is typically faster to transfer. Have you tried running it yet?
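
One detail worth checking with this pattern is how the device address gets back to main. C passes arguments by value, so a function like gpustore can only hand a newly allocated address to its caller if it receives the address of the caller's pointer (a float **). Below is a minimal host-only sketch of that out-parameter pattern. It is an illustration, not the real GPU code: the names store and dot are hypothetical, and malloc/memcpy stand in for cudaMalloc/cudaMemcpy so that it runs without a GPU.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for gpustore(): malloc/memcpy take the place of
   cudaMalloc/cudaMemcpy. loc is the ADDRESS of the caller's pointer, so the
   assignment to *loc is visible to the caller after the function returns. */
int store(const float *mat, size_t n, float **loc)
{
    assert(loc != NULL);
    *loc = (float *)malloc(n * sizeof(float));
    if (*loc == NULL)
        return -1;                      /* allocation failed */
    memcpy(*loc, mat, n * sizeof(float));
    return 0;
}

/* Hypothetical stand-in for compute(): reads the stored data through the
   saved pointer and returns its dot product with vec. */
float dot(const float *loc, const float *vec, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += loc[i] * vec[i];
    return sum;
}
```

A caller would declare float *loc, call store(mat, n, &loc) once, and then pass the same loc to dot for every new vec. If store took a plain float *loc instead, the assignment would only change the function's local copy and the caller's loc would stay uninitialized.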