I am new to CUDA programming and hope to get some help here. I have a simple program that calls cublasDsyr in a loop and then transfers the final result back to the host. The time taken by the GPU->CPU cudaMemcpy seems to depend on how many times cublasDsyr was called, and I am really puzzled. With 1 iteration, cudaMemcpy takes about 710 us. Every added iteration increases the transfer time by about 75 us, so with 10 iterations it takes 1395 us to get the results back. If I put a loop around cudaMemcpy, the first transfer takes 1395 us but the subsequent ones take 623 us each, which is about what I expect for transferring a 512*512 matrix of doubles.
I have tried calling cudaMemcpy before starting the test, so that transferring the cublasDsyr results isn't the first GPU->CPU transfer, just the first one after calling cublasDsyr. I even tried calling cublasDsyr once, copying the results back, and only then starting the test loop (thinking that the first call to the cublasDsyr routine might do some one-time setup). It makes no difference. No matter what I do, if I run cublasDsyr and then call cudaMemcpy, I get the time overhead on the first transfer.
I can't imagine cublasDsyr having an effect on the memory transfer, so I must be missing something trivial. I will post the relevant pieces of the code here and would really appreciate any input. Thanks…
#define N 512
double *H_Matrix, *D_Matrix, *H_vec, *D_vec;
cudaMallocHost( (void**) &H_Matrix, N*N*sizeof(double));   // pinned host buffer for the matrix
H_vec = (double*) malloc(N*sizeof(double));                // pageable host buffer for the vector
cudaMalloc( (void**) &D_Matrix, N*N*sizeof(double));
cudaMalloc( (void**) &D_vec, N*sizeof(double));
//Copy to device
cudaMemcpy( D_Matrix, H_Matrix, N*N*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy( D_vec, H_vec, N*sizeof(double), cudaMemcpyHostToDevice);
//Rank-1 update; in the test this call is repeated "iteration" times in a loop
cublasDsyr('u', N, 1, D_vec, 1, D_Matrix, N);
//Get results back
cudaMemcpy( H_Matrix, D_Matrix, N*N*sizeof(double), cudaMemcpyDeviceToHost); //measure time for this transfer
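One way to check whether the extra time really belongs to the transfer is to time the same copy with and without an explicit cudaDeviceSynchronize() in front of it. The sketch below is only an illustration under stated assumptions: the kernel rank1_update and the helper time_copy_ms are hypothetical names that are not part of the code above, the kernel is a stand-in for cublasDsyr (assuming the cuBLAS call returns to the host before its GPU work finishes, as a kernel launch does), and the timer is a plain host clock. Error checking is left out for brevity.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

#define N 512

// Hypothetical stand-in for cublasDsyr: rank-1 update A += alpha * x * x^T
// on a column-major N x N matrix (updates the full matrix, not just one triangle).
__global__ void rank1_update(double *A, const double *x, double alpha, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n)
        A[col * n + row] += alpha * x[row] * x[col];
}

// Hypothetical helper: launches the kernel `iterations` times, optionally waits
// for the GPU to finish, then returns the host-side time (in ms) of one
// device-to-host cudaMemcpy of the matrix.
float time_copy_ms(int iterations, bool sync_first)
{
    size_t mat_bytes = (size_t)N * N * sizeof(double);
    size_t vec_bytes = (size_t)N * sizeof(double);
    double *h_matrix, *d_matrix, *d_vec;

    cudaMallocHost((void **)&h_matrix, mat_bytes);   // pinned host buffer
    cudaMalloc((void **)&d_matrix, mat_bytes);
    cudaMalloc((void **)&d_vec, vec_bytes);
    cudaMemset(d_matrix, 0, mat_bytes);
    cudaMemset(d_vec, 0, vec_bytes);

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    for (int i = 0; i < iterations; ++i)
        rank1_update<<<grid, block>>>(d_matrix, d_vec, 1.0, N);  // returns immediately

    if (sync_first)
        cudaDeviceSynchronize();   // wait for all queued kernels before starting the clock

    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpy(h_matrix, d_matrix, mat_bytes, cudaMemcpyDeviceToHost);
    auto t1 = std::chrono::steady_clock::now();

    cudaFree(d_vec);
    cudaFree(d_matrix);
    cudaFreeHost(h_matrix);
    return std::chrono::duration<float, std::milli>(t1 - t0).count();
}

int main()
{
    time_copy_ms(1, true);   // warm-up so context creation is not timed
    for (int iters = 1; iters <= 10; iters += 3) {
        printf("iterations=%2d  copy without sync: %.3f ms  copy after sync: %.3f ms\n",
               iters, time_copy_ms(iters, false), time_copy_ms(iters, true));
    }
    return 0;
}
```

If the "without sync" time grows with the iteration count while the "after sync" time stays flat, the extra time is the copy waiting for the queued GPU work rather than the transfer itself. The same check could be applied to the original program by calling cudaDeviceSynchronize() after the cublasDsyr loop and before starting the timer.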