cublas routine and cudaMemcpy: GPU->CPU transfer speed affected by cublasDsyr call?!

Hi.
I am new to CUDA programming and hope to get some help here. I have a simple program using “cublasDsyr” in a loop, and then transferring the final result back. It seems like the timing of GPU->CPU “cudaMemcpy” is affected by the number of times that cublasDsyr is called. I am really puzzled. If iteration is 1, cudaMemcpy takes about 710 us. With every added iteration the transfer time increases by about 75us! So if iteration is 10 it takes 1395us to get the results back. If I put a loop around cudaMemcpy, the first transfer takes 1395us but the subsequent times are 623us, which is about what I expect for transfer of 512*512 matrix.

I have tried calling cudaMemcpy before starting the test, so transferring the cublasDsyr results isn’t the first GPU->CPU transfer, just the first after calling cublasDsyr. I even tried to call cublasDsyr once, do a memcpy of the results, and then start the test loop (thinking that the first call to the cublasDsyr routine may do something). But there is no difference. No matter what I do, if I run cublasDsyr and then call cudaMemcpy I get the time overhead on the first transfer.

I can’t imagine “cublasDsyr” having an effect on the memory, so I must be missing something trivial. I will post the relevant pieces of the code here and really appreciate any input. Thanks…

#define N 512
double *H_Matrix, *D_Matrix, *H_vec, *D_vec;

//allocate memory
cudaMallocHost( (void**) &H_Matrix, N*N*sizeof(double) );
H_vec = (double*) malloc( N*sizeof(double) );
cudaMalloc( (void**) &D_Matrix, N*N*sizeof(double) );
cudaMalloc( (void**) &D_vec, N*sizeof(double) );

//Copy to device
cudaMemcpy( D_Matrix, H_Matrix, N*N*sizeof(double), cudaMemcpyHostToDevice );
cudaMemcpy( D_vec, H_vec, N*sizeof(double), cudaMemcpyHostToDevice );

//Blas Loop
for( i = 0; i < Iteration; i++ )
    cublasDsyr( 'u', N, 1.0, D_vec, 1, D_Matrix, N );

//Get results back
cudaMemcpy( H_Matrix, D_Matrix, N*N*sizeof(double), cudaMemcpyDeviceToHost ); //measure time for this transfer

Calls to cublasDsyr() and similar functions are asynchronous: they return immediately, but the actual calculation is queued and runs on the GPU later. cudaMemcpy() must of course wait for these calculations to finish. So the more times you call cublasDsyr in the loop beforehand, the longer cudaMemcpy() has to wait.
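To see this, you can separate the pending Dsyr work from the copy itself by synchronizing before starting the clock, and time the copy with CUDA events. A minimal sketch, reusing `D_Matrix`, `H_Matrix`, `D_vec`, `N`, and `Iteration` from the code above:

```cuda
// Queue the asynchronous cublasDsyr calls as before
for (int i = 0; i < Iteration; i++)
    cublasDsyr('u', N, 1.0, D_vec, 1, D_Matrix, N);

// Drain the queue so the pending Dsyr work is not charged to the copy
cudaThreadSynchronize();

cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpy(H_Matrix, D_Matrix, N*N*sizeof(double), cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&ms, start, stop);  // elapsed time between the events, in ms
printf("d2h cost %f ms\n", ms);
```

With the synchronize in place, the measured time should be roughly constant regardless of Iteration, since only the transfer itself is inside the timed region.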

Try calling cudaThreadSynchronize() before starting the timer:

//Copy to device
	cudaMemcpy( D_Matrix, H_Matrix, N*N*sizeof(double), cudaMemcpyHostToDevice );
	cudaMemcpy( D_vec, H_vec, N*sizeof(double), cudaMemcpyHostToDevice );

//Blas Loop
	for( i = 0; i < Iteration; i++ )
		cublasDsyr('u', N, 1.0, D_vec, 1, D_Matrix, N);

	cudaThreadSynchronize();  // wait for all queued Dsyr calls to finish

	unsigned int timer;
	float naiveTime;
	cutCreateTimer(&timer);
	cutStartTimer(timer);

//Get results back
	cudaMemcpy( H_Matrix, D_Matrix, N*N*sizeof(double), cudaMemcpyDeviceToHost ); //measure time for this transfer

	cudaThreadSynchronize();
	cutStopTimer(timer);
	naiveTime = cutGetTimerValue(timer);
	printf("d2h cost %f ms\n", naiveTime );