cublas routine and cudaMemcpy: GPU->CPU transfer speed affected by cublasDsyr call?!

Hi.
I am new to CUDA programming and hope to get some help here. I have a simple program using “cublasDsyr” in a loop, and then transferring the final result back. It seems like the timing of GPU->CPU “cudaMemcpy” is affected by the number of times that cublasDsyr is called. I am really puzzled. If iteration is 1, cudaMemcpy takes about 710 us. With every added iteration the transfer time increases by about 75us! So if iteration is 10 it takes 1395us to get the results back. If I put a loop around cudaMemcpy, the first transfer takes 1395us but the subsequent times are 623us, which is about what I expect for transfer of 512*512 matrix.

I have tried calling cudaMemcpy before starting the test, so transferring the cublasDsyr results isn’t the first GPU->CPU transfer, just the first after calling cublasDsyr. I even tried to call cublasDsyr once, do a memcpy of the results, and then start the test loop (thinking that the first call to the cublasDsyr routine may do something). But there is no difference. No matter what I do, if I run cublasDsyr and then call cudaMemcpy I get the time overhead on the first transfer.

I can’t imagine “cublasDsyr” having an effect on the memory, so I must be missing something trivial. I will post the relevant pieces of the code here and really appreciate any input. Thanks…

#define N 512
double *H_Matrix, *D_Matrix, *H_vec, *D_vec;

//allocate memory
cudaMallocHost( (void**) &H_Matrix, N*N*sizeof(double) );
H_vec = (double*) malloc( N*sizeof(double) );
cudaMalloc( (void**) &D_Matrix, N*N*sizeof(double) );
cudaMalloc( (void**) &D_vec, N*sizeof(double) );

//Copy to device
cudaMemcpy( D_Matrix, H_Matrix, N*N*sizeof(double), cudaMemcpyHostToDevice );
cudaMemcpy( D_vec, H_vec, N*sizeof(double), cudaMemcpyHostToDevice );

//Blas Loop
for( i = 0; i < Iteration; i++ )
    cublasDsyr( 'u', N, 1.0, D_vec, 1, D_Matrix, N );

//Get results back
cudaMemcpy( H_Matrix, D_Matrix, N*N*sizeof(double), cudaMemcpyDeviceToHost ); //measure time for this transfer

Calls to cublasDsyr() and similar functions are asynchronous: they return immediately, but the actual calculation is queued and runs on the GPU later. cudaMemcpy() must of course wait for these calculations to finish. So the more times you call cublasDsyr in the loop beforehand, the longer cudaMemcpy() has to wait.
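To see this, you can separate the pending Dsyr work from the copy itself by synchronizing before starting the clock, and time the copy with CUDA events. A minimal sketch, reusing `D_Matrix`, `H_Matrix`, `D_vec`, `N`, and `Iteration` from the code above:

```cuda
// Queue the asynchronous cublasDsyr calls as before
for (int i = 0; i < Iteration; i++)
    cublasDsyr('u', N, 1.0, D_vec, 1, D_Matrix, N);

// Drain the queue so the pending Dsyr work is not charged to the copy
cudaThreadSynchronize();

cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpy(H_Matrix, D_Matrix, N*N*sizeof(double), cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&ms, start, stop);  // elapsed time between the events, in ms
printf("d2h cost %f ms\n", ms);
```

With the synchronize in place, the measured time should be roughly constant regardless of Iteration, since only the transfer itself is inside the timed region.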

Try calling cudaThreadSynchronize() before starting the timer:

//Copy to device
	cudaMemcpy( D_Matrix, H_Matrix, N*N*sizeof(double), cudaMemcpyHostToDevice );
	cudaMemcpy( D_vec, H_vec, N*sizeof(double), cudaMemcpyHostToDevice );

//Blas Loop
	for( i = 0; i < Iteration; i++ )
		cublasDsyr('u', N, 1.0, D_vec, 1, D_Matrix, N);

	cudaThreadSynchronize();  // wait for all queued Dsyr calls to finish

	unsigned int timer;
	float naiveTime;
	cutCreateTimer(&timer);
	cutStartTimer(timer);

//Get results back
	cudaMemcpy( H_Matrix, D_Matrix, N*N*sizeof(double), cudaMemcpyDeviceToHost ); //measure time for this transfer

	cudaThreadSynchronize();
	cutStopTimer(timer);
	naiveTime = cutGetTimerValue(timer);
	printf("d2h cost %f ms\n", naiveTime );