Timing problem for data transfer and kernel execution: failing to get the partial times separately

Edwen · October 14, 2011, 2:11am

Hi:

I am trying to optimize one of my programs, and so I want to know how much time is spent on data transfer and how much on kernel execution. Part of the code is:

cutilSafeCall(cudaMemcpy(d_MDSPrice, MDS_Price, mem_size_MDS, cudaMemcpyHostToDevice));

calNAV(d_MDSPrice, d_ETFData, d_ETFIndexData, d_ETFIndexGPU); //the kernel function is called here;

cutilSafeCall(cudaMemcpy(ETFIndexGPU, d_ETFIndexGPU, mem_size_Index, cudaMemcpyDeviceToHost));

The data size to and from the host is around 200 KB, and I expect the data transfer time to be around 0.2 ms. The total GPU time is around 5 ms. However, when I used 3 different timers to read the partial times of the data transfers and the kernel execution, I found that the first data transfer (host to device) takes 0.2 ms, the kernel execution takes 0.02 ms, and the second data transfer (device to host) takes around 4.8 ms. This is obviously wrong. Is it because the timers don't work properly? I use the cutGetTimerValue() function to read the time. It seems that the main part of the kernel execution time (which should be the longest) is being added to the time for the second data transfer.

I don't have any problem when I replace my kernel with the one from the CUDA sample program “VectorAdd”. All the partial times look reasonable. So the problem must be caused by my kernel. My kernel is a bit complex, but it runs OK. The results match those I get from the CPU function.

Please advise. Thanks in advance.
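
One likely explanation, offered as a sketch rather than a confirmed diagnosis: kernel launches are asynchronous with respect to the host, so a host timer stopped right after the launch sees mostly the launch overhead, while the blocking cudaMemcpy that follows has to wait for the kernel to finish and so appears to absorb its run time. The example below shows one way to time the three stages separately with CUDA events. busyKernel, StageTimes, and timeStages are hypothetical names used only for illustration (they are not from the original program), and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel; it only gives the GPU some work.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Elapsed time of each stage in milliseconds.
struct StageTimes {
    float h2d_ms;
    float kernel_ms;
    float d2h_ms;
};

// Times host-to-device copy, kernel, and device-to-host copy separately.
// The events are recorded in the default stream, so the timestamps are taken
// on the GPU in issue order and do not depend on when the host call returns.
StageTimes timeStages(float *h_buf, int n)
{
    size_t bytes = n * sizeof(float);
    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t ev[4];
    for (int i = 0; i < 4; ++i)
        cudaEventCreate(&ev[i]);

    cudaEventRecord(ev[0], 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(ev[1], 0);
    busyKernel<<<(n + 255) / 256, 256>>>(d_buf, n);  // returns to the host right away
    cudaEventRecord(ev[2], 0);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(ev[3], 0);
    cudaEventSynchronize(ev[3]);  // wait until everything recorded above is done

    StageTimes t;
    cudaEventElapsedTime(&t.h2d_ms, ev[0], ev[1]);
    cudaEventElapsedTime(&t.kernel_ms, ev[1], ev[2]);
    cudaEventElapsedTime(&t.d2h_ms, ev[2], ev[3]);

    for (int i = 0; i < 4; ++i)
        cudaEventDestroy(ev[i]);
    cudaFree(d_buf);
    return t;
}
```

Calling timeStages(hostBuffer, n) from host code and printing the three fields gives the per-stage breakdown. If you prefer to keep the host timers, calling cudaDeviceSynchronize() (cudaThreadSynchronize() in older toolkits) after the kernel launch and before stopping the kernel timer should give a similar separation.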
