Timing problem: cannot measure data transfer and kernel execution times separately


I am trying to optimize one of my programs, and so I want to know how much time is spent on data transfer and how much on kernel execution. Part of the code is:

cutilSafeCall(cudaMemcpy(d_MDSPrice, MDS_Price, mem_size_MDS, cudaMemcpyHostToDevice));

calNAV(d_MDSPrice, d_ETFData, d_ETFIndexData, d_ETFIndexGPU); //the kernel function is called here;

cutilSafeCall(cudaMemcpy(ETFIndexGPU, d_ETFIndexGPU, mem_size_Index, cudaMemcpyDeviceToHost));

The data transferred to and from the host is around 200 KB, and I expect the data transfer time to be around 0.2 ms. The total GPU time is around 5 ms. However, when I used three separate timers to measure the two data transfers and the kernel execution, I got 0.2 ms for the first data transfer (host to device), 0.02 ms for the kernel execution, and around 4.8 ms for the second data transfer (device to host). This is obviously wrong. Is it because the timers don't work properly? I use the cutGetTimerValue() function to read the time. It seems that most of the kernel execution time (which should be the longest part) is being added to the time for the second data transfer.
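One thing worth ruling out: kernel launches are asynchronous with respect to the host, so a host-side timer stopped right after the launch may only capture the launch overhead, while the following blocking cudaMemcpy waits for the kernel to finish and absorbs its run time. The sketch below shows one way to separate the three phases using CUDA events instead of host timers. It is only an illustration under assumptions: dummyKernel, the buffer names, and the launch configuration are hypothetical stand-ins (the real calNAV kernel is not shown here), and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the real calNAV kernel is not shown in the post.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        // Artificial work so the kernel takes a measurable amount of time.
        for (int k = 0; k < 10000; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

int main()
{
    const int n = 50 * 1024;                  // 50K floats = 200 KB, as in the post
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = NULL;
    cudaMalloc((void **)&d, bytes);

    // Four events mark the boundaries of the three phases.
    cudaEvent_t e0, e1, e2, e3;
    cudaEventCreate(&e0);
    cudaEventCreate(&e1);
    cudaEventCreate(&e2);
    cudaEventCreate(&e3);

    cudaEventRecord(e0, 0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(e1, 0);
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);   // returns to the host immediately
    cudaEventRecord(e2, 0);                        // recorded on the GPU after the kernel finishes
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(e3, 0);
    cudaEventSynchronize(e3);                      // wait until all recorded work is done

    float tH2D = 0.0f, tKernel = 0.0f, tD2H = 0.0f;
    cudaEventElapsedTime(&tH2D, e0, e1);           // results are in milliseconds
    cudaEventElapsedTime(&tKernel, e1, e2);
    cudaEventElapsedTime(&tD2H, e2, e3);
    printf("H2D: %.3f ms, kernel: %.3f ms, D2H: %.3f ms\n", tH2D, tKernel, tD2H);

    cudaEventDestroy(e0);
    cudaEventDestroy(e1);
    cudaEventDestroy(e2);
    cudaEventDestroy(e3);
    cudaFree(d);
    free(h);
    return 0;
}
```

If the host timers are kept instead, calling cudaDeviceSynchronize() (cudaThreadSynchronize() in older toolkits) after the kernel launch and before stopping the kernel timer should have a similar effect, since the host then waits for the kernel to complete before the timer is read.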

I don't have any problem when I replace my kernel with the one from the CUDA sample program "VectorAdd". All the partial times look reasonable. So the problem must be caused by my kernel. My kernel is a bit complex, but it runs OK. The results match those I got from the CPU function.

Please advise. Thanks in advance.