I just ran into a weird problem, and it is a little hard to describe.
The step right after a certain kernel becomes very slow, no matter what that step is.
For example:
cudaMemcpy data from host
Kernel<<<grid,threads>>>(......)
cudaMemcpy result1 to host
cudaMemcpy result2 to host
Here, “cudaMemcpy result1 to host” becomes very slow.
And if I swap “cudaMemcpy result1 to host” with “cudaMemcpy result2 to host”,
then “cudaMemcpy result2 to host” becomes very slow, and “cudaMemcpy result1 to host” becomes normal.
Even if I add some trivial call right after the kernel:
cudaMemcpy data from host
Kernel<<<grid,threads>>>(......)
cudaMemcpy some tiny memory <---- trivial one
cudaMemcpy result1 to host
cudaMemcpy result2 to host
Then that trivial call becomes very slow, and the others become normal.
This trivial call even costs more time than the sum of all the rest.
And it only happens with one particular kernel. Actually, it only happens with the kernel I recently wrote. This kernel does a 1D convolution along the x direction and is very similar to the row convolution in the SDK.
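In case it helps, this is roughly the timing pattern I am seeing (just a minimal sketch: the kernel here is only a stand-in for my convolution, and the buffer names, sizes, and the clock()-based timer are all placeholders):

#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cuda_runtime.h>

// Stand-in for the real convolution kernel.
__global__ void Kernel(float *out1, float *out2, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out1[i] = in[i] * 2.0f;
        out2[i] = in[i] + 1.0f;
    }
}

static double elapsed(clock_t start)
{
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_data = (float *)calloc(n, sizeof(float));
    float *h_result1 = (float *)malloc(bytes);
    float *h_result2 = (float *)malloc(bytes);
    float *d_data, *d_result1, *d_result2;
    cudaMalloc(&d_data, bytes);
    cudaMalloc(&d_result1, bytes);
    cudaMalloc(&d_result2, bytes);

    clock_t t = clock();
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    printf("copy data in : %f s\n", elapsed(t));

    t = clock();
    Kernel<<<(n + 255) / 256, 256>>>(d_result1, d_result2, d_data, n);
    printf("kernel launch: %f s\n", elapsed(t));   // looks almost free

    t = clock();
    cudaMemcpy(h_result1, d_result1, bytes, cudaMemcpyDeviceToHost);
    printf("copy result1 : %f s\n", elapsed(t));   // this is the "very slow" step

    t = clock();
    cudaMemcpy(h_result2, d_result2, bytes, cudaMemcpyDeviceToHost);
    printf("copy result2 : %f s\n", elapsed(t));

    return 0;
}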
Kernel calls are asynchronous and return immediately. If you want to perform timing, you must call cudaThreadSynchronize() before making any timing measurement. cudaMemcpy has an implicit synchronize built into it, so the “very slow” call you are measuring is including the time to execute the kernel.
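In other words, something like this (just a sketch: the function and parameter names are mine, Kernel refers to the stand-in kernel in the sketch above, and on newer toolkits cudaThreadSynchronize() has been replaced by cudaDeviceSynchronize()):

#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

__global__ void Kernel(float *out1, float *out2, const float *in, int n);  // stand-in kernel from the sketch above

void timeKernelWithSync(dim3 grid, dim3 threads,
                        float *d_out1, float *d_out2, const float *d_in, int n)
{
    clock_t t = clock();
    Kernel<<<grid, threads>>>(d_out1, d_out2, d_in, n);
    cudaThreadSynchronize();   // block the CPU until the kernel has actually finished
    printf("kernel: %f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);
    // any cudaMemcpy issued after this point is charged only for the copy itself
}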
Kernel calls are asynchronous - the function returns immediately. The memory copies you’re doing aren’t, and have to wait for the kernel to finish executing. To be able to time a kernel properly you should put cudaThreadSynchronize() after the kernel call:
cudaMemcpy data from host
Kernel<<<grid,threads>>>(......)
cudaThreadSynchronize()
cudaMemcpy result1 to host
cudaMemcpy result2 to host
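If you only care about how long the kernel itself takes, CUDA events are another option; they handle the synchronization for you. A minimal sketch (the function name and parameters are mine, and Kernel is the stand-in kernel from the first sketch):

#include <cuda_runtime.h>

__global__ void Kernel(float *out1, float *out2, const float *in, int n);  // stand-in kernel from the first sketch

float kernelTimeMs(dim3 grid, dim3 threads,
                   float *d_out1, float *d_out2, const float *d_in, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    Kernel<<<grid, threads>>>(d_out1, d_out2, d_in, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              // waits for the kernel and the stop event to complete

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time between the two events, in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}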
I would guess that it is because there is an implicit synchronization before the start of a memcpy call.
Try adding a cudaThreadSynchronize() before the memcpy call and see if it makes all the memcpy calls take about the same time.