I just ran into a weird problem, and it is a little hard to describe.
The step right after a certain kernel becomes very slow, no matter what that step is.
For example:
cudaMemcpy data from host
Kernel<<<grid,threads>>>(......)
cudaMemcpy result1 to host
cudaMemcpy result2 to host
Here, “cudaMemcpy result1 to host” becomes very slow.
And if I swap “cudaMemcpy result1 to host” with “cudaMemcpy result2 to host”,
then “cudaMemcpy result2 to host” becomes very slow, and “cudaMemcpy result1 to host” becomes normal.
Even if I add some trivial call right after the kernel:
cudaMemcpy data from host
Kernel<<<grid,threads>>>(......)
cudaMemcpy some tiny memory <---- trivial one
cudaMemcpy result1 to host
cudaMemcpy result2 to host
Then that trivial call becomes very slow, and the others become normal.
This trivial call even costs more time than the sum of all the rest.
And it only happens with one particular kernel. Actually, it only happens with the kernel I recently wrote. This kernel does a 1D convolution along the x direction and is very similar to the row convolution in the SDK.
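In case it helps, this is roughly the timing pattern I am seeing (just a minimal sketch: the kernel here is only a stand-in for my convolution, and the buffer names, sizes, and the clock()-based timer are all placeholders):

#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cuda_runtime.h>

// Stand-in for the real convolution kernel.
__global__ void Kernel(float *out1, float *out2, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out1[i] = in[i] * 2.0f;
        out2[i] = in[i] + 1.0f;
    }
}

static double elapsed(clock_t start)
{
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_data = (float *)calloc(n, sizeof(float));
    float *h_result1 = (float *)malloc(bytes);
    float *h_result2 = (float *)malloc(bytes);
    float *d_data, *d_result1, *d_result2;
    cudaMalloc(&d_data, bytes);
    cudaMalloc(&d_result1, bytes);
    cudaMalloc(&d_result2, bytes);

    clock_t t = clock();
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    printf("copy data in : %f s\n", elapsed(t));

    t = clock();
    Kernel<<<(n + 255) / 256, 256>>>(d_result1, d_result2, d_data, n);
    printf("kernel launch: %f s\n", elapsed(t));   // looks almost free

    t = clock();
    cudaMemcpy(h_result1, d_result1, bytes, cudaMemcpyDeviceToHost);
    printf("copy result1 : %f s\n", elapsed(t));   // this is the "very slow" step

    t = clock();
    cudaMemcpy(h_result2, d_result2, bytes, cudaMemcpyDeviceToHost);
    printf("copy result2 : %f s\n", elapsed(t));

    return 0;
}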
Kernel calls are asynchronous and return immediately. If you want to perform timing, you must call cudaThreadSynchronize() before making any timing measurement. cudaMemcpy has an implicit synchronize built into it, so the “very slow” call you are measuring is including the time to execute the kernel.
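In other words, something like this (just a sketch: the function and parameter names are mine, Kernel refers to the stand-in kernel in the sketch above, and on newer toolkits cudaThreadSynchronize() has been replaced by cudaDeviceSynchronize()):

#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

__global__ void Kernel(float *out1, float *out2, const float *in, int n);  // stand-in kernel from the sketch above

void timeKernelWithSync(dim3 grid, dim3 threads,
                        float *d_out1, float *d_out2, const float *d_in, int n)
{
    clock_t t = clock();
    Kernel<<<grid, threads>>>(d_out1, d_out2, d_in, n);
    cudaThreadSynchronize();   // block the CPU until the kernel has actually finished
    printf("kernel: %f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);
    // any cudaMemcpy issued after this point is charged only for the copy itself
}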
Kernel calls are asynchronous - the function returns immediately. The memory copies you’re doing aren’t, and have to wait for the kernel to finish executing. To be able to time a kernel properly you should put cudaThreadSynchronize() after the kernel call:
cudaMemcpy data from host
Kernel<<<grid,threads>>>(......)
cudaThreadSynchronize()
cudaMemcpy result1 to host
cudaMemcpy result2 to host
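If you only care about how long the kernel itself takes, CUDA events are another option; they handle the synchronization for you. A minimal sketch (the function name and parameters are mine, and Kernel is the stand-in kernel from the first sketch):

#include <cuda_runtime.h>

__global__ void Kernel(float *out1, float *out2, const float *in, int n);  // stand-in kernel from the first sketch

float kernelTimeMs(dim3 grid, dim3 threads,
                   float *d_out1, float *d_out2, const float *d_in, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    Kernel<<<grid, threads>>>(d_out1, d_out2, d_in, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              // waits for the kernel and the stop event to complete

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time between the two events, in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}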
I would guess that it is because there is an implicit synchronization before the start of a memcpy call.
Try adding a cudaThreadSynchronize() before the memcpy call and see if it makes all the memcpy calls take about the same time.