Why is cudaThreadSynchronize() so expensive?

kmccall · October 20, 2010, 7:32pm

I’m doing the following, in pseudocode:

[codebox]
t0 = get_wall_clock_time();
kernel_launch <<< >>>();
t1 = get_wall_clock_time();
cudaThreadSynchronize();
t2 = get_wall_clock_time();   // t2 - t1 is the time I attribute to cudaThreadSynchronize()
[/codebox]

The Nvidia profiling tool, computeprof, reports that the kernel takes 0.95 seconds of GPU + CPU time. But the elapsed time I measure for the cudaThreadSynchronize() call is about 2.9 seconds. Why does it take so long? Am I misreading the computeprof results, and does my kernel really take much more than 0.95 seconds to execute?

Thanks.

I am using a GTX 480 card on x86_64 Red Hat Enterprise Linux Client release 5.4 (Tikanga), with Nvidia driver version 256.40. The CUDA toolkit I downloaded was cudatoolkit_3.1_linux_64_rhel5.4.run

vRavi · October 20, 2010, 9:19pm

Reading the wall clock immediately after kernel_launch is not reliable, because at that point there is no guarantee that all the thread blocks and their threads have finished executing. A kernel launch is a combination of cudaLaunch() and the actual computation: cudaLaunch() returns as soon as the kernel has been launched, while the computation is still running on the video card. So the kernel execution time can only be measured reliably after cudaThreadSynchronize() has returned. This is also mentioned in the programming guide.

Hope this helps!
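
For reference, here is a minimal sketch of the timing pattern described above. The kernel `busy_kernel`, the helper `measure`, and the launch dimensions are hypothetical placeholders, and the sketch assumes a recent toolkit, where cudaDeviceSynchronize() replaces the now-deprecated cudaThreadSynchronize(). The first reading is taken right after the launch returns; only the second, taken after synchronization, covers the kernel's execution.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: a dependent arithmetic loop so the run time is measurable.
__global__ void busy_kernel(float *out, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (float)i;
    for (int k = 0; k < iters; ++k)
        v = v * 1.0000001f + 0.5f;
    out[i] = v;
}

static double seconds_since(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

// Writes the host time elapsed after the launch call and after synchronization.
// Returns false if any CUDA call fails.
bool measure(double *launch_s, double *total_s)
{
    const int blocks = 256, threads = 256;   // assumed sizes, for illustration only
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, blocks * threads * sizeof(float)) != cudaSuccess)
        return false;

    // Warm-up launch so one-time initialization is not included in the timings.
    busy_kernel<<<blocks, threads>>>(d_out, 1);
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    busy_kernel<<<blocks, threads>>>(d_out, 1 << 16);
    *launch_s = seconds_since(t0);            // launch has returned; kernel may still be running

    cudaError_t launch_err = cudaGetLastError();
    cudaError_t sync_err = cudaDeviceSynchronize();   // blocks until the kernel has finished
    *total_s = seconds_since(t0);             // launch + execution

    cudaFree(d_out);
    return launch_err == cudaSuccess && sync_err == cudaSuccess;
}

int main()
{
    double launch_s = 0.0, total_s = 0.0;
    if (!measure(&launch_s, &total_s)) {
        std::fprintf(stderr, "CUDA error during measurement\n");
        return 1;
    }
    std::printf("after launch: %.6f s, after sync: %.6f s\n", launch_s, total_s);
    return 0;
}
```

The difference between the two readings is the time the host spent waiting in the synchronize call, which is essentially the kernel's remaining execution time rather than a cost of the call itself.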

kmccall · October 21, 2010, 2:05pm

Thanks, that makes sense. The question then is, why does computeprof say that the kernel executed in 0.95 seconds, while the time from kernel launch to the end of cudaThreadSynchronize() is 2.9 seconds? According to what you’ve said, the kernel actually took 2.9 seconds.

kmccall · October 21, 2010, 2:13pm

Ah, found the reason. When I used computeprof to profile the kernel, I compiled without the -G nvcc debug flag. But when I ran the wall clock test, I compiled with the -G flag. I read a post that said that -G generally slows things down. Now, when I run the wall clock test without -G, the results are consistent with the computeprof results of 0.95 seconds kernel execution time.

I’ll have to remember to remove -G when I do performance testing!
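
A minimal sketch of a timing setup that avoids this pitfall, assuming a recent toolkit; `scale_kernel`, `time_kernel_ms`, and the file name in the build line are hypothetical. It times the kernel with CUDA events, which are recorded on the GPU, and the comment at the top shows a build line without -G, since the nvcc documentation describes -G as turning off device-code optimizations.

```cuda
// Build for timing without device debug info, for example:
//   nvcc -O2 -o event_timing event_timing.cu
// Adding -G turns off device-code optimizations, so keep it for debugging builds only.
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to have something to time.
__global__ void scale_kernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Returns the GPU time of one kernel launch in milliseconds, or a negative value on error.
float time_kernel_ms(int n)
{
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess)
        return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);                      // enqueued before the kernel
    scale_kernel<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaEventRecord(stop);                       // enqueued after the kernel
    cudaError_t launch_err = cudaGetLastError();
    cudaError_t sync_err = cudaEventSynchronize(stop);   // wait until 'stop' has been reached

    float ms = -1.0f;
    if (launch_err == cudaSuccess && sync_err == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);  // elapsed time between the two events

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}

int main()
{
    float ms = time_kernel_ms(1 << 20);
    if (ms < 0.0f) {
        std::fprintf(stderr, "CUDA error during timing\n");
        return 1;
    }
    std::printf("kernel time: %.3f ms\n", ms);
    return 0;
}
```

Building the same file once with and once without -G and comparing the two printed times is a quick way to see how much the debug build costs for a given kernel.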
