Why is cudaThreadSynchronize() so expensive?

I’m doing the following, in pseudocode:

[codebox]
get_wall_clock_time();
kernel_launch <<< >>>();
get_wall_clock_time();
cudaThreadSynchronize();
get_wall_clock_time();
[/codebox]

The NVIDIA profiling tool computeprof says that the kernel takes 0.95 seconds of GPU + CPU time, but the elapsed time I measure for the cudaThreadSynchronize() call is about 2.9 seconds. Why does it take so much time? Am I misreading the computeprof results, and does my kernel really take much longer than 0.95 seconds to execute?
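
In real code the timing looks roughly like this (the dummy kernel, the launch configuration, and the use of gettimeofday() below are just illustrative, not my actual code):

[codebox]
// Illustrative sketch of the pseudocode above; the kernel and launch
// configuration are placeholders.
#include <sys/time.h>
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void my_kernel(void) { /* dummy work */ }

static double wall_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    double t0 = wall_time();
    my_kernel<<<64, 256>>>();        // returns as soon as the launch is queued
    double t1 = wall_time();
    cudaThreadSynchronize();         // blocks until the kernel has actually finished
    double t2 = wall_time();
    printf("after launch: %.6f s, after sync: %.6f s\n", t1 - t0, t2 - t1);
    return 0;
}
[/codebox]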

Thanks.

I am using a GTX480 card on

x86_64 Red Hat Enterprise Linux Client release 5.4 (Tikanga)

Nvidia driver version 256.40

The CUDA toolkit I downloaded was cudatoolkit_3.1_linux_64_rhel5.4.run

It is not reliable to read the wall clock immediately after the kernel launch, because there is no guarantee that all of the thread blocks and their threads have finished executing at that point. A kernel launch is really a combination of cudaLaunch() and the actual computation: cudaLaunch() returns as soon as the kernel has been queued, while the computation is still running on the video card. So the kernel execution time can only be measured reliably after cudaThreadSynchronize() returns. This is also mentioned in the programming guide.
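
For example, here is a minimal sketch of timing with CUDA events instead of the wall clock (the kernel and launch configuration are placeholders); cudaEventSynchronize() makes sure the work has finished before the elapsed time is read:

[codebox]
// Minimal sketch: timing a kernel with CUDA events, so the measurement
// covers only GPU execution. The kernel is a placeholder.
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dummy_kernel(void) { }

int main(void)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    dummy_kernel<<<64, 256>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);        // wait until the kernel and the stop event have completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
[/codebox]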

Hope this helps!

Thanks, that makes sense. The question then is, why does computeprof say that the kernel executed in 0.95 seconds, while the time from kernel launch to the end of cudaThreadSynchronize() is 2.9 seconds? According to what you’ve said, the kernel actually took 2.9 seconds.

Ah, found the reason. When I profiled the kernel with computeprof, I had compiled without the -G nvcc debug flag, but when I ran the wall clock test, I had compiled with -G. I read a post saying that -G generally slows things down. Now, when I run the wall clock test without -G, the results are consistent with the computeprof result of 0.95 seconds of kernel execution time.

I’ll have to remember to remove -G when I do performance testing!
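
For anyone who runs into the same thing, the only difference is the build line; something like the following (the file names and -arch are just placeholders for my setup):

[codebox]
# Debug build: -G disables most device-code optimizations, so kernels run much slower
nvcc -G -g -arch=sm_20 -o app_debug app.cu

# Build used for performance measurement (no -G)
nvcc -arch=sm_20 -o app app.cu
[/codebox]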
