I’m doing the following, in pseudocode:
kernel_launch<<< >>>();
cudaThreadSynchronize();  // the call whose elapsed time I measure
NVIDIA's profiling tool, computeprof, reports that the kernel takes 0.95 seconds of GPU + CPU time, but the elapsed time I measure for the cudaThreadSynchronize() call is about 2.9 seconds. Why does it take so much longer? Am I misreading the computeprof results, and does my kernel really take much more than 0.95 seconds to execute?
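A minimal sketch of how the two numbers could be cross-checked, assuming a placeholder kernel: `dummy_kernel`, `wall_seconds`, and the launch configuration are hypothetical stand-ins, not the code from this question. It times the same launch with a host wall clock around cudaThreadSynchronize() and with CUDA events on the GPU, so the two can be compared with what computeprof reports.

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; it stands in for the real kernel_launch.
__global__ void dummy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 1000; ++k)
            x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
}

// Wall-clock time in seconds as seen by the host (hypothetical helper).
static double wall_seconds()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = NULL;

    // Force context creation up front so that one-time initialization
    // is not counted in either measurement below.
    cudaFree(0);
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    double t0 = wall_seconds();
    cudaEventRecord(start, 0);
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);
    double t1 = wall_seconds();  // the launch is asynchronous and returns early

    // Blocks until the kernel has finished. On CUDA 4.0 and later,
    // cudaDeviceSynchronize() replaces this call.
    cudaThreadSynchronize();
    double t2 = wall_seconds();

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // GPU-side time between the two events, in milliseconds.
    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);

    printf("host: launch call        %.6f s\n", t1 - t0);
    printf("host: synchronize call   %.6f s\n", t2 - t1);
    printf("GPU:  event elapsed time %.6f s\n", gpu_ms / 1000.0);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

If the event time is close to the profiler's figure while the host-side time around the synchronize call is much larger, the extra time is being spent outside the kernel itself.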
I am using a GTX 480 card with the following setup:
x86_64 Red Hat Enterprise Linux Client release 5.4 (Tikanga)
NVIDIA driver version 256.40
CUDA toolkit 3.1 (cudatoolkit_3.1_linux_64_rhel5.4.run)