I am trying to understand how the timing is done for CUDA programs.
I have attached a table of my results for the Tesla C1060 and the C2050. I am calling three kernels in my CUDA program and all these results are averaged over 2000 runs.
Execution time in usecs

                          C1060      C2050
reduce                    95.854     48.05
find_max                  42.226     19.14
xy_field                  37.464     14.04
Total time (2000 runs)    769139     924073
Average per run           384.569    462.036

The three kernel values correspond to the GPU Time column of the profiler output. The total time was obtained by using gettimeofday with usec resolution, and the average is the total divided by 2000.
The time taken to zero-copy the data is only about 3 usecs. After that, the kernel execution time is the GPU Time shown in the profiler output. The total time is measured using gettimeofday(), and this value agrees with the time reported through cudaEventRecord. However, when I add up the values in the GPU Time column, I get 175.544 usecs for the C1060 and 81.23 usecs for the C2050. Subtracting these from the per-run averages, it looks as if I am losing 209.025 usecs to data access from global memory on the C1060 and 380.806 usecs on the C2050.
Is this correct? Also, I am seeing that the C1060 is faster than the C2050 in overall time, even though each kernel's GPU Time is lower on the C2050. I looked at the occupancy, and the C2050 is less occupied than the C1060. I did go through Volkov's presentation and Chapter 4 of the Best Practices Guide, which mention that high occupancy does not necessarily mean faster execution. But I see the converse here in terms of overall execution time. (I run the kernels using only 16 threads on both the C1060 and the C2050.)
Use more threads. Many more threads. And to determine absolute performance, it is best to measure the wall-clock time of the kernel you are interested in, without the profiler running.
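As a rough illustration of that advice, here is a minimal sketch of wall-clock timing around a kernel with gettimeofday(). The kernel `scale`, the array size, and the launch configuration of 256 threads per block are placeholders chosen for the example, not the kernels or sizes from the question. Kernel launches return to the host before the kernel has finished, so the sketch calls cudaDeviceSynchronize() before reading the stop time (older toolkits used cudaThreadSynchronize() for the same purpose).

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; not one of the three kernels in the question.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Microseconds between two gettimeofday() samples.
static double elapsed_usec(struct timeval start, struct timeval stop)
{
    return (stop.tv_sec - start.tv_sec) * 1e6 + (stop.tv_usec - start.tv_usec);
}

int main()
{
    const int n = 1 << 20;                            // assumed problem size
    const int threads = 256;                          // many threads per block, not 16
    const int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
    const int runs = 2000;

    float *d_data = NULL;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Warm-up launch so one-time initialization cost is not included in the timing.
    scale<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    struct timeval start, stop;
    gettimeofday(&start, NULL);
    for (int r = 0; r < runs; ++r)
        scale<<<blocks, threads>>>(d_data, n);
    // Launches are asynchronous: wait for the GPU to finish before stopping the clock.
    cudaDeviceSynchronize();
    gettimeofday(&stop, NULL);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("average per launch: %.3f usec\n", elapsed_usec(start, stop) / runs);

    cudaFree(d_data);
    return 0;
}
```

Synchronizing once after the loop gives an average that lets back-to-back launches overlap their launch overhead. Moving the cudaDeviceSynchronize() call inside the loop would instead measure each launch in isolation, which is usually a larger number for very short kernels.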