Understanding memory latency when using the CUDA profiler vs. cudaEventRecord

I am trying to understand how the timing is done for CUDA programs.

I have attached a table of my results for the Tesla C1060 and the C2050. I am calling three kernels in my CUDA program and all these results are averaged over 2000 runs.

Execution time in usecs (values taken from the GPU Time column of the profiler output)

Kernel        C1060      C2050
reduce        95.854     48.05
find_max      42.226     19.14
xy_field      37.464     14.04

Total time    769139     924073    // gettimeofday() with usec resolution, summed over 2000 runs
Average       384.569    462.036   // total divided by 2000

The time taken to zero-copy the data is only about 3 usecs. After that, the kernel execution time is the GPU Time shown in the profiler output. The total time is measured with gettimeofday(); this value corresponds to the time reported by cudaEventRecord. However, when I add up the values in the GPU Time column, I get 175.544 usecs for the C1060 and 81.23 usecs for the C2050. That means I am losing 209.025 usecs to global memory access on the C1060 and 380.806 usecs on the C2050.
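
For reference, the timing structure is roughly the following (a minimal sketch, not my actual program; my_kernel, N and the launch configuration are placeholders):

// Sketch: comparing cudaEventRecord timing with gettimeofday() wall-clock
// timing around the same kernel call. my_kernel and N are placeholders.
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void my_kernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;
    float *d;
    cudaMalloc((void **)&d, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);                    // host wall-clock start

    cudaEventRecord(start, 0);                  // GPU-side start marker
    my_kernel<<<(N + 255) / 256, 256>>>(d, N);
    cudaEventRecord(stop, 0);                   // GPU-side stop marker
    cudaEventSynchronize(stop);                 // wait for the kernel to finish

    gettimeofday(&t1, NULL);                    // host wall-clock stop

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // elapsed GPU time in milliseconds
    double wall_us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);

    printf("event time: %.3f usec, wall clock: %.3f usec\n", ms * 1000.0, wall_us);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}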

Is this correct? Also, I am seeing that the C1060 is faster than the C2050 here. I looked at the occupancy, and the C2050 is less occupied than the C1060. I did go through Volkov's presentation and Chapter 4 of the Best Practices Guide, which mention that high occupancy does not necessarily mean faster execution. But I am seeing the converse here in terms of overall execution time. (I run the kernels with only 16 threads on both the C1060 and the C2050.)

Suggestions?? Thanks in advance :)

suggestions?

added some results

Use more threads. Many more threads. And in determining absolute performance it is best to wall clock time the kernel you are interested in without profiling.
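
For example, something along these lines (a sketch with a placeholder kernel and sizes, not your code): use a grid-stride loop so the result stays correct for any launch configuration, launch a few hundred threads per block, and wall-clock time around an explicit synchronize:

// Sketch: wall-clock timing a kernel launched with many threads.
// my_kernel, N and the block size are placeholders.
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void my_kernel(const float *in, float *out, int n)
{
    // grid-stride loop: correct for any thread/block count,
    // so going from 16 threads to 256 does not change the result
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        out[i] = in[i] * in[i];
}

int main()
{
    const int N = 1 << 20;
    float *in, *out;
    cudaMalloc((void **)&in,  N * sizeof(float));
    cudaMalloc((void **)&out, N * sizeof(float));

    const int threads = 256;                        // many more than 16
    const int blocks  = (N + threads - 1) / threads;

    struct timeval t0, t1;
    cudaDeviceSynchronize();                        // drain any pending work first
    gettimeofday(&t0, NULL);
    my_kernel<<<blocks, threads>>>(in, out, N);
    cudaDeviceSynchronize();                        // kernel launches are asynchronous
    gettimeofday(&t1, NULL);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("wall clock: %.1f usec\n", us);

    cudaFree(in);
    cudaFree(out);
    return 0;
}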

@eelsen If I use more than 16 threads, my answers are wrong.
