computeprof "active cycles" counter "active cycles" value doesn't make sense to

natan88a · May 5, 2012, 6:47pm

Hi,

I’m profiling my application with the computeprof profiler supplied with CUDA Toolkit 4.0 (not nvvp, which is supplied with CUDA 4.1), on a Tesla C2075.
One of the profiler’s counters is “active cycles”, which is used to calculate IPC, average occupancy, and SM efficiency.

I don’t understand the value I get for “active cycles”. According to the profiler’s help/guide, SM efficiency is calculated as “active cycles” / “elapsed clocks”. I assume that “elapsed clocks” is “GPU time” / “GPU cycle time” = “GPU time” * “GPU frequency”. The profiler reports high SM efficiency (95%-100%), which should mean that “GPU time” * “GPU frequency” ≈ “active cycles”, but the product I compute is about 2x higher than “active cycles”.

Can anyone clear this up for me?

BTW, the Tesla C2075 frequency is 1.15 GHz (the scheduler’s frequency, as I understand it); I assume this is the frequency used for the calculation.

Thanks,
Natan

vvolkov · May 7, 2012, 10:05pm

I have never tried to put these profiler numbers together, but here is a guess: pre-Kepler GPUs have two clock domains. The higher clock is used by the arithmetic pipelines; the lower one is used by the rest of the multiprocessor, such as instruction issue. The two differ by a factor of 2. Maybe active cycles are reported at the lower frequency.

Greg · May 9, 2012, 4:20am

The SM PM counters increment at the Graphics Clock == 1/2 Processor Clock.
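
Applied to the original question, this reply suggests computing “elapsed clocks” at half the 1.15 GHz processor clock. The sketch below works through that arithmetic. The profiler readings are hypothetical values chosen for illustration, the function names are made up for this example, and GPU time is assumed to be reported in microseconds.

```python
def elapsed_cycles(gpu_time_us: float, clock_hz: float) -> float:
    """Cycles that elapse in one clock domain during gpu_time_us microseconds."""
    return gpu_time_us * 1e-6 * clock_hz


def sm_efficiency(active_cycles: float, gpu_time_us: float, clock_hz: float) -> float:
    """SM efficiency as active cycles divided by elapsed cycles."""
    return active_cycles / elapsed_cycles(gpu_time_us, clock_hz)


PROCESSOR_CLOCK_HZ = 1.15e9                  # Tesla C2075 clock quoted in the thread
COUNTER_CLOCK_HZ = PROCESSOR_CLOCK_HZ / 2    # graphics clock, per the reply above

# Hypothetical profiler readings (not taken from the thread).
gpu_time_us = 1000.0
active_cycles = 560_000

# Dividing by processor-clock cycles makes the SMs look about half idle.
naive = sm_efficiency(active_cycles, gpu_time_us, PROCESSOR_CLOCK_HZ)

# Dividing by graphics-clock cycles matches the high efficiency the profiler shows.
corrected = sm_efficiency(active_cycles, gpu_time_us, COUNTER_CLOCK_HZ)

print(f"naive: {naive:.1%}, corrected: {corrected:.1%}")
```

With these made-up readings the naive figure is roughly half of the corrected one, which is the 2x discrepancy described in the first post.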

natan88a · May 10, 2012, 2:31pm

But the graphics clock of the Tesla C2075 is 1.15 GHz, isn’t it?
According to NVIDIA, this GPU can process up to 1 TFLOPS: 1 TFLOPS = 14 SMs * 32 cores/SM * 1.15 GHz * 2 (IPC). That means the IPC is calculated relative to the 1.15 GHz frequency, so this must be the lower frequency, i.e. the processor frequency is 2.3 GHz.

seibert · May 13, 2012, 8:55pm

No, the clock rate for the CUDA cores is 1.15 GHz. The factor of 2 in the quoted FLOPS numbers is not an IPC factor. Each CUDA core pipeline finishes 1 instruction per clock (for most floating point instructions), but there happens to be one instruction (the fused multiply-add) that does two floating point operations at the same time. The theoretical peak of roughly 1 TFLOPS for the Tesla C2075 assumes that your instruction sequence is nothing but FMA instructions. In real programs, the throughput will be lower, as other instructions do not perform 2 floating point operations at once.
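
The peak figure can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the C2075's published configuration of 448 CUDA cores (14 SMs with 32 cores each), which is a figure not stated in this reply; the function name is hypothetical.

```python
def peak_gflops(num_sms: int, cores_per_sm: int, core_clock_ghz: float,
                flops_per_instruction: int = 2) -> float:
    """Theoretical single-precision peak in GFLOPS.

    flops_per_instruction is 2 when every instruction is an FMA
    (one multiply plus one add) and 1 for a plain add or multiply.
    """
    return num_sms * cores_per_sm * core_clock_ghz * flops_per_instruction


# Assumed configuration: 14 SMs * 32 cores/SM at the 1.15 GHz CUDA core clock.
all_fma = peak_gflops(14, 32, 1.15)                             # every slot is an FMA
no_fma = peak_gflops(14, 32, 1.15, flops_per_instruction=1)     # no FMAs at all

print(f"FMA-only peak: {all_fma:.1f} GFLOPS, without FMA: {no_fma:.1f} GFLOPS")
```

The factor of 2 enters as floating point operations per instruction, not as instructions per clock, so the clock in the formula stays at 1.15 GHz.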

natan88a · May 14, 2012, 4:46pm

Wow, if that is true, I was way off.
So is an FMA instruction counted as two instructions in the profiler? What about SCADD (shift + add) and MAD? Are they also counted as two instructions?

Thanks!

seibert · May 14, 2012, 7:54pm

No, I don’t think FMA is counted as two instructions in the profiler, but FMA is counted as two floating point operations (not instructions!) in NVIDIA marketing materials.
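
To make the distinction concrete, here is a small sketch that converts an instruction mix into a floating point operation count. The instruction counts are hypothetical, and the function is an illustration only, not something the profiler provides.

```python
def flop_count(fma_instructions: int, other_fp_instructions: int) -> int:
    """Floating point operations performed by a given instruction mix.

    Each FMA is one instruction but two operations (multiply and add);
    every other floating point instruction is assumed to be one operation.
    """
    return 2 * fma_instructions + other_fp_instructions


# Hypothetical kernel: 600 FMAs and 400 other floating point instructions.
instructions = 600 + 400
operations = flop_count(600, 400)

print(f"{instructions} instructions, {operations} floating point operations")
```

Under this assumption, an instruction counter would report 1000 for this mix, while a FLOPS figure would be based on 1600 operations.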

natan88a · May 15, 2012, 6:50am

Thank you, you were very helpful.
