nvprof active_cycles vs elapsed_cycles_sm

NC1 · August 26, 2016, 5:19pm

I understand that active cycles is defined as the number of cycles for which at least 1 warp is active on a multiprocessor. Does this include the cycles when a warp is waiting for memory data?
I created a simple CUDA program with just 1 warp and 1 block, where each thread accesses 1 element of 2 different arrays. These arrays are allocated in pinned host memory using cudaHostAlloc().
I don’t understand why there is such a huge difference between active_cycles and elapsed_cycles_sm. The ipc computation seems to match active_cycles (ipc = inst_executed / active_cycles). Can anyone please explain this?

==50555== NVPROF is profiling process 50555, command: ./mt 1 16 1 PINNED
ROWS= 1 COLS= 16 WSIZE= 1 MEM-TYPE= 2
BLOCKS_X= 1 BLOCKS_Y= 1
==50555== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==50555== Replaying kernel “d_mult(int, int, int, float*, float*, float*, int)” (done)
==50555== Profiling application: ./mt 1 16 1 PINNED
==50555== Profiling result:
==50555== Event result:
Invocations                            Event Name         Min         Max         Avg

      1                         elapsed_cycles_sm       77756       77756       77756
      1                             inst_executed          44          44          44
      1                      thread_inst_executed         704         704         704
      1                           sm_cta_launched           1           1           1
      1                            warps_launched           1           1           1
      1                          threads_launched          16          16          16
      1                            gld_inst_32bit          32          32          32
      1                             active_cycles        1406        1406        1406

==50555== Metric result:
Invocations                           Metric Name                        Metric Description         Min         Max         Avg

      1                             inst_executed                     Instructions Executed          44          44          44
      1                             inst_per_warp                     Instructions per warp   44.000000   44.000000   44.000000
      1                                       ipc                              Executed IPC    0.031294    0.031294    0.031294
      1                              ipc_instance                              Executed IPC    0.031294    0.031294    0.031294
      1                             sm_efficiency                   Multiprocessor Activity       1.81%       1.81%       1.81%
      1                    sm_efficiency_instance                   Multiprocessor Activity       1.81%       1.81%       1.81%
      1                             ldst_executed          Executed Load/Store Instructions           3           3           3
      1                  eligible_warps_per_cycle           Eligible Warps Per Active Cycle    0.029872    0.029872    0.029872
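
As a quick sanity check on the numbers above, here is a minimal Python sketch that recomputes a few ratios from the raw event counts. The formulas are assumptions inferred from the values in this output (ipc as inst_executed / active_cycles, sm_efficiency as active_cycles / elapsed_cycles_sm); they are not taken from the nvprof documentation, and the `events` dictionary and `ratio` helper are just illustrative names.

```python
# Event counts copied from the nvprof "Event result" section above.
events = {
    "elapsed_cycles_sm": 77756,
    "active_cycles": 1406,
    "inst_executed": 44,
    "thread_inst_executed": 704,
}


def ratio(numerator, denominator):
    """Return numerator / denominator as a float."""
    return numerator / denominator


# Assumed formulas, inferred from the reported values (not from the docs).
ipc = ratio(events["inst_executed"], events["active_cycles"])
sm_efficiency = ratio(events["active_cycles"], events["elapsed_cycles_sm"])
# Average number of threads executing each warp-level instruction.
threads_per_inst = ratio(events["thread_inst_executed"], events["inst_executed"])

print(f"ipc              = {ipc:.6f}")
print(f"sm_efficiency    = {sm_efficiency:.2%}")
print(f"threads per inst = {threads_per_inst:.0f}")
```

With these counts, the first two values agree with the reported ipc (0.031294) and sm_efficiency (1.81%), and the third agrees with threads_launched (16).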

Robert_Crovella · August 26, 2016, 10:41pm

No. A warp waiting for memory activity is stalled and does not consume a scheduler slot, meaning the scheduler does not consider it an active warp, i.e. one having an instruction available for scheduling (although it does consume a warp slot for purposes of residency: it counts against the maximum warps per multiprocessor limit).

In light of this, I assume it is clear why active_cycles is much lower than elapsed_cycles_sm. It’s also connected to eligible_warps_per_cycle, which is very low, and sm_efficiency, which is very low.

NC1 · August 26, 2016, 11:06pm

Thank you very much! In that case, shouldn’t the cycle count used for IPC also include the stalled cycles? I have only 1 warp here, so there is no scope for switching context to a different warp. But IPC here seems to be computed as inst_executed / active_cycles.
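
To make the question concrete, here is a small Python sketch comparing the two possible denominators, using the event counts from the first post. The second figure, `ipc_elapsed`, is purely hypothetical: it is what one would get by dividing by elapsed_cycles_sm instead, and it is not a metric that nvprof reports in this output.

```python
# Event counts from the nvprof output in the first post.
inst_executed = 44
active_cycles = 1406
elapsed_cycles_sm = 77756

# Matches the way the reported ipc appears to be computed.
ipc_active = inst_executed / active_cycles

# Hypothetical alternative: count every elapsed cycle, stalled or not.
ipc_elapsed = inst_executed / elapsed_cycles_sm

print(f"inst_executed / active_cycles     = {ipc_active:.6f}")
print(f"inst_executed / elapsed_cycles_sm = {ipc_elapsed:.6f}")
print(f"ratio between the two             = {ipc_active / ipc_elapsed:.1f}x")
```

The gap between the two values is the same factor that separates elapsed_cycles_sm from active_cycles.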

Robert_Crovella · August 27, 2016, 12:12am

I don’t think this definition of IPC applies to cycles where no instructions were issued.
