nvprof active_cycles vs elapsed_cycles_sm

I understand that active_cycles is defined as the number of cycles for which at least 1 warp is active on a multiprocessor. Does this include the cycles when a warp is waiting for memory data?
I created a simple CUDA program with just 1 warp in 1 block, where each thread accesses 1 element of 2 different arrays. These arrays are allocated in pinned host memory using cudaHostAlloc().
I don’t understand why there is a huge difference between active_cycles and elapsed_cycles_sm. The reported ipc matches a computation based on active_cycles (ipc = inst_executed/active_cycles). Can anyone please explain this?

==50555== NVPROF is profiling process 50555, command: ./mt 1 16 1 PINNED
ROWS= 1 COLS= 16 WSIZE= 1 MEM-TYPE= 2
BLOCKS_X= 1 BLOCKS_Y= 1
==50555== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==50555== Replaying kernel "d_mult(int, int, int, float*, float*, float*, int)" (done)
==50555== Profiling application: ./mt 1 16 1 PINNED
==50555== Profiling result:
==50555== Event result:
Invocations                            Event Name         Min         Max         Avg

      1                         elapsed_cycles_sm       77756       77756       77756
      1                             inst_executed          44          44          44
      1                      thread_inst_executed         704         704         704
      1                           sm_cta_launched           1           1           1
      1                            warps_launched           1           1           1
      1                          threads_launched          16          16          16
      1                            gld_inst_32bit          32          32          32
      1                             active_cycles        1406        1406        1406

==50555== Metric result:
Invocations                           Metric Name                        Metric Description         Min         Max         Avg

      1                             inst_executed                     Instructions Executed          44          44          44
      1                             inst_per_warp                     Instructions per warp   44.000000   44.000000   44.000000
      1                                       ipc                              Executed IPC    0.031294    0.031294    0.031294
      1                              ipc_instance                              Executed IPC    0.031294    0.031294    0.031294
      1                             sm_efficiency                   Multiprocessor Activity       1.81%       1.81%       1.81%
      1                    sm_efficiency_instance                   Multiprocessor Activity       1.81%       1.81%       1.81%
      1                             ldst_executed          Executed Load/Store Instructions           3           3           3
      1                  eligible_warps_per_cycle           Eligible Warps Per Active Cycle    0.029872    0.029872    0.029872

No. A warp waiting on a memory access is stalled and does not consume a scheduler slot, so the scheduler does not consider it an active warp, i.e. one that has an instruction available for scheduling. It does still consume a warp slot for residency purposes: it counts against the maximum warps per multiprocessor limit.

In light of this, I assume it is clear why active_cycles is much lower than elapsed_cycles_sm. It’s also connected to eligible_warps_per_cycle and sm_efficiency, both of which are very low.
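
As a sanity check on the numbers in the profile, here is a minimal Python sketch that recomputes the two ratios from the reported event counts. The two formulas are assumptions about how the metrics relate; they reproduce the reported values, but they are not quoted from the nvprof documentation.

```python
# Event counts copied from the nvprof output in the question.
elapsed_cycles_sm = 77756
active_cycles = 1406
inst_executed = 44

# Assumed relation: sm_efficiency = active_cycles / elapsed_cycles_sm.
sm_efficiency_pct = 100.0 * active_cycles / elapsed_cycles_sm

# Assumed relation: ipc = inst_executed / active_cycles.
ipc = inst_executed / active_cycles

print(f"sm_efficiency = {sm_efficiency_pct:.2f}%")  # reported: 1.81%
print(f"ipc           = {ipc:.6f}")                 # reported: 0.031294
```

Both computed values round to the figures nvprof printed (1.81% and 0.031294).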

Thank you very much! In that case, shouldn’t the cycles used for IPC also include the stalled cycles? I have only 1 warp here, so there is no opportunity to switch to a different warp. But IPC here seems to be computed as inst_executed/active_cycles.

I don’t think this definition of IPC applies to cycles where no instructions were issued.
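
To make the difference between the two denominators concrete, here is a small Python sketch using the counts from the profile in the question. The name ipc_over_elapsed is hypothetical and is not an nvprof metric; it only shows what the ratio would look like if every elapsed cycle were counted.

```python
# Event counts copied from the nvprof output in the question.
inst_executed = 44
active_cycles = 1406
elapsed_cycles_sm = 77756

# Ratio that matches the ipc value nvprof reported.
ipc_over_active = inst_executed / active_cycles

# Hypothetical alternative: divide by elapsed_cycles_sm instead.
ipc_over_elapsed = inst_executed / elapsed_cycles_sm

print(f"inst_executed / active_cycles     = {ipc_over_active:.6f}")
print(f"inst_executed / elapsed_cycles_sm = {ipc_over_elapsed:.6f}")
print(f"ratio between the two             = {ipc_over_active / ipc_over_elapsed:.1f}x")
```

Only the first ratio agrees with the ipc metric in the profile; the second is smaller by the same factor that separates elapsed_cycles_sm from active_cycles.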