I understand that active cycles is defined as the number of cycles for which at least 1 warp is active on a multiprocessor. Does this include the cycles when a warp is waiting for data from memory?
I created a simple CUDA program with just 1 block containing 1 warp, in which each thread accesses 1 element from each of 2 different arrays. These arrays are allocated in pinned host memory using cudaHostAlloc().
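For reference, a minimal sketch of this kind of setup. It is not the actual d_mult code: the kernel name mult_sketch, its body, and the use of mapped (zero-copy) access through cudaHostAllocMapped are assumptions made for illustration. Only cudaHostAlloc() and the 1 block / 16 thread launch come from the description above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread reads one element from each of two
// input arrays and writes one result (2 loads + 1 store per thread).
__global__ void mult_sketch(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] * b[i];
}

int main()
{
    const int n = 16;                      // 16 threads, i.e. part of one warp
    const size_t bytes = n * sizeof(float);
    float *h_a, *h_b, *h_c;                // pinned host buffers
    float *d_a, *d_b, *d_c;                // device-side aliases of the same memory

    // Assumption: the kernel reads host memory directly (zero-copy),
    // so the buffers are allocated as mapped pinned memory.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&h_a, bytes, cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_b, bytes, cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_c, bytes, cudaHostAllocMapped);

    for (int i = 0; i < n; ++i) {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    cudaHostGetDevicePointer((void **)&d_a, h_a, 0);
    cudaHostGetDevicePointer((void **)&d_b, h_b, 0);
    cudaHostGetDevicePointer((void **)&d_c, h_c, 0);

    mult_sketch<<<1, n>>>(d_a, d_b, d_c, n);   // 1 block, 1 warp
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("c[0] = %f\n", h_c[0]);

    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    cudaFreeHost(h_c);
    return 0;
}
```

In a setup like this, every global load has to travel to host memory, so the single warp would spend most of its lifetime waiting on memory rather than issuing instructions.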
I don’t understand why there is such a huge difference between active_cycles and elapsed_cycles_sm. The reported ipc appears to be computed from active_cycles (ipc = inst_executed / active_cycles) rather than from elapsed_cycles_sm. Can anyone please explain this?
==50555== NVPROF is profiling process 50555, command: ./mt 1 16 1 PINNED
ROWS= 1 COLS= 16 WSIZE= 1 MEM-TYPE= 2
BLOCKS_X= 1 BLOCKS_Y= 1
==50555== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==50555== Replaying kernel "d_mult(int, int, int, float*, float*, float*, int)" (done)
==50555== Profiling application: ./mt 1 16 1 PINNED
==50555== Profiling result:
==50555== Event result:
Invocations Event Name Min Max Avg
1 elapsed_cycles_sm 77756 77756 77756
1 inst_executed 44 44 44
1 thread_inst_executed 704 704 704
1 sm_cta_launched 1 1 1
1 warps_launched 1 1 1
1 threads_launched 16 16 16
1 gld_inst_32bit 32 32 32
1 active_cycles 1406 1406 1406
==50555== Metric result:
Invocations Metric Name Metric Description Min Max Avg
1 inst_executed Instructions Executed 44 44 44
1 inst_per_warp Instructions per warp 44.000000 44.000000 44.000000
1 ipc Executed IPC 0.031294 0.031294 0.031294
1 ipc_instance Executed IPC 0.031294 0.031294 0.031294
1 sm_efficiency Multiprocessor Activity 1.81% 1.81% 1.81%
1 sm_efficiency_instance Multiprocessor Activity 1.81% 1.81% 1.81%
1 ldst_executed Executed Load/Store Instructions 3 3 3
1 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.029872 0.029872 0.029872
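
Plugging the event values above into the ratios makes the relationship explicit. The first line is the ipc formula from the question. The second line is an assumption: that sm_efficiency is the ratio of active_cycles to elapsed_cycles_sm. The reported 1.81% is consistent with that reading, but the output itself does not confirm it.

```latex
\begin{align*}
\text{ipc} &= \frac{\text{inst\_executed}}{\text{active\_cycles}}
            = \frac{44}{1406} \approx 0.031294 \\
\text{sm\_efficiency} &\approx \frac{\text{active\_cycles}}{\text{elapsed\_cycles\_sm}}
            = \frac{1406}{77756} \approx 0.0181 = 1.81\%
\end{align*}
```

For comparison, dividing by elapsed_cycles_sm instead would give 44 / 77756 ≈ 0.00057, which does not match the reported ipc.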