1. In the GPC with the maximum cycle count, the start mark of the elapsed cycles is when the scheduler sends the task to the first SM, and the end mark is when the last SM completes the task and reports back to the scheduler. This period includes the cycles spent acquiring SM parameters, fetching the kernel, and other overhead, which is why the “Elapsed cycles” of an empty kernel is not 0. Is that correct?
2. “The average SM Active Cycles” is always smaller than “Elapsed Cycles”. Is that correct?
The Hardware Performance Monitor start trigger is issued before the grid is launched by the front end (command processor) to the compute pipe. The grid is scheduled to the compute work distributor, which launches thread blocks on the SMs. Each thread block will execute only a few instructions (you can see them by disassembling the kernel). Each thread block will complete, free its resources, and be reported as completed. When all thread blocks complete, a memory barrier is issued to flush all writes from SMs to system memory (default) or global memory (optional opt-in API). The Hardware Performance Monitor end trigger is issued after all SMs have been flushed.
I would expect gpc__cycles_elapsed.avg for a single-thread-block null kernel to be between 1000 and 9000 cycles, depending on the chip and the clock frequency during collection.
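For reference, here is a minimal null-kernel sketch of the kind described above, with an assumed Nsight Compute invocation in the trailing comment (the binary name is a placeholder, and metric availability depends on the chip and tool version):

#include <cuda_runtime.h>

// Empty kernel: no memory writes and essentially no work, so its elapsed cycles
// approximate the fixed launch + flush overhead of the measurement window.
__global__ void null_kernel() {}

int main() {
    null_kernel<<<1, 1>>>();   // single thread block, single thread
    cudaDeviceSynchronize();   // wait for the grid (and its end-of-grid flush) to complete
    return 0;
}

// Assumed collection command:
//   ncu --metrics gpc__cycles_elapsed.avg ./null_kernel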
There are elapsed cycle counters for each performance monitor. These are denoted as <unit>__cycles_elapsed. This allows the metrics library to handle different clock domains.
sm__cycles_elapsed and gpc__cycles_elapsed are in the same clock domain, so they should have very close values, if not exactly the same value.
If collected in the same replay pass, sm__cycles_active.avg is guaranteed to be less than or equal to sm__cycles_elapsed.avg. If these counters are collected in different passes, there is a chance that this condition will not hold.
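For example, requesting both counters in one collection keeps them eligible for the same replay pass (an assumed invocation; ./app is a placeholder, and the final pass grouping is decided by the tool):

ncu --metrics sm__cycles_elapsed.avg,sm__cycles_active.avg ./app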
Will the L1 cache and L2 cache be invalidated? If so, will the invalidation process be counted in gpc__cycles_elapsed.avg? Is the invalidation done during the “Each thread block will execute only a few instructions” stage?
I got it, thank you. I have another question. “When all thread blocks complete, a memory barrier is issued to flush all writes from SMs to system memory (default) or global memory (optional opt-in API). The Hardware Performance Monitor end trigger is issued after all SMs have been flushed.” In this sentence, which part of the hardware does “system memory (default)” refer to? Is it the L2?
CUDA guarantees that all results written by a workload (grid, memcpy, memset) are accessible by a dependent operation. These operations include GPU workloads (grid, memcpy, memset) and CPU workloads (host callback, memory read). The CUDA programming model has to guarantee that all writes are coherent with future dependent work.
1. SM in-flight writes to host memory must reach host memory, as acknowledged by the interconnect (PCIe, C2C).
2. SM in-flight writes to device memory (global or local memory writes) are acknowledged by the GPU L2.
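A minimal sketch of the two destinations, assuming a 64-bit system with unified virtual addressing so that pinned host memory is directly addressable from the device (names are illustrative):

#include <cuda_runtime.h>

__global__ void write_one(int *dst) { dst[0] = 1; }

int main() {
    int *d_dev = nullptr;   // device (global) memory: end-of-grid writes only need an ACK from the GPU L2
    int *h_pin = nullptr;   // pinned host (system) memory: end-of-grid writes must reach host memory via PCIe/C2C
    cudaMalloc(&d_dev, sizeof(int));
    cudaMallocHost(&h_pin, sizeof(int));

    write_one<<<1, 1>>>(d_dev);   // flush completes at the GPU L2
    write_one<<<1, 1>>>(h_pin);   // flush must be acknowledged by the interconnect

    cudaDeviceSynchronize();
    cudaFree(d_dev);
    cudaFreeHost(h_pin);
    return 0;
}

Profiling the two launches separately would be expected to show a larger flush contribution for the host-memory case.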
“1. SM in-flight writes to host memory must reach host memory, as acknowledged by the interconnect (PCIe, C2C). 2. SM in-flight writes to device memory (global or local memory writes) are acknowledged by the GPU L2.” Will these two steps be counted by “gpc__cycles_elapsed.max”? Does the GPC metric “gpc__cycles_elapsed.max” end with step 1 or step 2?
The memory barrier (all writes must reach the point of coherence: SYSMEM requires an ACK from the PCIe controller, VIDMEM only needs to reach the GPU L2) is part of the measurement window (included in gpc__cycles_elapsed), as it is part of the grid execution. No dependent commands, including the command to stop counting, can be executed on the stream until the memory barrier completes.
The L1/L2 caches are flushed and invalidated prior to execution of the workload. The L1/L2 caches are flushed to the point of coherence after the kernel completes, but prior to stopping collection and launching any dependent work. The GPU L2 is the point of coherence for device memory, so the barrier does not require writes to device memory to reach device memory; the barrier completes at the L2.
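To make the coherence guarantee concrete, here is a small sketch (assumed setup) of a dependent operation on the same stream; it cannot begin until the kernel's end-of-grid memory barrier has completed, so it always observes the kernel's writes:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void produce(int *out) { out[threadIdx.x] = threadIdx.x; }

int main() {
    int *d = nullptr;
    int h[32];
    cudaMalloc(&d, 32 * sizeof(int));
    produce<<<1, 32>>>(d);
    // Dependent operation: the D2H copy on the same (default) stream waits for the
    // kernel and its memory barrier, so the writes are coherent at the GPU L2 here.
    cudaMemcpy(h, d, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h[31]);   // prints 31
    cudaFree(d);
    return 0;
}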
Time -->
[Ts][Launch][First Warp ---> Last Warp][Flush][Te]
[                 elapsed cycles                  ]
where,
Ts - trigger start collection
Te - trigger end collection
Launch overhead can be approximated with a null kernel.
Flush overhead varies with the amount of in-flight memory writes to each destination at the end of the grid.
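Reading the diagram back, a rough, non-authoritative decomposition is:

elapsed cycles ≈ launch overhead + (first warp start .. last warp end) + flush overhead
launch overhead ≈ elapsed cycles of a null kernel (its flush cost is negligible, since it issues no memory writes)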