What are the start, end, and pause markers of the “Elapsed Cycles” counter in the Nsight Compute report?

Hi

I found that when I execute an empty kernel function, “Elapsed Cycles” is not 0. May I know what phases “Elapsed Cycles” consists of?


My test file is:

#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>

// Empty kernel: no user code in the body (the compiled SASS still contains a few instructions, e.g. EXIT).
__global__ void setDateNums(unsigned int *numType) {
}

int main()
{
    unsigned int *a;
    cudaMalloc((void **)&a, sizeof(unsigned int) * 2);
    setDateNums<<<1, 2>>>(a);
    cudaDeviceSynchronize();

    return 0;
}
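
For reference, the file can be built and profiled roughly like this (the file name is just an example; the default report sections typically show “Elapsed Cycles” and “SM Active Cycles”):

  nvcc -o empty_kernel empty_kernel.cu
  ncu ./empty_kernel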

I have two questions:
1. According to this Ampere architecture diagram and the description of “Elapsed Cycles”:


[Ampere architecture diagram]
In the GPC with the maximum cycles, the start mark of elapsed cycles is when the scheduler sends the task to the first SM, and the end mark is when the last SM completes the task and returns information to the scheduler. This period includes cycles for fetching kernel parameters, fetching the kernel code, and other setup work. Therefore, the “Elapsed Cycles” of the empty kernel is not 0. Is that correct?
2. The average “SM Active Cycles” is always smaller than “Elapsed Cycles”. Is that correct?

Thanks.

The Hardware Performance Monitor start trigger is issued before the grid is launched by the front end (command processor) to the compute pipe. The sequence is roughly:

1. The grid is scheduled to the compute work distributor.
2. The compute work distributor launches thread blocks on the SMs.
3. Each thread block will execute a few instructions (disassemble the kernel).
4. Each thread block completes, frees its resources, and is reported as completed.
5. When all thread blocks complete, a memory barrier is issued to flush all writes from the SMs to system memory (default) or global memory (optional opt-in API).
6. The Hardware Performance Monitor end trigger is issued after all SMs have been flushed.

I would expect gpc__cycles_elapsed.avg for a single-thread-block null kernel to be between 1000 and 9000 cycles, depending on the chip and the frequency during collection.

There are elapsed cycle counters for each performance monitor. These are denoted as <unit>__cycles_elapsed. This allows the metrics library to handle different clock domains.
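
For example, the per-unit elapsed-cycle counters available on a given chip can be listed with the metric query option (exact names vary by GPU and Nsight Compute version):

  ncu --query-metrics | grep cycles_elapsed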

sm__cycles_elapsed and gpc__cycles_elapsed are in the same clock domain, so they should have very close values, if not exactly the same value.

If collected in the same replay then sm__cycles_active.avg is guaranteed to be less than or equal to sm__cycles_elapsed.avg. If these counters are collected in different passes then there is a chance that this condition will not be true.
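
For example (the metric list is illustrative, reusing the empty_kernel binary above), requesting both counters in one profile gives them a chance to be collected in the same pass; whether they actually share a pass depends on the counter scheduling for your chip:

  ncu --metrics sm__cycles_elapsed.avg,sm__cycles_active.avg,gpc__cycles_elapsed.max ./empty_kernel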

Hi,
Greg

Will the L1 cache and L2 cache be invalidated? If so, will the invalidation process be counted in gpc__cycles_elapsed.avg? Is the invalidation done during the “Each thread block will execute a few instructions (disassemble the kernel)” stage?

Thanks.

Will the L1 cache and L2 cache be invalidated?

If you are running Nsight Compute, the option --cache-control can be used to control invalidation.

  --cache-control arg (=all)            Control the behavior of the GPU caches during profiling. Allowed values:
                                          all
                                          none
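
For example (illustrative command lines, reusing the empty_kernel binary from the first post):

  ncu --cache-control all  --metrics gpc__cycles_elapsed.avg ./empty_kernel   # default: caches are flushed/invalidated for each pass
  ncu --cache-control none --metrics gpc__cycles_elapsed.avg ./empty_kernel   # caches are left as-is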

If so, will the invalidation process be counted in gpc__cycles_elapsed.avg?

No. The state save, state restore, and cache invalidation are performed outside of the measurement window.

Hi,
Greg

I got it, thank you. I have another question. You wrote: “When all thread blocks complete, a memory barrier is issued to flush all writes from the SMs to system memory (default) or global memory (optional opt-in API). The Hardware Performance Monitor end trigger is issued after all SMs have been flushed.” In this sentence, which part of the hardware does “system memory (default)” refer to? Is it the L2?

Thank you so much.

CUDA guarantees that all results written by a workload (grid, memcpy, memset) are accessible by a dependent operation. These operations include GPU workloads (grid, memcpy, memset) and CPU workloads (host callback, memory read). The CUDA programming model has to guarantee that all writes are coherent with future dependent work.

1. SM in-flight writes to host memory must reach host memory, as acknowledged by the interconnect (PCIe, C2C).
2. SM in-flight writes to device memory (global or local memory writes) are acknowledged by the GPU L2.
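
As a minimal sketch of the two destinations (names are illustrative, not from the original reply): one kernel writes both to device memory from cudaMalloc and to pinned, device-accessible host memory from cudaHostAlloc, so the end-of-grid flush has both kinds of in-flight writes to drain.

  // Illustrative only: one kernel writing to both destinations discussed above.
  #include <stdio.h>
  #include <cuda_runtime.h>

  __global__ void writeBoth(int *devBuf, int *hostBuf) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      devBuf[i]  = i;  // device-memory write: flushed once acknowledged by the GPU L2
      hostBuf[i] = i;  // host-memory write: flushed once acknowledged by the interconnect (PCIe/C2C)
  }

  int main() {
      const int n = 256;
      int *devBuf = NULL, *hostBuf = NULL;
      cudaMalloc((void **)&devBuf, n * sizeof(int));                           // device (global) memory
      cudaHostAlloc((void **)&hostBuf, n * sizeof(int), cudaHostAllocMapped);  // pinned host memory, device-accessible via UVA
      writeBoth<<<1, n>>>(devBuf, hostBuf);
      cudaDeviceSynchronize();  // dependent CPU work only sees the results after the flush completes
      printf("hostBuf[0] = %d\n", hostBuf[0]);
      cudaFree(devBuf);
      cudaFreeHost(hostBuf);
      return 0;
  }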

Hi,
Greg

“1. SM in-flight writes to host memory must reach host memory, as acknowledged by the interconnect (PCIe, C2C); 2. SM in-flight writes to device memory (global or local memory writes) are acknowledged by the GPU L2.” Will these two steps be counted by “gpc__cycles_elapsed.max”? Does “gpc__cycles_elapsed.max” end with step 1 or step 2?

Thanks.

The memory barrier (all writes must reach the point of coherence: SYSMEM requires an ACK from the PCIe controller, VIDMEM reaches the GPU L2) is part of the measurement window (included in gpc__cycles_elapsed), as this is part of the grid execution. No dependent commands, including the command to stop counting, can be executed on the stream until the memory barrier completes.

Hi,
Greg

Will the process of writing data back from L2 to VIDMEM be counted by “gpc__cycles_elapsed.max”?

Thank you so much for your patient answer.

The L1/L2 caches are flushed and invalidated prior to execution of the workload. The L1/L2 caches are flushed to the point of coherence after the kernel completes, but prior to stopping collection and launching any dependent work. The GPU L2 is the point of coherence for device memory, so the barrier does not require writes to device memory to reach device memory. The barrier completes at the L2.

Thanks, your answer is very helpful to me.

Time -->
[Ts][Launch][First Warp ---> Last Warp][Flush][Te]
    [elapsed cycles                              ]
where,
  Ts - trigger start collection
  Te - trigger end collection

Launch overhead can be approximated with a null kernel.
Flush overhead varies with the amount of in-flight memory writes to each destination at the end of the grid.
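
As a sketch of how the two overheads could be compared (kernel names and sizes are illustrative): profile a null kernel next to a kernel that leaves many global writes in flight at the end of the grid, and compare the per-kernel elapsed cycles in the report.

  // Illustrative only: nullKernel approximates launch overhead, while writeKernel
  // adds in-flight global writes that the end-of-grid barrier must drain to L2
  // (the [Flush] stage above).
  #include <cuda_runtime.h>

  __global__ void nullKernel() {}

  __global__ void writeKernel(int *buf, int n) {
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
           i += gridDim.x * blockDim.x)
          buf[i] = i;  // global writes still in flight near the end of the grid
  }

  int main() {
      const int n = 1 << 24;
      int *buf = NULL;
      cudaMalloc((void **)&buf, n * sizeof(int));
      nullKernel<<<1, 32>>>();            // ~launch overhead only
      writeKernel<<<256, 256>>>(buf, n);  // launch overhead + write execution + flush of in-flight writes
      cudaDeviceSynchronize();
      cudaFree(buf);
      return 0;
  }

Profiling both launches, e.g. with ncu --metrics gpc__cycles_elapsed.max on the resulting binary, shows the difference; the second launch includes both the write instructions and the larger flush, and exact numbers depend on the chip and clocks.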

If we launch a single kernel, does the [Flush] stage before [Te] flush the L1 or the L2?
