What are the start, end, and pause markers of the “Elapsed Cycles” counter in the Nsight Compute report?

Hi

I found that when I execute an empty kernel function, “Elapsed Cycles” is not 0. May I know what phases “Elapsed Cycles” consists of?


My test file is:

#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>

// Empty kernel: no user code in the body (the compiled SASS still contains a few instructions, e.g. EXIT).
__global__ void setDateNums(unsigned int *numType) {
}

int main()
{
    unsigned int *a;
    cudaMalloc((void **)&a, sizeof(unsigned int) * 2);
    setDateNums<<<1, 2>>>(a);
    cudaDeviceSynchronize();

    return 0;
}
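
For reference, the file can be built and profiled roughly like this (the file name is just an example; the default report sections typically show “Elapsed Cycles” and “SM Active Cycles”):

  nvcc -o empty_kernel empty_kernel.cu
  ncu ./empty_kernel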

I have two questions:
1. According to this Ampere architecture diagram and the description of “Elapsed Cycles”:


[Ampere architecture diagram]
In the GPC with the maximum cycles, the start mark of elapsed cycles is when the scheduler sends the task to the first SM, and the end mark is when the last SM completes the task and returns information to the scheduler. This period includes cycles for fetching kernel parameters, fetching the kernel code, and other setup work. Therefore, the “Elapsed Cycles” of the empty kernel is not 0. Is that correct?
2. The average “SM Active Cycles” is always smaller than “Elapsed Cycles”. Is that correct?

Thanks.

The Hardware Performance Monitor start trigger is issued before the grid is launched by the front end (command processor) to the compute pipe. The sequence is roughly:

1. The grid is scheduled to the compute work distributor.
2. The compute work distributor launches thread blocks on the SMs.
3. Each thread block will execute a few instructions (disassemble the kernel).
4. Each thread block completes, frees its resources, and is reported as completed.
5. When all thread blocks complete, a memory barrier is issued to flush all writes from the SMs to system memory (default) or global memory (optional opt-in API).
6. The Hardware Performance Monitor end trigger is issued after all SMs have been flushed.

I would expect gpc__cycles_elapsed.avg for a single-thread-block null kernel to be between 1000 and 9000 cycles, depending on the chip and the frequency during collection.

There are elapsed cycle counters for each performance monitor. These are denoted as <unit>__cycles_elapsed. This allows the metrics library to handle different clock domains.
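
For example, the per-unit elapsed-cycle counters available on a given chip can be listed with the metric query option (exact names vary by GPU and Nsight Compute version):

  ncu --query-metrics | grep cycles_elapsed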

sm__cycles_elapsed and gpc__cycles_elapsed are in the same clock domain, so they should have very close values, if not exactly the same value.

If collected in the same replay then sm__cycles_active.avg is guaranteed to be less than or equal to sm__cycles_elapsed.avg. If these counters are collected in different passes then there is a chance that this condition will not be true.
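
For example (the metric list is illustrative, reusing the empty_kernel binary above), requesting both counters in one profile gives them a chance to be collected in the same pass; whether they actually share a pass depends on the counter scheduling for your chip:

  ncu --metrics sm__cycles_elapsed.avg,sm__cycles_active.avg,gpc__cycles_elapsed.max ./empty_kernel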

Hi,
Greg

Will the L1 cache and L2 cache be invalidated? If so, will the invalidation process be counted in gpc__cycles_elapsed.avg? Is the invalidation done during the “Each thread block will execute a few instructions (disassemble the kernel)” stage?

Thanks.

Will the L1 cache and L2 cache be invalidated?

If you are running Nsight Compute, the option --cache-control can be used to control invalidation.

  --cache-control arg (=all)            Control the behavior of the GPU caches during profiling. Allowed values:
                                          all
                                          none
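
For example (illustrative command lines, reusing the empty_kernel binary from the first post):

  ncu --cache-control all  --metrics gpc__cycles_elapsed.avg ./empty_kernel   # default: caches are flushed/invalidated for each pass
  ncu --cache-control none --metrics gpc__cycles_elapsed.avg ./empty_kernel   # caches are left as-is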

If so, will the invalidation process be counted in gpc__cycles_elapsed.avg?

No. The state save, state restore, and cache invalidation are performed outside of the measurement window.

Hi,
Greg

I got it, thank you. I have another question. You wrote: “When all thread blocks complete, a memory barrier is issued to flush all writes from the SMs to system memory (default) or global memory (optional opt-in API). The Hardware Performance Monitor end trigger is issued after all SMs have been flushed.” In this sentence, which part of the hardware does “system memory (default)” refer to? Is it the L2?

Thank you so much.

CUDA guarantees that all results written by a workload (grid, memcpy, memset) are accessible by a dependent operation. These operations include GPU workloads (grid, memcpy, memset) and CPU workloads (host callback, memory read). The CUDA programming model has to guarantee that all writes are coherent with future dependent work.

1. SM in-flight writes to host memory must reach host memory, as acknowledged by the interconnect (PCIe, C2C).
2. SM in-flight writes to device memory (global or local memory writes) are acknowledged by the GPU L2.
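
As a minimal sketch of the two destinations (names are illustrative, not from the original reply): one kernel writes both to device memory from cudaMalloc and to pinned, device-accessible host memory from cudaHostAlloc, so the end-of-grid flush has both kinds of in-flight writes to drain.

  // Illustrative only: one kernel writing to both destinations discussed above.
  #include <stdio.h>
  #include <cuda_runtime.h>

  __global__ void writeBoth(int *devBuf, int *hostBuf) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      devBuf[i]  = i;  // device-memory write: flushed once acknowledged by the GPU L2
      hostBuf[i] = i;  // host-memory write: flushed once acknowledged by the interconnect (PCIe/C2C)
  }

  int main() {
      const int n = 256;
      int *devBuf = NULL, *hostBuf = NULL;
      cudaMalloc((void **)&devBuf, n * sizeof(int));                           // device (global) memory
      cudaHostAlloc((void **)&hostBuf, n * sizeof(int), cudaHostAllocMapped);  // pinned host memory, device-accessible via UVA
      writeBoth<<<1, n>>>(devBuf, hostBuf);
      cudaDeviceSynchronize();  // dependent CPU work only sees the results after the flush completes
      printf("hostBuf[0] = %d\n", hostBuf[0]);
      cudaFree(devBuf);
      cudaFreeHost(hostBuf);
      return 0;
  }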

Hi,
Greg

“1. SM in-flight writes to host memory must reach host memory, as acknowledged by the interconnect (PCIe, C2C); 2. SM in-flight writes to device memory (global or local memory writes) are acknowledged by the GPU L2.” Will these two steps be counted by “gpc__cycles_elapsed.max”? Does “gpc__cycles_elapsed.max” end with step 1 or step 2?

Thanks.

The memory barrier (all writes must reach the point of coherence: SYSMEM requires an ACK from the PCIe controller, VIDMEM reaches the GPU L2) is part of the measurement window (included in gpc__cycles_elapsed), as this is part of the grid execution. No dependent commands, including the command to stop counting, can be executed on the stream until the memory barrier completes.

Hi,
Greg

Will the process of writing data back from L2 to VIDMEM be counted by “gpc__cycles_elapsed.max”?

Thank you so much for your patient answer.

The L1/L2 caches are flushed and invalidated prior to execution of the workload. The L1/L2 caches are flushed to the point of coherence after the kernel completes, but prior to stopping collection and launching any dependent work. The GPU L2 is the point of coherence for device memory, so the barrier does not require writes to device memory to reach device memory. The barrier completes at the L2.

Thanks, your answer is very helpful to me.

Time -->
[Ts][Launch][First Warp ---> Last Warp][Flush][Te]
    [elapsed cycles                              ]
where,
  Ts - trigger start collection
  Te - trigger end collection

Launch overhead can be approximated with a null kernel.
Flush overhead varies with the amount of in-flight memory writes to each destination at the end of the grid.
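
As a sketch of how the two overheads could be compared (kernel names and sizes are illustrative): profile a null kernel next to a kernel that leaves many global writes in flight at the end of the grid, and compare the per-kernel elapsed cycles in the report.

  // Illustrative only: nullKernel approximates launch overhead, while writeKernel
  // adds in-flight global writes that the end-of-grid barrier must drain to L2
  // (the [Flush] stage above).
  #include <cuda_runtime.h>

  __global__ void nullKernel() {}

  __global__ void writeKernel(int *buf, int n) {
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
           i += gridDim.x * blockDim.x)
          buf[i] = i;  // global writes still in flight near the end of the grid
  }

  int main() {
      const int n = 1 << 24;
      int *buf = NULL;
      cudaMalloc((void **)&buf, n * sizeof(int));
      nullKernel<<<1, 32>>>();            // ~launch overhead only
      writeKernel<<<256, 256>>>(buf, n);  // launch overhead + write execution + flush of in-flight writes
      cudaDeviceSynchronize();
      cudaFree(buf);
      return 0;
  }

Profiling both launches, e.g. with ncu --metrics gpc__cycles_elapsed.max on the resulting binary, shows the difference; the second launch includes both the write instructions and the larger flush, and exact numbers depend on the chip and clocks.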

If we launch a single kernel, does the [Flush] stage before [Te] flush the L1 or the L2?
