GPU metrics in Nsight Systems

Hi,

I am profiling a CUDA application in Nsight Systems and am wondering what "GPC Clock Frequency" and "SYS Clock Frequency" are in the GPU metrics. I am getting different values for these clocks, and I can't find information about what they are anywhere, including in the Nsight Systems documentation.

Please see the User Guide — nsight-systems 2023.4.1 documentation (the "Available metrics" section):

  • GPC Clock Frequency - gpc__cycles_elapsed.avg.per_second
    The average GPC clock frequency in hertz. In public documentation the GPC clock may be called the “Application” clock, “Graphic” clock, “Base” clock, or “Boost” clock. Note: The collection mechanism for GPC can result in a small fluctuation between samples.

  • SYS Clock Frequency - sys__cycles_elapsed.avg.per_second
    The average SYS clock frequency in hertz. The GPU front end (command processor), copy engines, and the performance monitor run at the SYS clock. On Turing and NVIDIA GA100 GPUs the sampling frequency is based upon a period of SYS clocks (not time) so samples per second will vary with SYS clock. On NVIDIA GA10x GPUs the sampling frequency is based upon a fixed frequency clock. The maximum frequency scales linearly with the SYS clock.
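To cross-check the values Nsight Systems reports, here is a minimal sketch (my own, assuming the nvidia-ml-py / pynvml package and GPU index 0) that polls the SM and memory clocks through NVML while the workload runs; the SM clock it reads is the GPC clock described above:

# Poll GPU clocks via NVML to compare against Nsight Systems' GPC Clock Frequency.
# Assumes the nvidia-ml-py (pynvml) package is installed and the profiled GPU is index 0.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # adjust index to the profiled GPU

for _ in range(10):
    sm_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)      # current SM (GPC) clock
    mem_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM)    # current memory clock
    max_sm_mhz = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)
    print(f"SM clock {sm_mhz} MHz (max {max_sm_mhz} MHz), memory clock {mem_mhz} MHz")
    time.sleep(0.5)

pynvml.nvmlShutdown()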

Hello, the values of GPC Clock Frequency and SYS Clock Frequency varied across my profiling results,
and the kernel duration is affected by them.

So I want to know why these two frequencies affect the kernel duration, and how they affect it.

I use the code below to profile a kernel 1000 times. The kernel duration is ~0.22 ms at the beginning and ~0.18 ms at the end. I found this is because the GPC Clock Frequency and SYS Clock Frequency changed (see also the timing sketch after prof_kernel.py).

run_nsys.sh

#!/bin/sh

export OMP_NUM_THREADS=1

export CUDA_VISIBLE_DEVICES=4,5
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

MASTER_ADDR=localhost
MASTER_PORT=29500

# --gpu-metrics-devices=cuda-visible samples GPU metrics on the GPUs listed in CUDA_VISIBLE_DEVICES
nsys profile --gpu-metrics-devices=cuda-visible -o ./report/attn_model_kernel/nsys/tmp1 \
    torchrun --nproc_per_node=2 --master_addr $MASTER_ADDR --master_port $MASTER_PORT prof_kernel.py

prof_kernel.py

import os
import numpy as np

import torch
import torch.distributed as dist

import torch.utils.benchmark as benchmark
from torch.backends.cuda import sdp_kernel, SDPBackend
import torch.multiprocessing as mp

backend_map = {
    SDPBackend.MATH: {"enable_math": True, "enable_flash": False, "enable_mem_efficient": False},
    SDPBackend.FLASH_ATTENTION: {"enable_math": False, "enable_flash": True, "enable_mem_efficient": False},
    SDPBackend.EFFICIENT_ATTENTION: {
        "enable_math": False, "enable_flash": False, "enable_mem_efficient": True}
}

def compute_kernel(max_num=10):
    rank_id = torch.cuda.current_device()
    device = f"cuda:{rank_id}"    
    batch_size = 2
    max_sequence_len = 512
    num_heads = 96
    embed_dimension = 128        
    dtype = torch.bfloat16        
    query = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, dtype=dtype, device=device)
    key = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, dtype=dtype, device=device)
    value = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, dtype=dtype, device=device)
    attn_output = None
    for _ in range(max_num):
        # prof.step()        
        with sdp_kernel(**backend_map[SDPBackend.EFFICIENT_ATTENTION]):
            try:
                # is_causal = True if attention_mask is None and q_len > 1 else False            
                attn_output = torch.nn.functional.scaled_dot_product_attention(
                    query=query,
                    key=key,
                    value=value,
                    attn_mask=None,
                    dropout_p=0.0,
                    is_causal=True,
                )
            except RuntimeError:
                print("EfficientAttention is not supported. See warnings for reasons.")
    return attn_output
   
def main():
    # torchrun launches one process per GPU; each process binds to the device matching its rank.
    os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "0"
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    my_device = f"cuda:{rank}"
    torch.cuda.set_device(my_device)    
    attn_out = compute_kernel(max_num=1000)

 
if __name__ == "__main__":
    main()
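As a side check (my own sketch, not from the run above): timing the same scaled_dot_product_attention call with CUDA events after a warmup, while reading the SM clock through NVML, makes the clock/duration relationship visible directly. The tensor shapes mirror compute_kernel; the pynvml package, device index 0, and the helper name time_sdpa are assumptions on my part.

import torch
import pynvml

def time_sdpa(iters=100, warmup=20, device="cuda:0"):
    # Same shapes as compute_kernel above.
    q = torch.rand(2, 96, 512, 128, dtype=torch.bfloat16, device=device)
    k = torch.rand_like(q)
    v = torch.rand_like(q)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    # Warmup so the clocks have a chance to ramp up before measuring.
    for _ in range(warmup):
        torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
    torch.cuda.synchronize(device)
    start.record()
    for _ in range(iters):
        torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
    end.record()
    torch.cuda.synchronize(device)
    return start.elapsed_time(end) / iters  # average milliseconds per call

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes device 0; adjust as needed
sm_before = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
avg_ms = time_sdpa()
sm_after = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
print(f"avg kernel time {avg_ms:.3f} ms, SM clock {sm_before} -> {sm_after} MHz")
pynvml.nvmlShutdown()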

GPCCLK is the clock for the SMs and the L1TEX cache. Compute-bound kernels scale close to linearly with GPCCLK. It is hard to state how memory-bound (L2 or DRAM) or latency-bound (L2 or DRAM) kernels vary with GPCCLK. In some cases the duration will not be impacted; in other cases the compute between issuing dependent memory operations, or the ability to issue memory operations at a sufficient rate, will be limited by a slower GPCCLK.
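As a rough illustration of the compute-bound case (my own back-of-the-envelope numbers, not measurements from the report above): duration scales roughly as 1/GPCCLK, so going from ~0.22 ms to ~0.18 ms corresponds to roughly a 0.22 / 0.18 ≈ 1.2x higher GPC clock, i.e. the GPU ramping from a lower initial frequency to its sustained boost clock.

# Rough scaling model for a compute-bound kernel: duration is proportional to 1 / GPCCLK.
# The clock values below are made-up placeholders purely for illustration.
def expected_duration_ms(measured_ms, clk_measured_mhz, clk_new_mhz):
    return measured_ms * clk_measured_mhz / clk_new_mhz

# e.g. 0.22 ms at 1155 MHz would drop to ~0.18 ms at 1410 MHz
print(expected_duration_ms(0.22, 1155, 1410))  # -> ~0.18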