Nsight Compute - Roofline chart

Hello,

I am new to using Nsight Compute and have a question about the roofline chart. When I profile different kernels in Nsight Compute and view their roofline charts, nothing is shown for some kernels, such as histogram (in the CUDA samples), which doesn't have floating-point operations. Does the roofline chart work only for kernels with floating-point operations? I have seen academic papers that classify different types of kernels using the roofline model. How do they do that?

Hi, @sahar.mobaiyen

You can select the "roofline" metric set to enable more roofline charts.

Hi Veraj,
Thank you for your answer. I have tried all of those metric sets, but for certain programs, such as histogram in the CUDA samples or Breadth-First Search in Rodinia, I can't see the achieved value for the kernel on the roofline chart.

Hi, @sahar.mobaiyen

I tried histogram and your observation is right: I also don't see the achieved values.
However, I think this is expected, because when I execute

ncu --set roofline --print-details body -c 1 ./histogram

and search for "achieved value" in the output, the related metrics are 0. When they are 0, the point is not shown on the chart.

Regarding whether it is reasonable for these metrics to be 0, I will check internally and get back to you if there is any update.

Thank you for checking that. I will wait for the results.

My main goal in using a roofline chart is to classify the kernels based on their behavior (compute-intensive or memory-intensive) and then use the results in a scheduler. However, now the question is whether the roofline chart is applicable to all types of kernels or if it is designed for specific types.

A roofline is applicable to all types of kernels.

Nsight Compute does not provide rooflines for non-tensor integer operations.

Counting OPs is complicated. For FP16, FP32, and FP64, NCU uses shader patching to collect the number of predicated-on threads for each instruction type, applying a weight of 2 per predicated-on thread to fused multiply-add instructions and a weight of 1 per predicated-on thread to other ops.
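As a concrete example, for FP32 that weighting amounts to the following (using the per-thread SASS metrics that appear later in this thread):

FP32 FLOPs = 2 * sm__sass_thread_inst_executed_op_ffma_pred_on.sum
               + sm__sass_thread_inst_executed_op_fadd_pred_on.sum
               + sm__sass_thread_inst_executed_op_fmul_pred_on.sum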

The NVIDIA GPU has a large number of integer instructions. There is no industry standard for counting operations. For example, do the following count as integer operations:

  • shifts
  • logical operations
  • pop count
  • comparison

Furthermore, should address calculations be counted? At the assembly level it is not easy to distinguish address calculation from numeric calculation without full data flow analysis.

Nsight Compute does not currently have the correct metrics to generate an INT64, INT32, INT16, or INT8 roofline. The metric smsp__sass_thread_inst_executed_op_integer_pred_on includes integer instructions that I would not include in an "ops" counter, includes tensor instructions, does not separate data widths, etc.

Nsight Compute has a powerful Python rule system and supports export of the source view. An integer (or other data format) roofline could be calculated by parsing the SASS code for matching opcodes and using either the instructions-executed or thread-instructions-executed counts with a per-opcode weight to get an ops count. If this were done, the roofline section file (or a new section file) could be extended to show the output. This would also require determining the maximum integer throughput, which can be looked up or measured via micro-benchmarks.
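To make the idea concrete, below is a very rough sketch, not something Nsight Compute ships: it assumes the source view has been exported to CSV, that each row carries the SASS text and a thread-instructions-executed count, and that the column names and per-opcode weights are placeholders you would adapt to your actual export and to your own definition of an integer op.

import csv

# Placeholder per-opcode weights; deciding what counts as an "integer op"
# (shifts, logic ops, comparisons, address math, ...) is exactly the policy
# question discussed above.
INT_WEIGHTS = {
    "IADD3": 1,   # integer add
    "IMAD": 2,    # integer multiply-add, counted like an FMA here
    "ISETP": 1,   # comparison
    "LOP3": 1,    # logical operation
    "SHF": 1,     # shift
    "POPC": 1,    # population count
}

def sass_opcode(sass_text):
    """Extract the base opcode from a SASS line, e.g. '@P0 IMAD.WIDE R4, ...' -> 'IMAD'."""
    tokens = sass_text.strip().split()
    if not tokens:
        return None
    tok = tokens[1] if tokens[0].startswith("@") and len(tokens) > 1 else tokens[0]
    return tok.split(".")[0]

def weighted_integer_ops(csv_path):
    """Sum weighted integer ops from a (hypothetical) source-view CSV export.

    'Source' and 'Thread Instructions Executed' are assumed column names;
    check the header of your actual export.
    """
    total = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            op = sass_opcode(row.get("Source", ""))
            if op in INT_WEIGHTS:
                count = int(row.get("Thread Instructions Executed", "0").replace(",", "") or 0)
                total += INT_WEIGHTS[op] * count
    return total

if __name__ == "__main__":
    print(weighted_integer_ops("kernel_source_view.csv"))

Dividing that count by dram__bytes.sum (or the L2/L1 byte counts) would then give an integer arithmetic intensity to plot against a measured integer peak.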

If there are specific data type rooflines that you would like to see supported in Nsight Compute please file a bug report. Please try to be as specific as possible regarding what operations you want counted and what weight should be applied.


I do not have specific data types for roofline analysis. My goal is to use Nsight parameters to classify different kernels as either compute-intensive or memory-intensive. After classification, I plan to profile them by assigning various numbers of SMs (Streaming Multiprocessors). By calculating the approximate AI (Arithmetic Intensity) and comparing it to the ridge point of the system, I aim to estimate the minimum number of SMs needed for each memory-intensive kernel.

To achieve this, I need to obtain the achieved AI for different kernel types. However, I am encountering difficulties with compute-intensive kernels, as I am not getting valid AI values from my metrics, even when using the full metric set in Nsight Compute. Currently, I use the following metrics to calculate the AI of the kernel:
Time

metrics="sm__cycles_elapsed.avg,\
sm__cycles_elapsed.avg.per_second,"

DP

metrics+="sm__sass_thread_inst_executed_op_dadd_pred_on.sum,\
sm__sass_thread_inst_executed_op_dfma_pred_on.sum,\
sm__sass_thread_inst_executed_op_dmul_pred_on.sum,"

SP

metrics+="sm__sass_thread_inst_executed_op_fadd_pred_on.sum,\
sm__sass_thread_inst_executed_op_ffma_pred_on.sum,\
sm__sass_thread_inst_executed_op_fmul_pred_on.sum,"

HP

metrics+="sm__sass_thread_inst_executed_op_hadd_pred_on.sum,\
sm__sass_thread_inst_executed_op_hfma_pred_on.sum,\
sm__sass_thread_inst_executed_op_hmul_pred_on.sum,"

Tensor Core

metrics+="sm__inst_executed_pipe_tensor.sum,"

DRAM, L2 and L1

metrics+="dram__bytes.sum,\
lts__t_bytes.sum,\
l1tex__t_bytes.sum"
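From these I then compute the AI roughly as in the standard roofline formulation, e.g. for double precision at the DRAM level (and analogously with lts__t_bytes.sum or l1tex__t_bytes.sum for the L2/L1 levels):

DP FLOPs = 2 * sm__sass_thread_inst_executed_op_dfma_pred_on.sum
             + sm__sass_thread_inst_executed_op_dadd_pred_on.sum
             + sm__sass_thread_inst_executed_op_dmul_pred_on.sum

AI (DRAM)   = DP FLOPs / dram__bytes.sum
Ridge point = peak FLOP/s / peak DRAM bandwidth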

Can I use any other metrics for programs such as histogram, breadth-first search, and matrix transpose to calculate the approximate AI and classify them into memory-intensive and compute-intensive groups?

Why can't you use the Compute and Memory Throughput metrics from the GPU Speed Of Light Throughput section to classify your kernels as compute- and/or memory-intensive (just like ncu does)?

My initial project was about using the Roofline model to estimate the relationship between the achieved AI value and the minimum number of required SMs for memory-intensive kernels in spatial partitioning. Additionally, I investigated compute and memory throughput values for various numbers of SMs but could not identify a clear relationship between these values and the minimum SMs needed.

My personal opinion is that the throughput metrics are a superior method to arithmetic intensity, given the SM pipelines. The new CUDA Green Contexts can help determine compute scaling by running on smaller and smaller green contexts. There is no equivalent method to limit the memory subsystem.

DP - This is correct for FLOPS and counts only predicated-true threads. A kernel with all DADD instructions (weight = 1) will have 1/2 the AI of a kernel using all DFMA instructions (weight = 2). Both could have the same pipeline throughput. This cannot be resolved with AI, but pipeline throughput is part of the Compute Throughput.
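In numbers, for the same byte count:

N DADD instructions -> N FLOPs  -> AI = N / bytes
N DFMA instructions -> 2N FLOPs -> AI = 2N / bytes   (same number of FP64 pipe issues)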

SP - This is correct for FLOPS. See DP.

HP - This is correct for FLOPS. See DP.

Tensor - This is a bad choice, as there are many different tensor instructions with different instruction issue rates and latencies. For Tensor OPs use the sm__ops_path_tensor* metrics.

ncu --query-metrics | grep sm__ops_

EXAMPLE

sm__ops_path_tensor_src_fp16_dst_fp32
sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_off
sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_on

The metrics are in a string hierarchy.

sm__ops_path_tensor_src_fp16_dst_fp32 = sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_off + sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_on

Useful metric rollups are:

<metric>.sum is the total operations
<metric>.sum.per_second is the total operations/second
<metric>.avg.pct_of_peak_sustained_elapsed is the average % throughput on each SM (includes idle SM cycles)
<metric>.avg.pct_of_peak_sustained_active is the average % throughput on each SM (only active SM cycles)
<metric>.avg.peak_sustained is the number of operations/cycle/SM
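For example, a couple of these rollups can be collected directly on the command line (./app standing in for your application):

ncu --metrics sm__ops_path_tensor_src_fp16_dst_fp32.sum,sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_off.avg.pct_of_peak_sustained_elapsed ./app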

NOTE: When in a string hierarchy, the peak_sustained is the highest value of the children. This matters because, for most types, sparsity_on has a 2x rate over sparsity_off. For an application using sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_off, the maximum values for the metrics are as follows:

sm__ops_path_tensor_src_fp16_dst_fp32.avg.pct_of_peak_sustained_elapsed = 50%, since its .peak_sustained is based on the sparsity_on rate.
sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_off.avg.pct_of_peak_sustained_elapsed = 100%

Can I use any other metrics for programs such as histogram, breadth-first search, and matrix transpose to calculate the approximate AI and classify them into memory-intensive and compute-intensive groups?

These can be trivially classified using the NCU Compute Throughput vs. Memory Throughput (see the sketch after this list) as

  1. Compute Bound
  2. Memory Bound
  3. Latency Bound (neither Compute nor Memory is high)
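As a minimal sketch of such a rule (the 60% threshold is only an illustration, not NCU's internal value):

def classify(compute_tput_pct, memory_tput_pct, high=60.0):
    """Classify a kernel from its GPU Speed Of Light throughput percentages."""
    if compute_tput_pct >= high and compute_tput_pct >= memory_tput_pct:
        return "compute bound"
    if memory_tput_pct >= high:
        return "memory bound"
    return "latency bound"  # neither compute nor memory throughput is high

print(classify(85.0, 40.0))  # -> compute bound
print(classify(30.0, 25.0))  # -> latency bound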

Since NCU does not have AI calculations for all formats (primarily missing integer), samples that are not floating-point or tensor based cannot be classified using a roofline.

Thank you for your detailed response.

It seems I should shift my approach to using compute and memory throughputs instead of AI. However, it’s crucial for me to minimize the metrics used to reduce overhead during the classification phase. Could you guide me on which metrics would be most effective for estimating kernel behavior? Are there standard thresholds for compute or memory throughput values to classify kernels effectively?

Additionally, CUDA Green Context appears to be a promising solution for my problem. Could you provide any example code related to this?

Additionally, CUDA Green Context appears to be a promising solution for my problem. Could you provide any example code related to this?

Here is a minimal example illustrating the use of green contexts:

#include <cuda_runtime.h>
#include <cuda.h>
#include <cstdio>
#include <cassert>
#include <cstdint>

#define CUDA_RT(call)                                                   \
    do {                                                                \
        cudaError_t _err = (call);                                      \
        if ( cudaSuccess != _err ) {                                    \
            fprintf(stderr, "CUDA error in file '%s' at line %i: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(_err));      \
            return _err;                                                \
        } } while (0)


#define CUDA_DRV(call)                                                  \
    do {                                                                \
        CUresult _status = (call);                                      \
        if ( CUDA_SUCCESS != _status) {                                 \
            fprintf(stderr, "CUDA error in file '%s' at line %i: %i\n", \
                    __FILE__, __LINE__, _status);                       \
            return _status;                                             \
        } } while (0)

__device__ int temp_result;

extern "C" __global__ void timewaster(const int num_iterations)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    volatile float result = 0;
    for (int i = 0; i < num_iterations; i++) {
        result += sqrtf(i + idx) + sinf(idx * i);
    }
    if (idx == 0) {
        temp_result = (int)result;
    }
}

__global__ void kernel()
{
    temp_result += threadIdx.x * blockIdx.x;
}

int main()
{
    CUgreenCtx gctx[2];
    CUdevResourceDesc desc[2];
    CUdevResource input;
    CUdevResource resources[2];
    CUstream streamA;
    CUstream streamB;

    unsigned int nbGroups = 1;
    unsigned int minCount = 0;

    // Initialize device 0
    CUDA_RT(cudaInitDevice(0, 0, 0));
    // Preload
    timewaster<<<1, 512>>>(1);
    kernel<<<1, 512>>>();

    // Query input SMs
    CUDA_DRV(cuDeviceGetDevResource((CUdevice)0, &input, CU_DEV_RESOURCE_TYPE_SM));
    // We want 3/4 the device for our green context
    minCount = (unsigned int)((float)input.sm.smCount * 0.75f);

    // Split my resources
    CUDA_DRV(cuDevSmResourceSplitByCount(&resources[0], &nbGroups, &input, &resources[1], 0, minCount));

    // Create a descriptor/ctx for the main 3/4 partition
    CUDA_DRV(cuDevResourceGenerateDesc(&desc[0], &resources[0], 1));
    CUDA_DRV(cuGreenCtxCreate(&gctx[0], desc[0], (CUdevice)0, CU_GREEN_CTX_DEFAULT_STREAM));
    // ... and one for the remainder 1/4 partition
    CUDA_DRV(cuDevResourceGenerateDesc(&desc[1], &resources[1], 1));
    CUDA_DRV(cuGreenCtxCreate(&gctx[1], desc[1], (CUdevice)0, CU_GREEN_CTX_DEFAULT_STREAM));
    // Create streams that we will use from here on out
    CUDA_DRV(cuGreenCtxStreamCreate(&streamA, gctx[0], CU_STREAM_NON_BLOCKING, 0));
    CUDA_DRV(cuGreenCtxStreamCreate(&streamB, gctx[1], CU_STREAM_NON_BLOCKING, 0));

    timewaster<<<1000, 512, 0, (cudaStream_t)streamA>>>(1000);
    kernel<<<1, 512, 0, (cudaStream_t)streamB>>>();

    CUDA_RT(cudaStreamSynchronize((cudaStream_t)streamA));
    CUDA_RT(cudaStreamSynchronize((cudaStream_t)streamB));

    return (0);
}
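For reference, green contexts are a driver API feature, so the example needs to be linked against libcuda and built with a toolkit recent enough to expose these APIs; something along the lines of (file name assumed):

nvcc -o green_ctx_example green_ctx_example.cu -lcuda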

The full documentation of the driver API can be found here.
