Why is the sm__warps_active metric so high?

Hi,

I am new to Nsight Compute and am having a hard time understanding SMs.

Here is a simple piece of code I'm working on:

#include <iostream>
#include <cuda_runtime.h>
#include <vector>

__global__ void matrixMultiplyKernel(const float *A, const float *B, float *C, int A_rows, int A_cols, int B_cols)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < A_rows && col < B_cols)
    {
        float sum = 0.0f;
        for (int i = 0; i < A_cols; ++i)
        {
            sum += A[row * A_cols + i] * B[i * B_cols + col];
        }
        C[row * B_cols + col] = sum;
    }
}

void matrixMultiply(const float *h_A1, const float *h_B1, float *h_C1, int A_rows, int A_cols, int B_cols)
{

    size_t size_A = A_rows * A_cols * sizeof(float);
    size_t size_B = A_cols * B_cols * sizeof(float);
    size_t size_C = A_rows * B_cols * sizeof(float);

    float *d_A1, *d_B1, *d_C1;
    cudaMalloc((void **)&d_A1, size_A);
    cudaMalloc((void **)&d_B1, size_B);
    cudaMalloc((void **)&d_C1, size_C);
    
    cudaMemcpyAsync(d_A1, h_A1, size_A, cudaMemcpyHostToDevice);
    cudaMemcpyAsync(d_B1, h_B1, size_B, cudaMemcpyHostToDevice);

    dim3 blockSize(4, 4);
    dim3 gridSize((B_cols + blockSize.x - 1) / blockSize.x, (A_rows + blockSize.y - 1) / blockSize.y);

    matrixMultiplyKernel<<<gridSize, blockSize>>>(d_A1, d_B1, d_C1, A_rows, A_cols, B_cols);
    
    cudaMemcpyAsync(h_C1, d_C1, size_C, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize(); // make sure the kernel and the copy have finished before freeing

    cudaFree(d_A1);
    cudaFree(d_B1);
    cudaFree(d_C1);
}

int main()
{
    int runtimeVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);
    std::cout << "CUDA Runtime Version: " << runtimeVersion / 1000 << "." << (runtimeVersion % 1000) / 10 << std::endl;

    int sm_count = 0;
    cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, 0);
    std::cout << "Total SMs: " << sm_count << std::endl;


    const int A_rows = 4, A_cols = 3, B_rows = 3,  B_cols = 4;

    std::vector<float> h_A1(A_rows * A_cols, 1.0f);
    std::vector<float> h_B1(A_cols * B_cols, 2.0f);
    std::vector<float> h_C1(A_rows * B_cols, 0.0f);

    matrixMultiply(h_A1.data(), h_B1.data(), h_C1.data(), A_rows, A_cols, B_cols);

    return 0;
}

This code simply multiplies two matrices A and B, of sizes [4,3] and [3,4] respectively, producing a [4,4] matrix.

The grid size is [1,1] and the block size is [4,4] for the kernel.
Given this block size, I expected it to activate 1 warp: there are only 4*4 = 16 threads in the single block, and one warp can run 32 threads. However, when I ran this code with ncu and looked at the metric sm__warps_active, it reported:

sm__warps_active.avg [warp] 34.46
sm__warps_active.max [warp] 2826
sm__warps_active.min [warp] 0
sm__warps_active.sum [warp] 2826

So now I am very confused. Why is this reported as 2826 instead of 1?

Thanks for your attention.

The metric is unfortunately not fully intuitive from its name alone. If you inspect its description (e.g., through the Metric Details tool window), you will find it is the “cumulative # of warps in flight”. “Cumulative” (as in “aggregated”) here refers to the elapsed cycles of the measurement: the metric counts the number of warps in flight, summed over all elapsed cycles.

As an example, for a kernel with a single warp (as in your case), the .max of this metric is expected to match the total active cycles (sm__cycles_active.max). I hope this clarifies the confusion.

The RTX 3090 has 82 SMs.

sm__warps_active.avg [warp] 34.46
sm__warps_active.max [warp] 2826

sm__warps_active.avg is the average value across all SMs. The target grid launches only 1 thread block, so only 1 SM has work.

34.46 * 82 = 2825.72

sm__warps_active.max is the highest value reported among all SMs.

Nsight Compute performs best if the workload fully saturates the GPU.