Maximum Tensor Core utilization

Hello, I am conducting a microbenchmark to test how much utilization can be achieved when using both CUDA cores and Tensor Cores simultaneously within a single GPU kernel. As a first step, I am trying to measure the maximum achievable utilization of Tensor Core operations. (The device I am using is an RTX 3090.)

#include <cstdint>
#include <cuda_fp16.h>

#define TENSOR_ITER 8192
// shared_A is read as 12 x uint32_t (24 halves); shared_B as 6 x uint32_t (12 halves)
#define SHARED_A_SIZE 24
#define SHARED_B_SIZE 12

// Constant-memory value used to fill the operand buffers (set from the host)
__constant__ half const_A[1];

extern "C" __global__ void mma_fp16_acc_fp32(float *out) {
    
    uint32_t tid = threadIdx.x;
    
    // FP16 shared-memory operand buffers
    __shared__ half shared_A[SHARED_A_SIZE];
    __shared__ half shared_B[SHARED_B_SIZE];
    // Per-thread accumulator: three m16n8k16 tiles x 4 floats each
    float frag_d[12];

    if (tid < SHARED_A_SIZE){
        shared_A[tid] = const_A[0];
    }
    if (tid < SHARED_B_SIZE){
        shared_B[tid] = const_A[0];
    }
    // frag_d lives in per-thread registers, so every thread zeroes its own copy
    for (int i = 0; i < 12; ++i){
        frag_d[i] = 0.0f;
    }
    
    // Synchronize all threads within TB
    __syncthreads();
    
    // View the half operand buffers as packed 32-bit registers for mma
    uint32_t const *A = reinterpret_cast<uint32_t const *>(&shared_A[0]);
    uint32_t const *B = reinterpret_cast<uint32_t const *>(&shared_B[0]);
    float *C = reinterpret_cast<float *>(&frag_d[0]);
    float *D = C;

    // Three independent m16n8k16 MMAs per iteration to keep the tensor pipe busy
    for (unsigned int i = 0; i < TENSOR_ITER; ++i){
        asm volatile(
            "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
            "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
            : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
            : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]), 
            "r"(B[0]), "r"(B[1]),
            "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3])
        );

        asm volatile(
            "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
            "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
            : "=f"(D[4]), "=f"(D[5]), "=f"(D[6]), "=f"(D[7])
            : "r"(A[4]), "r"(A[5]), "r"(A[6]), "r"(A[7]), 
                "r"(B[2]), "r"(B[3]),
                "f"(C[4]), "f"(C[5]), "f"(C[6]), "f"(C[7])
        );

        asm volatile(
            "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
            "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
            : "=f"(D[8]), "=f"(D[9]), "=f"(D[10]), "=f"(D[11])
            : "r"(A[8]), "r"(A[9]), "r"(A[10]), "r"(A[11]), 
                "r"(B[4]), "r"(B[5]),
                "f"(C[8]), "f"(C[9]), "f"(C[10]), "f"(C[11])
        );

        __syncwarp();
    }

    // Write one accumulator element out so the result is observable
    if (tid == 0){
        out[blockIdx.x] = D[0];
    }

}
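
For completeness, here is a minimal host-side driver sketch for launching the kernel above from the same .cu file. The launch configuration, fill value, and output handling are my own assumptions, not a definitive setup:

#include <cstdio>

int main() {
    // Assumed launch configuration: the RTX 3090 has 82 SMs; two blocks of
    // four warps each per SM is an arbitrary but reasonable occupancy choice.
    const int numBlocks = 82 * 2;
    const int blockSize = 128;

    // Fill value for the operand buffers; the actual data content does not
    // matter for a throughput microbenchmark.
    half h = __float2half(1.0f);
    cudaMemcpyToSymbol(const_A, &h, sizeof(half));

    float *d_out;
    cudaMalloc(&d_out, numBlocks * sizeof(float));

    mma_fp16_acc_fp32<<<numBlocks, blockSize>>>(d_out);
    cudaDeviceSynchronize();

    // Read back one element so the run has an observable side effect
    float first;
    cudaMemcpy(&first, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", first);

    cudaFree(d_out);
    return 0;
}

I build it with something like nvcc -arch=sm_86 mma_bench.cu -o mma_bench (sm_86 is the RTX 3090's compute architecture; mma.sync.m16n8k16 with FP16 inputs requires sm_80 or newer).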

As shown in the code above, to maximize Tensor Core utilization I wrote a kernel that issues the Tensor Core operations explicitly as inline PTX mma instructions, so that the Tensor Cores are kept busy as much as possible.
However, no matter how much I increase the iteration count and minimize memory accesses, the Nsight Compute metric sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active never exceeds 50%.
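For reference, I collect the metric with an Nsight Compute command along these lines (the binary name is just a placeholder):

ncu --metrics sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active ./mma_bench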
Here is the profiled result.

I am wondering whether this value is the hardware-imposed upper limit or if there is a way to further increase the measured metric.

50% is not the HW limit as such. The metric is designed to reach 100% (or close to it) at maximum utilization. However, that is not trivial to achieve in all cases; e.g., sometimes different types of instructions need to be issued in a specific sequence to reach maximum utilization.

I will check with the team on more details.

On the SM used in GeForce cards, this metric is capped at 50% for HMMA (FP16, BF16, TF32) with FP32 accumulate. This is a defect in the metric that cannot be fixed, due to the number of src x dst x sparsity combinations handled by the PM counter.

The metric sm__ops_path_tensor_src_fp16_dst_fp32.avg.pct_of_peak_sustained_elapsed should be correctly scaled to show it reaching 100% for your specific use case.
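For example, it can be collected the same way (again with a placeholder binary name):

ncu --metrics sm__ops_path_tensor_src_fp16_dst_fp32.avg.pct_of_peak_sustained_elapsed ./mma_bench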

Thank you for the detailed answer!