Difference in TensorCore/FMA and MFU/ALU Performance on Orin Compared to 3080

Hello!

I’ve written some benchmarks and observed that MFU and ALU throughput on the NVIDIA Orin is only ~1/6 of that on the RTX 3080, whereas FMA and TensorCore throughput on Orin reaches up to ~1/3 of the RTX 3080’s performance.

Based on the white papers, the 3080 has 119 TFLOPS of FP16 TensorCore throughput. For Orin, the number is 42.5 TFLOPS, which makes a 2.8X difference. Profiling of our various workloads has also confirmed this ratio.

However, for the MFU and ALU pipelines, the throughput observed in our experiments shows a ~6X difference.
For example, the following experiment profiles the throughput of hexp2 operations, which are executed on the MFU:

#include <cuda_fp16.h>
#include <iostream>
#include <cstdlib>

// Stresses the MFU: each loop iteration issues one half-precision exp2.
__global__ void testing_kernel(const half* A, const half* B, half* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        half sum = A[idx];  // load inside the bounds check to avoid an out-of-range read
        for (int i = 0; i < (1 << 20); i++) {
            sum += hexp2(sum);
        }
        C[idx] = sum;
    }
}

// Utility function to check for CUDA errors
void checkCudaError(cudaError_t err, const char* msg) {
    if (err != cudaSuccess) {
        std::cerr << "Error: " << msg << " (" << cudaGetErrorString(err) << ")" << std::endl;
        exit(EXIT_FAILURE);
    }
}


void my_kernel_run() {
  int N = 512*64;

  half* h_A = (half*) malloc(N * sizeof(half));
  half* h_B = (half*) malloc(N * sizeof(half));
  half* h_C = (half*) malloc(N * sizeof(half));

  for (int i = 0; i < N; i++) {
    h_A[i] = (1.0f + i) / 10000;
    h_B[i] = (2.0f + i) / 10000;
  }

  half* d_A, *d_B, *d_C;
  checkCudaError(cudaMalloc((void**)&d_A, N * sizeof(half)), "Allocating d_A");
  checkCudaError(cudaMalloc((void**)&d_B, N * sizeof(half)), "Allocating d_B");
  checkCudaError(cudaMalloc((void**)&d_C, N * sizeof(half)), "Allocating d_C");

  // Copy data from host to device
  checkCudaError(cudaMemcpy(d_A, h_A, N * sizeof(half), cudaMemcpyHostToDevice), "Copying to d_A");
  checkCudaError(cudaMemcpy(d_B, h_B, N * sizeof(half), cudaMemcpyHostToDevice), "Copying to d_B");

  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;

  testing_kernel<<<numBlocks, blockSize>>>(d_A, d_B, d_C, N);
  checkCudaError(cudaGetLastError(), "Launching testing_kernel");
  checkCudaError(cudaDeviceSynchronize(), "Synchronizing after testing_kernel");

  cudaFree(d_A);
  cudaFree(d_B);
  cudaFree(d_C);
  free(h_A);
  free(h_B);
  free(h_C);
}
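
Independently of Nsight, the kernel duration can also be sanity-checked with CUDA events. The snippet below is a minimal sketch that would replace the plain launch inside my_kernel_run above (it reuses N, numBlocks, blockSize, and the device pointers from that function; the ops count assumes one hexp2 per loop iteration, as in the kernel):

cudaEvent_t start, stop;
checkCudaError(cudaEventCreate(&start), "Creating start event");
checkCudaError(cudaEventCreate(&stop), "Creating stop event");

checkCudaError(cudaEventRecord(start), "Recording start event");
testing_kernel<<<numBlocks, blockSize>>>(d_A, d_B, d_C, N);
checkCudaError(cudaEventRecord(stop), "Recording stop event");
checkCudaError(cudaEventSynchronize(stop), "Waiting for stop event");

float ms = 0.0f;
checkCudaError(cudaEventElapsedTime(&ms, start, stop), "Reading elapsed time");
double ops = double(N) * double(1 << 20);  // one hexp2 per iteration per thread
std::cout << "hexp2 throughput: " << ops / (ms * 1e-3) / 1e9 << " Gop/s" << std::endl;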

I’m curious about the reasons behind this discrepancy and whether there is a way to optimize MFU and ALU throughput on Orin so that it also reaches closer to 1/3 of the RTX 3080’s performance.

Thank you so much!

Hi,

Is this a duplicate of the topic below?

If yes, maybe we can close that one and track the issue on this topic instead, as this one is more specific.

We want to reproduce this issue in our environment first.
Would you mind sharing a complete and compilable source with us?

Thanks.

Hello! Thank you so much for your response.

Yes, these two are the same issue.
It would be wonderful if you could help us reproduce this issue.

We use Nsight Systems to calculate the throughput:
nsys profile --gpu-metrics-device=all --gpu-metrics-set=ga10b-gfxt
The report includes not only the kernel execution time but also the utilization of the various pipelines, e.g., FMA and MFU.
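
For reference, a full invocation might look like the following (the binary and output names here are placeholders; on the RTX 3080 the matching desktop-Ampere metrics set is selected instead of ga10b-gfxt, and passing help to --gpu-metrics-set should list the sets available on a given machine):

nsys profile --gpu-metrics-device=all --gpu-metrics-set=ga10b-gfxt \
    -o orin_mfu_report ./mfu_bench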

We can use Relative_throughput = 1 / (avg_pipeline_utilization × execution_time), with the utilization taken as a fraction, to estimate the relative throughput between the 3080 and Orin.
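
As a worked example with purely hypothetical numbers (not our measurements), the arithmetic looks like this:

#include <iostream>

// Relative throughput from average pipeline utilization (a fraction in 0..1)
// and kernel execution time in seconds.
double relative_throughput(double utilization, double time_s) {
    return 1.0 / (utilization * time_s);
}

int main() {
    // Hypothetical inputs chosen only to illustrate the formula.
    double rt_3080 = relative_throughput(0.80, 0.016);  // 80% MFU busy, 16 ms
    double rt_orin = relative_throughput(0.80, 0.094);  // 80% MFU busy, 94 ms
    std::cout << "3080 / Orin = " << rt_3080 / rt_orin << "X\n";  // ~5.9X
    return 0;
}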

The code shown above is what we use to profile the MFU operation.
Changing testing_kernel to the following profiles the FMA operation:

// Stresses the FMA pipeline: two independent accumulation chains keep the
// loop from being bound by the latency of a single dependency chain.
__global__ void testing_kernel(const half* A, const half* B, half* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        half sum_1 = A[idx];  // loads inside the bounds check to avoid out-of-range reads
        half sum_2 = B[idx];
        for (int i = 0; i < (1 << 20); i++) {
            sum_1 += sum_1 * sum_1;
            sum_2 += sum_2 * sum_2;
        }
        C[idx] = sum_1 + sum_2;
    }
}

The results show that for the FMA pipeline, Relative_throughput_3080 / Relative_throughput_Orin is around 2.8X, while for the MFU pipeline it is around 5.9X. This demonstrates that the two pipelines scale by different factors between these GPUs, even though they are supposed to share the same SM architecture.

Thanks!

Hi,

Thanks for the info.

We will try to reproduce this issue internally and share more info with you.

Thanks.

There has been no update from you for a period, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks

Hi,

Thanks for your patience.

Our internal team has done some tests and would like to know where the Orin FP16 TensorCore throughput figure (42.5 TFLOPS) comes from.
Would you mind sharing which document you found this number in?

Thanks.