Difference in TensorCore/FMA and MFU/ALU Performance on Orin Compared to 3080

Hello!

I’ve written some benchmarks and observed that MFU and ALU throughput on the NVIDIA Orin is only ~1/6 of that on the RTX 3080, whereas FMA and TensorCore throughput on Orin reaches up to ~1/3 of the RTX 3080’s performance.

Based on the white papers, the 3080 has 119 TFLOPS of FP16 TensorCore throughput. For Orin, the number is 42.5 TFLOPS, which makes a 2.8X difference. Profiling of our various workloads has also confirmed this ratio.

However, for the MFU and ALU pipelines, the throughput observed in our experiments shows a ~6X difference.
For example, the following experiment profiles the throughput of hexp2 operations, which are executed on the MFU:

#include <cuda_fp16.h>
#include <iostream>
#include <cstdlib>

// Stresses the MFU: each loop iteration issues one half-precision exp2.
__global__ void testing_kernel(const half* A, const half* B, half* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        half sum = A[idx];  // load inside the bounds check to avoid an out-of-range read
        for (int i = 0; i < (1 << 20); i++) {
            sum += hexp2(sum);
        }
        C[idx] = sum;
    }
}

// Utility function to check for CUDA errors
void checkCudaError(cudaError_t err, const char* msg) {
    if (err != cudaSuccess) {
        std::cerr << "Error: " << msg << " (" << cudaGetErrorString(err) << ")" << std::endl;
        exit(EXIT_FAILURE);
    }
}


void my_kernel_run() {
  int N = 512*64;

  half* h_A = (half*) malloc(N * sizeof(half));
  half* h_B = (half*) malloc(N * sizeof(half));
  half* h_C = (half*) malloc(N * sizeof(half));

  for (int i = 0; i < N; i++) {
    h_A[i] = (1.0f + i) / 10000;
    h_B[i] = (2.0f + i) / 10000;
  }

  half* d_A, *d_B, *d_C;
  checkCudaError(cudaMalloc((void**)&d_A, N * sizeof(half)), "Allocating d_A");
  checkCudaError(cudaMalloc((void**)&d_B, N * sizeof(half)), "Allocating d_B");
  checkCudaError(cudaMalloc((void**)&d_C, N * sizeof(half)), "Allocating d_C");

  // Copy data from host to device
  checkCudaError(cudaMemcpy(d_A, h_A, N * sizeof(half), cudaMemcpyHostToDevice), "Copying to d_A");
  checkCudaError(cudaMemcpy(d_B, h_B, N * sizeof(half), cudaMemcpyHostToDevice), "Copying to d_B");

  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;

  testing_kernel<<<numBlocks, blockSize>>>(d_A, d_B, d_C, N);
  checkCudaError(cudaGetLastError(), "Launching testing_kernel");
  checkCudaError(cudaDeviceSynchronize(), "Synchronizing after testing_kernel");

  cudaFree(d_A);
  cudaFree(d_B);
  cudaFree(d_C);
  free(h_A);
  free(h_B);
  free(h_C);
}
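
Independently of Nsight, the kernel duration can also be sanity-checked with CUDA events. The snippet below is a minimal sketch that would replace the plain launch inside my_kernel_run above (it reuses N, numBlocks, blockSize, and the device pointers from that function; the ops count assumes one hexp2 per loop iteration, as in the kernel):

cudaEvent_t start, stop;
checkCudaError(cudaEventCreate(&start), "Creating start event");
checkCudaError(cudaEventCreate(&stop), "Creating stop event");

checkCudaError(cudaEventRecord(start), "Recording start event");
testing_kernel<<<numBlocks, blockSize>>>(d_A, d_B, d_C, N);
checkCudaError(cudaEventRecord(stop), "Recording stop event");
checkCudaError(cudaEventSynchronize(stop), "Waiting for stop event");

float ms = 0.0f;
checkCudaError(cudaEventElapsedTime(&ms, start, stop), "Reading elapsed time");
double ops = double(N) * double(1 << 20);  // one hexp2 per iteration per thread
std::cout << "hexp2 throughput: " << ops / (ms * 1e-3) / 1e9 << " Gop/s" << std::endl;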

I’m curious about the reasons behind this discrepancy and whether there is a way to optimize MFU and ALU throughput on Orin so that it also reaches closer to 1/3 of the RTX 3080’s performance.

Thank you so much!

Hi,

Is this a duplicate of the topic below?

If yes, maybe we can close that one and track the issue on this topic instead, as this one is more specific.

We want to reproduce this issue in our environment first.
Would you mind sharing a complete and compilable source with us?

Thanks.

Hello! Thank you so much for your response.

Yes, these two are the same issue.
It would be wonderful if you could help us reproduce this issue.

We use Nsight Systems to calculate the throughput:
nsys profile --gpu-metrics-device=all --gpu-metrics-set=ga10b-gfxt
The report includes not only the kernel execution time but also the utilization of the various pipelines, e.g., FMA and MFU.
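
For reference, a full invocation might look like the following (the binary and output names here are placeholders; on the RTX 3080 the matching desktop-Ampere metrics set is selected instead of ga10b-gfxt, and passing help to --gpu-metrics-set should list the sets available on a given machine):

nsys profile --gpu-metrics-device=all --gpu-metrics-set=ga10b-gfxt \
    -o orin_mfu_report ./mfu_bench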

We can use Relative_throughput = 1 / (avg_pipeline_utilization × execution_time), with the utilization taken as a fraction, to estimate the relative throughput between the 3080 and Orin.
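
As a worked example with purely hypothetical numbers (not our measurements), the arithmetic looks like this:

#include <iostream>

// Relative throughput from average pipeline utilization (a fraction in 0..1)
// and kernel execution time in seconds.
double relative_throughput(double utilization, double time_s) {
    return 1.0 / (utilization * time_s);
}

int main() {
    // Hypothetical inputs chosen only to illustrate the formula.
    double rt_3080 = relative_throughput(0.80, 0.016);  // 80% MFU busy, 16 ms
    double rt_orin = relative_throughput(0.80, 0.094);  // 80% MFU busy, 94 ms
    std::cout << "3080 / Orin = " << rt_3080 / rt_orin << "X\n";  // ~5.9X
    return 0;
}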

The code shown above is what we use to profile the MFU operation.
Changing testing_kernel to the following profiles the FMA operation:

// Stresses the FMA pipeline: two independent accumulation chains keep the
// loop from being bound by the latency of a single dependency chain.
__global__ void testing_kernel(const half* A, const half* B, half* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        half sum_1 = A[idx];  // loads inside the bounds check to avoid out-of-range reads
        half sum_2 = B[idx];
        for (int i = 0; i < (1 << 20); i++) {
            sum_1 += sum_1 * sum_1;
            sum_2 += sum_2 * sum_2;
        }
        C[idx] = sum_1 + sum_2;
    }
}

The results show that for the FMA pipeline, Relative_throughput_3080 / Relative_throughput_Orin is around 2.8X, while for the MFU pipeline it is around 5.9X. This demonstrates that the two pipelines scale by different factors between these GPUs, even though they are supposed to share the same SM architecture.

Thanks!

Hi,

Thanks for the info.

We will try to reproduce this issue internally and share more info with you.

Thanks.

There has been no update from you for a period, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks

Hi,

Thanks for your patience.

Our internal team has done some tests and would like to know where the Orin FP16 TensorCore throughput figure (42.5 TFLOPS) comes from.
Would you mind sharing which document you found this number in?

Thanks.