Hello!
I’ve written some benchmarks and observed that the MFU and ALU throughput on the Nvidia Orin is only ~1/6 of that on the RTX 3080. However, the FMA and Tensor Core throughput on Orin can reach up to ~1/3 of the RTX 3080’s performance.
Based on the white paper, the RTX 3080 has 119 TFLOPS of FP16 Tensor Core throughput. For Orin, the number is 42.5 TFLOPS, which makes a ~2.8x difference. Profiling of our various workloads has also confirmed this ratio.
However, when it comes to the MFU or ALU, the throughput observed in our experiments shows a ~6x difference.
For example, the following experiment profiles the throughput of hexp2 operations, which are executed on the MFU:
#include <cuda_fp16.h>
#include <iostream>
#include <cstdlib>

__global__ void testing_kernel(const half* A, const half* B, half* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        half sum = A[idx];  // load inside the bounds check to avoid an out-of-bounds read
        for (int i = 0; i < (1 << 20); i++) {
            sum += hexp2(sum);
        }
        C[idx] = sum;
    }
}
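One caveat worth checking before comparing against peak numbers: the loop above is a serial dependency chain (each hexp2 depends on the previous result), so it measures MFU latency as much as throughput, and a latency difference between the two chips would be amplified. A variant with several independent accumulators exposes more instruction-level parallelism and gets closer to pipeline throughput. This is only a sketch under that assumption; the accumulator count of four and the offsets are arbitrary:

__global__ void testing_kernel_ilp(const half* A, half* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        // Four independent chains so the MFU pipeline can stay busy
        half s0 = A[idx];
        half s1 = s0 + __float2half(0.25f);
        half s2 = s0 + __float2half(0.50f);
        half s3 = s0 + __float2half(0.75f);
        for (int i = 0; i < (1 << 18); i++) {  // 4 chains * 2^18 = 2^20 hexp2 ops per thread
            s0 += hexp2(s0);
            s1 += hexp2(s1);
            s2 += hexp2(s2);
            s3 += hexp2(s3);
        }
        C[idx] = s0 + s1 + s2 + s3;  // keep all chains live so none is optimized away
    }
}

If the 6x gap shrinks with this version, latency hiding rather than raw MFU throughput is likely the limiter on Orin.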
// Utility function to check for CUDA errors
void checkCudaError(cudaError_t err, const char* msg) {
    if (err != cudaSuccess) {
        std::cerr << "Error: " << msg << " (" << cudaGetErrorString(err) << ")" << std::endl;
        exit(EXIT_FAILURE);
    }
}
void my_kernel_run() {
    int N = 512 * 64;
    half* h_A = (half*) malloc(N * sizeof(half));
    half* h_B = (half*) malloc(N * sizeof(half));
    half* h_C = (half*) malloc(N * sizeof(half));
    for (int i = 0; i < N; i++) {
        h_A[i] = (1.0f + i) / 10000;
        h_B[i] = (2.0f + i) / 10000;
    }
    half *d_A, *d_B, *d_C;
    checkCudaError(cudaMalloc((void**)&d_A, N * sizeof(half)), "Allocating d_A");
    checkCudaError(cudaMalloc((void**)&d_B, N * sizeof(half)), "Allocating d_B");
    checkCudaError(cudaMalloc((void**)&d_C, N * sizeof(half)), "Allocating d_C");
    // Copy data from host to device
    checkCudaError(cudaMemcpy(d_A, h_A, N * sizeof(half), cudaMemcpyHostToDevice), "Copying to d_A");
    checkCudaError(cudaMemcpy(d_B, h_B, N * sizeof(half), cudaMemcpyHostToDevice), "Copying to d_B");
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    testing_kernel<<<numBlocks, blockSize>>>(d_A, d_B, d_C, N);
    checkCudaError(cudaGetLastError(), "Launching testing_kernel");
    // Block until the kernel finishes, so launch errors surface here
    // and any external timing brackets the actual execution
    checkCudaError(cudaDeviceSynchronize(), "Synchronizing after testing_kernel");
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);
}
I’m curious about the reasons behind this discrepancy, and whether there’s a way to optimize MFU and ALU throughput on Orin so that it also reaches roughly 1/3 of the RTX 3080’s performance.
Thank you so much!