Conditions on NVJet kernels on Jetson Thor

quan.luo.101 · November 5, 2025, 4:24pm

I’m using CuBLAS to do some simple benchmarking on Jetson Thor.

I found that if I’m using FP16 + FP16 acc, CuBLAS would

Select `cutlass3x_sm100_tensorop_h256x256x16gemm_f16_f16_f16_f16_f16_256x256x64_0_ttn_align8_2sm_bias_f16_relu_stream_k` kernel
Achieve ~60 TFLOPs

And if I’m using FP16 + FP32 acc, CuBLAS would

Select `nvjet_hsh_128x256_64x6_2x1_2cta_v_bz_TTT` kernel
Achieve ~160 TFLOPs

I searched on the Internet and found that the nvjet kernel seems to be tuned for jetson platform. I’m wondering why it’s not used by CuBLAS if we are using FP16 acc?

Is it only supported for mixed precision inference? Does that suggest us to use mix-precision instead of pure FP16?

AastaLLL · November 6, 2025, 4:46am

Hi,

nvJet is cuBLAS backends and not specific to Jetson.

Is the same test you used for the two benchmarking?
Could you share a sample with us so we can check it further internally?

Thanks.

quan.luo.101 · November 6, 2025, 4:44pm

Thanks for replying.

Yes, the results are using the same script only changing the cuBLAS option from CUBLAS_COMPUTE_16F to CUBLAS_COMPUTE_32F.

Here’s the script I used

// tc_gemm_fp16_acc16.cu
// Tensor Core FP16 GEMM benchmark using cuBLAS with **FP16 accumulation**.
//
// Build:
//   nvcc -O3 -std=c++17 tc_gemm_fp16_acc16.cu -o tc_gemm_fp16_acc16 -lcublas
//
// Run examples:
//   ./tc_gemm_fp16_acc16 --m=4096 --n=4096 --k=4096 --iters=200 --warmup=20
//
// Notes:
// - FP16 inputs/outputs with FP16 accumulation (CUBLAS_COMPUTE_16F).
// - Requests Tensor Cores via CUBLAS_GEMM_DEFAULT_TENSOR_OP and TENSOR_OP_MATH.
// - TFLOP/s = (2*M*N*K * iters) / elapsed_time / 1e12.
// - Sizes that are multiples of 128/256 often map best to TC tiles.

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>
#include <iostream>
#include <iomanip>
#include <cuda_fp16.h>

#define CUDA_CHECK(call) do { \
  cudaError_t err = (call); \
  if (err != cudaSuccess) { \
    fprintf(stderr, "CUDA error %s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
    std::exit(EXIT_FAILURE); \
  } \
} while (0)

#define CUBLAS_CHECK(call) do { \
  cublasStatus_t st = (call); \
  if (st != CUBLAS_STATUS_SUCCESS) { \
    fprintf(stderr, "cuBLAS error %s:%d: status %d\n", __FILE__, __LINE__, (int)st); \
    std::exit(EXIT_FAILURE); \
  } \
} while (0)

struct Args {
  int m = 8192, n = 8192, k = 8192;
  int iters = 200;
  int warmup = 20;
  bool column_major = false; // default: treat buffers as row-major and use transpose trick
};

Args parse(int argc, char** argv) {
  Args a;
  for (int i=1;i<argc;++i) {
    std::string s(argv[i]);
    auto val = [&](const char* k)->const char* {
      if (s.rfind(k,0)==0) { const char* p = s.c_str()+strlen(k); if (*p=='=') ++p; return p; }
      return nullptr;
    };
    if (s=="--help"||s=="-h") {
      std::cout << "Usage: " << argv[0]
                << " [--m=M] [--n=N] [--k=K] [--iters=N] [--warmup=N] [--colmajor]\n";
      std::exit(0);
    }
    if (s.rfind("--m",0)==0) a.m = std::stoi(val("--m"));
    else if (s.rfind("--n",0)==0) a.n = std::stoi(val("--n"));
    else if (s.rfind("--k",0)==0) a.k = std::stoi(val("--k"));
    else if (s.rfind("--iters",0)==0) a.iters = std::stoi(val("--iters"));
    else if (s.rfind("--warmup",0)==0) a.warmup = std::stoi(val("--warmup"));
    else if (s=="--colmajor") a.column_major = true;
  }
  return a;
}

// Initialize FP16 buffer with pseudo-random values (GPU-side)
__global__ void init_half(__half* data, size_t n, unsigned long long seed) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    unsigned long long x = seed ^ (1469598103934665603ull + i*1099511628211ull);
    x ^= x >> 12; x ^= x << 25; x ^= x >> 27;
    float f = (float)(x & 0xFFFF) / 65535.0f - 0.5f; // [-0.5,0.5]
    data[i] = __float2half(f);
  }
}

int main(int argc, char** argv) {
  Args a = parse(argc, argv);

  int dev=0; CUDA_CHECK(cudaGetDevice(&dev));
  cudaDeviceProp prop{}; CUDA_CHECK(cudaGetDeviceProperties(&prop, dev));
  std::cout << "Device: " << prop.name << "\n";
  std::cout << "GEMM size: M=" << a.m << " N=" << a.n << " K=" << a.k << "\n";
  std::cout << "Iters: " << a.iters << " Warmup: " << a.warmup << "\n";

  const size_t elemsA = (size_t)a.m * a.k;
  const size_t elemsB = (size_t)a.k * a.n;
  const size_t elemsC = (size_t)a.m * a.n;

  __half* dA=nullptr; __half* dB=nullptr; __half* dC=nullptr;
  CUDA_CHECK(cudaMalloc(&dA, elemsA * sizeof(__half)));
  CUDA_CHECK(cudaMalloc(&dB, elemsB * sizeof(__half)));
  CUDA_CHECK(cudaMalloc(&dC, elemsC * sizeof(__half)));

  // Initialize A,B; zero C
  {
    dim3 blk(256);
    dim3 grdA((unsigned)((elemsA + blk.x - 1) / blk.x));
    dim3 grdB((unsigned)((elemsB + blk.x - 1) / blk.x));
    init_half<<<grdA, blk>>>(dA, elemsA, 0xA5A5A5A5ull);
    init_half<<<grdB, blk>>>(dB, elemsB, 0x5A5A5A5Aull);
    CUDA_CHECK(cudaMemset(dC, 0, elemsC * sizeof(__half)));
    CUDA_CHECK(cudaDeviceSynchronize());
  }

  cublasHandle_t handle; CUBLAS_CHECK(cublasCreate(&handle));
  CUBLAS_CHECK(cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH));

  // Scalars in FP16 (compute=16F requires alpha/beta as FP16)
  __half alpha = __float2half(1.0f);
  __half beta  = __float2half(0.0f);

  // Layout handling (cuBLAS is column-major)
  cublasOperation_t opA = CUBLAS_OP_N, opB = CUBLAS_OP_N;
  int lda, ldb, ldc;
  int m, n, k;

  if (a.column_major) {
    m = a.m; n = a.n; k = a.k;
    lda = a.m; ldb = a.k; ldc = a.m;
    opA = CUBLAS_OP_N; opB = CUBLAS_OP_N;
  } else {
    // Row-major buffers: compute C^T = B^T * A^T
    m = a.n; n = a.m; k = a.k;          // swap m<->n
    lda = a.k; ldb = a.n; ldc = a.n;
    opA = CUBLAS_OP_T; opB = CUBLAS_OP_T;
  }

  // Warm-up
  for (int w=0; w<a.warmup; ++w) {
    CUBLAS_CHECK(cublasGemmEx(
      handle,
      opB, opA,
      m, n, k,
      &alpha,
      dB, CUDA_R_16F, ldb,
      dA, CUDA_R_16F, lda,
      &beta,
      dC, CUDA_R_16F, ldc,
      CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP));
  }
  CUDA_CHECK(cudaDeviceSynchronize());

  // Timed loop
  cudaEvent_t start, stop;
  CUDA_CHECK(cudaEventCreate(&start));
  CUDA_CHECK(cudaEventCreate(&stop));
  CUDA_CHECK(cudaEventRecord(start));
  for (int it=0; it<a.iters; ++it) {
    CUBLAS_CHECK(cublasGemmEx(
      handle,
      opB, opA,
      m, n, k,
      &alpha,
      dB, CUDA_R_16F, ldb,
      dA, CUDA_R_16F, lda,
      &beta,
      dC, CUDA_R_16F, ldc,
      CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP));
  }
  CUDA_CHECK(cudaEventRecord(stop));
  CUDA_CHECK(cudaEventSynchronize(stop));

  float ms=0.0f;
  CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
  double seconds = ms / 1e3;

  // FLOPs per GEMM = 2*M*N*K (FMA = 2 operations)
  const double flops_per = 2.0 * (double)a.m * (double)a.n * (double)a.k;
  const double total_flops = flops_per * (double)a.iters;
  const double tflops = total_flops / seconds / 1e12;

  std::cout << std::fixed << std::setprecision(2);
  std::cout << "Elapsed: " << ms << " ms for " << a.iters << " GEMMs\n";
  std::cout << "Throughput: " << tflops << " TFLOP/s (FP16 inputs, FP16 accumulate, TC)\n";

  // Cleanup
  CUDA_CHECK(cudaEventDestroy(start)); CUDA_CHECK(cudaEventDestroy(stop));
  CUBLAS_CHECK(cublasDestroy(handle));
  cudaFree(dA); cudaFree(dB); cudaFree(dC);
  return 0;
}

AastaLLL · November 10, 2025, 5:05am

Hi,

Thanks for sharing the source.
We will discuss this issue internally and update you with more information later.

Thanks.

AastaLLL · November 17, 2025, 5:33am

Hi,

Is your goal to benchmark or just want to know more about the nvjet kernel?

For benchmark, could you try CUTLASS instead of cuBLAS?
We usually can get better results with CUTLASS compared to cuBLAS and Triton.

Thanks.

quan.luo.101 · November 17, 2025, 4:55pm

Sure I can use CUTLASS.

But I want to confirm that this is NOT because Thor does worse on single-precision inference than mixed-precision.

AastaLLL · November 19, 2025, 4:42am

Hi,

What is the configuration you used?
We try the sample code with the default setting but the output is different from yours:

CUBLAS_COMPUTE_32F: 67.36 TFLOP/s
CUBLAS_COMPUTE_16F: 43.77 TFLOP/s

Thanks.

quan.luo.101 · November 19, 2025, 4:28pm

I just fixed the clock frequency to MAXN and used the codes I provided above.

Did you turn out to use CUTLASS kernel or Nvjet kernel? In my case, if nvjet kernel is used, the TFLOPs is much higher.

AastaLLL · November 20, 2025, 11:12am

Hi,

Thanks, we will double-check it.
What matrix size do you use?

We use the default setting:

GEMM size: M=8192 N=8192 K=8192

Thanks.

quan.luo.101 · November 20, 2025, 5:01pm

I used the exact command in the code comments. So it’s M = N = K = 4096

AastaLLL · December 8, 2025, 7:49am

Hi,

Sorry for the late update.

We run the benchmark again, but are still not able to reproduce the 160TFLOPs with FP32. (FP32: 96 vs FP16: 70 in our test*)
Are there any missing items in our modification below:

$ git diff
diff --git a/tc_gemm_fp16_acc16.cu b/tc_gemm_fp16_acc16.cu
index 8b9c1f9..10cc81c 100644
--- a/tc_gemm_fp16_acc16.cu
+++ b/tc_gemm_fp16_acc16.cu
@@ -143,7 +143,7 @@ int main(int argc, char** argv) {
       dA, CUDA_R_16F, lda,
       &beta,
       dC, CUDA_R_16F, ldc,
-      CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP));
+      CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP));
   }
   CUDA_CHECK(cudaDeviceSynchronize());
 
@@ -162,7 +162,7 @@ int main(int argc, char** argv) {
       dA, CUDA_R_16F, lda,
       &beta,
       dC, CUDA_R_16F, ldc,
-      CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP));
+      CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP));
   }
   CUDA_CHECK(cudaEventRecord(stop));
   CUDA_CHECK(cudaEventSynchronize(stop));

We only change the compute precision from FP16 to FP32. Is this enough?
Thanks.

quan.luo.101 · December 8, 2025, 11:55pm

Yes, that’s all I did.

In your testing, the FP32 acc performance is also better than FP16. Is that expected?

AastaLLL · December 10, 2025, 9:12am

Hi,

We try to reproduce the nvjet issue internally.
But we are checking the FP16 perf issue with our internal team first.

Thanks.

AastaLLL · December 11, 2025, 4:31am

Hi,

The issue is fixed in the newer cuBLAS package.
Please try to install the recent CUDA 13.1 release:

Below is the performance we got from the new library:

M=4096 N=4096 K=4096
FP32: 143.31 TFLOP/s
FP16: 167.72 TFLOP/s

M=8192 N=8192 K=8192
FP32: 78.78 TFLOP/s
FP16: 114.64 TFLOP/s

Thanks

system · December 30, 2025, 7:36am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Performance Benchmarking on Jetson Thor Jetson Thor cublas	7	952	December 2, 2025
Verifying claimed TOPS performance on Jetson Thor – CUTLASS kernel for SM110 does not run, SM80 gives very low performance (~6.9 TFLOP/s) Jetson Thor cudnn , cublas	22	586	January 21, 2026
Thor torch.mm benchmark results (float32/float16/float8_e3m2fn) Jetson Thor cuda , pytorch , benchmarks	5	327	September 15, 2025
cuBLAS convolution does not use Tensor Cores GPU-Accelerated Libraries cublas	6	2371	June 8, 2021
2D-FFT Benchmarks on Jetson AGX with various precisions Jetson AGX Xavier cuda	6	3073	October 18, 2021
HPL on cuBlas : Ok, but not on Tesla 1060 Board ! Tesla board crash on large array when launchin CUDA Programming and Performance	11	30530	December 20, 2009
cuBLAS kernels always run serially despite streams and AsyncMemCpy?!? CUDA Programming and Performance	17	6022	September 30, 2015
How to benchmark on Thor to get the real FP4/FP8 performance TFOPS Jetson Thor nvbugs , benchmarks	9	315	January 8, 2026
NVFP4 Performance Issue Jetson Thor llm	11	252	January 7, 2026
Reduced CuBLAS performance on a particular problem size? GPU-Accelerated Libraries	0	443	October 13, 2020

Conditions on NVJet kernels on Jetson Thor

Related topics