Performance Benchmarking on Jetson Thor

Is there a standard way to reproduce the claimed theoretical performance numbers for Jetson Thor?

I tried using cuBLAS for matmul and can only achieve 130 TFLOPs (500 TFLOPs claimed) with FP16.
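
For reference, here is roughly what I ran — a minimal cuBLAS FP16 GEMM timing sketch (matrix size and iteration count are placeholders, error checking omitted):

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 8192;    // square GEMM, large enough to stress the Tensor Cores
    const int iters = 50;

    half *A, *B, *C;
    cudaMalloc(&A, sizeof(half) * n * n);
    cudaMalloc(&B, sizeof(half) * n * n);
    cudaMalloc(&C, sizeof(half) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const half alpha = __float2half(1.0f), beta = __float2half(0.0f);

    // Warm-up so kernel selection doesn't pollute the timing.
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // One GEMM is 2*n^3 FLOPs.
    double tflops = 2.0 * n * n * n * iters / (ms * 1e-3) / 1e12;
    printf("%.1f TFLOPs\n", tflops);
    return 0;
}
```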

I also wrote a kernel that does nothing but memory copies and can only achieve 200 GB/s memory bandwidth.
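
The copy kernel was along these lines — a simple grid-stride float4 copy (buffer size and launch configuration are placeholders):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Grid-stride copy: each thread streams float4 elements from src to dst.
__global__ void copyKernel(const float4 *src, float4 *dst, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        dst[i] = src[i];
}

int main() {
    const size_t bytes = 1ull << 30;    // 1 GiB per buffer
    const size_t n = bytes / sizeof(float4);
    float4 *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    copyKernel<<<1024, 256>>>(src, dst, n);    // warm-up

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    const int iters = 20;
    for (int i = 0; i < iters; ++i)
        copyKernel<<<1024, 256>>>(src, dst, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each iteration reads and writes `bytes`, so traffic is 2x per pass.
    double gbps = 2.0 * bytes * iters / (ms * 1e-3) / 1e9;
    printf("%.1f GB/s\n", gbps);
    return 0;
}
```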

Hi,

We recommend using LLM workloads for benchmarking.

Matrix multiplication has to read and write memory frequently, so it can be limited by memory bandwidth.
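
As a rough roofline sanity check — assuming the module's advertised ~273 GB/s LPDDR5X bandwidth and the 500 TFLOPs FP16 figure; treat the numbers as illustrative — you can estimate at what size an FP16 GEMM stops being bandwidth-bound:

```cpp
#include <cstdio>

// Back-of-envelope roofline check for a square FP16 GEMM of size n:
// FLOPs = 2*n^3, best-case DRAM traffic ~ 3 matrices * n^2 * 2 bytes.
int main() {
    const double peak_tflops = 500.0;   // claimed FP16 Tensor Core peak
    const double peak_gbps = 273.0;     // advertised LPDDR5X bandwidth (assumed)

    // Machine balance: FLOPs the GPU can do per byte of DRAM traffic.
    double balance = peak_tflops * 1e12 / (peak_gbps * 1e9);

    for (int n = 1024; n <= 16384; n *= 2) {
        double flops = 2.0 * n * n * n;
        double bytes = 3.0 * n * (double)n * 2.0;   // A, B, C in FP16
        double intensity = flops / bytes;           // FLOPs per byte
        printf("n=%5d  intensity=%7.0f  %s\n", n, intensity,
               intensity < balance ? "bandwidth-bound" : "compute-bound");
    }
    printf("machine balance = %.0f FLOPs/byte\n", balance);
    return 0;
}
```

By this best-case model (each matrix touching DRAM exactly once), the crossover is around n ≈ 5500, and real kernels move more data than that due to tiling, so mid-sized GEMMs can sit well under the compute roof.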
Please find the Thor benchmark results below:

An example to reproduce the LLM perf is shared in the topic below:

Thanks.

I ran the LLM perf test, but I want a deeper understanding of the device.

I used Nsight Compute to look into the device. So is the 500 TFLOPs figure basically achieved with data resident in L1 cache and with 2:4 sparsity enabled?

And the actual compute ceiling for dense matrices (sparsity off) is ~225 TFLOPs?

Hi,

Do you mean FP32 TFLOPs?
If yes, the FP32 performance of Thor’s CUDA cores is:

MAXN: 8.064 TFLOPs
120W: 7.096 TFLOPs

You can find more details in the document below:

Thanks.

Sorry, I didn’t see any TFLOPs data in the data sheet. I got the figures from NVIDIA’s launch keynote, via the release thread on Reddit.

I think the 500 / 1000 / 2000 TFLOPs figures for FP16, FP8, and FP4 are Tensor Core performance.

Could you please confirm whether the 7.8 TFLOPs figure for FP32 is for the CUDA cores or the Tensor Cores? If it’s for the CUDA cores, what’s the Tensor Core performance for FP32?

I tried a simple cuBLAS GEMM on an RTX 4090 and it easily saturates the claimed Tensor Core throughput (280 / 330 TFLOPs). Why is it so hard on Thor? Is there a standard way to get the expected performance (500 TFLOPs; I can only get ~130 TFLOPs) on it?

Hi,

Please download the “Jetson T5000 Series Modules Data Sheet” file; you can find the information on page 1.
The 7.096 FP32 TFLOPs figure is the performance of the CUDA cores.

The fifth-generation Tensor Cores support TensorFloat-32 (TF32), bfloat16, FP16, FP8, FP4, and INT8.
They don’t support full FP32.
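
If you want FP32 data to run on the Tensor Cores, the usual route is to opt into TF32 in cuBLAS — a minimal sketch (inputs stay FP32 in memory but are rounded to TF32 inside the MMA units, so results differ slightly from true FP32):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Run an FP32 GEMM through the Tensor Cores by opting into TF32 math.
void tf32_gemm(cublasHandle_t handle, int n,
               const float *A, const float *B, float *C) {
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
}
```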

Thanks.