Performance Benchmarking on Jetson Thor

Is there a standard way to reproduce the claimed theoretical performance numbers for Jetson Thor?

I tried using cuBLAS for matmul and can only achieve 130 TFLOPs (500 TFLOPs claimed) with FP16.
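
For reference, here is roughly what I ran — a minimal cuBLAS FP16 GEMM timing sketch (matrix size and iteration count are placeholders, error checking omitted):

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 8192;    // square GEMM, large enough to stress the Tensor Cores
    const int iters = 50;

    half *A, *B, *C;
    cudaMalloc(&A, sizeof(half) * n * n);
    cudaMalloc(&B, sizeof(half) * n * n);
    cudaMalloc(&C, sizeof(half) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const half alpha = __float2half(1.0f), beta = __float2half(0.0f);

    // Warm-up so kernel selection doesn't pollute the timing.
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // One GEMM is 2*n^3 FLOPs.
    double tflops = 2.0 * n * n * n * iters / (ms * 1e-3) / 1e12;
    printf("%.1f TFLOPs\n", tflops);
    return 0;
}
```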

I also wrote a kernel that does nothing but memory copies and can only achieve 200 GB/s memory bandwidth.
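
The copy kernel was along these lines — a simple grid-stride float4 copy (buffer size and launch configuration are placeholders):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Grid-stride copy: each thread streams float4 elements from src to dst.
__global__ void copyKernel(const float4 *src, float4 *dst, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        dst[i] = src[i];
}

int main() {
    const size_t bytes = 1ull << 30;    // 1 GiB per buffer
    const size_t n = bytes / sizeof(float4);
    float4 *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    copyKernel<<<1024, 256>>>(src, dst, n);    // warm-up

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    const int iters = 20;
    for (int i = 0; i < iters; ++i)
        copyKernel<<<1024, 256>>>(src, dst, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each iteration reads and writes `bytes`, so traffic is 2x per pass.
    double gbps = 2.0 * bytes * iters / (ms * 1e-3) / 1e9;
    printf("%.1f GB/s\n", gbps);
    return 0;
}
```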

Hi,

We recommend using LLM workloads for benchmarking.

Matrix multiplication has to read and write memory frequently, so it can be limited by memory bandwidth.
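
As a rough roofline sanity check — assuming the module's advertised ~273 GB/s LPDDR5X bandwidth and the 500 TFLOPs FP16 figure; treat the numbers as illustrative — you can estimate at what size an FP16 GEMM stops being bandwidth-bound:

```cpp
#include <cstdio>

// Back-of-envelope roofline check for a square FP16 GEMM of size n:
// FLOPs = 2*n^3, best-case DRAM traffic ~ 3 matrices * n^2 * 2 bytes.
int main() {
    const double peak_tflops = 500.0;   // claimed FP16 Tensor Core peak
    const double peak_gbps = 273.0;     // advertised LPDDR5X bandwidth (assumed)

    // Machine balance: FLOPs the GPU can do per byte of DRAM traffic.
    double balance = peak_tflops * 1e12 / (peak_gbps * 1e9);

    for (int n = 1024; n <= 16384; n *= 2) {
        double flops = 2.0 * n * n * n;
        double bytes = 3.0 * n * (double)n * 2.0;   // A, B, C in FP16
        double intensity = flops / bytes;           // FLOPs per byte
        printf("n=%5d  intensity=%7.0f  %s\n", n, intensity,
               intensity < balance ? "bandwidth-bound" : "compute-bound");
    }
    printf("machine balance = %.0f FLOPs/byte\n", balance);
    return 0;
}
```

By this best-case model (each matrix touching DRAM exactly once), the crossover is around n ≈ 5500, and real kernels move more data than that due to tiling, so mid-sized GEMMs can sit well under the compute roof.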
Please find the Thor benchmark results below:

An example to reproduce the LLM perf is shared in the topic below:

Thanks.

I ran the LLM perf test, but I want a deeper understanding of the device.

I used Nsight Compute to look into the device. So is the 500 TFLOPs figure basically achieved with data resident in L1 cache and with 2:4 sparsity enabled?

And the actual compute ceiling for dense matrices (sparsity off) is ~225 TFLOPs?

Hi,

Do you mean FP32 TFLOPs?
If yes, the FP32 performance of Thor’s CUDA cores is:

MAXN: 8.064 TFLOPs
120W: 7.096 TFLOPs

You can find more details in the document below:

Thanks.

Sorry, I didn’t see any TFLOPs data in the data sheet. I got the figures from NVIDIA’s launch keynote, via the release thread on Reddit.

I think the 500 / 1000 / 2000 TFLOPs figures for FP16, FP8, and FP4 are Tensor Core performance.

Could you please confirm whether the 7.8 TFLOPs figure for FP32 is for the CUDA cores or the Tensor Cores? If it’s for the CUDA cores, what’s the Tensor Core performance for FP32?

I tried a simple cuBLAS GEMM on an RTX 4090 and it easily saturates the claimed Tensor Core throughput (280 / 330 TFLOPs). Why is it so hard on Thor? Is there a standard way to get the expected performance (500 TFLOPs; I can only get ~130 TFLOPs) on it?

Hi,

Please download the “Jetson T5000 Series Modules Data Sheet” file; you can find the information on page 1.
The 7.096 FP32 TFLOPs figure is the performance of the CUDA cores.

The fifth-generation Tensor Cores support TensorFloat-32 (TF32), bfloat16, FP16, FP8, FP4, and INT8.
They don’t support full FP32.
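
If you want FP32 data to run on the Tensor Cores, the usual route is to opt into TF32 in cuBLAS — a minimal sketch (inputs stay FP32 in memory but are rounded to TF32 inside the MMA units, so results differ slightly from true FP32):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Run an FP32 GEMM through the Tensor Cores by opting into TF32 math.
void tf32_gemm(cublasHandle_t handle, int n,
               const float *A, const float *B, float *C) {
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
}
```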

Thanks.