Clarification on CUDA Core and Tensor Core counts for Jetson AGX Thor

Hello NVIDIA team,

I am working with a Jetson AGX Thor developer kit and would like to clarify an apparent ambiguity in publicly stated GPU specifications, specifically regarding CUDA core and Tensor Core counts.

According to the Thor SoC TRM, the Blackwell GPU organization is described as follows:

  • Up to 3 GPCs, each with 4 TPCs (up to 12 TPCs total)

  • Each TPC contains 2 SMs (up to 24 SMs total)

  • Each SM has 128 CUDA cores (3,072 CUDA cores in the maximum configuration)

  • Each SM is partitioned into four processing blocks, with each block containing a 5th-generation Tensor Core (up to 96 Tensor Cores total)
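The maximum configuration follows directly from the hierarchy above; a minimal arithmetic sketch (Python used only for the bookkeeping, with constants taken from the TRM text quoted above):

```python
# Maximum Blackwell GPU configuration described in the Thor SoC TRM.
GPCS = 3                 # up to 3 Graphics Processing Clusters
TPCS_PER_GPC = 4         # up to 4 Texture Processing Clusters per GPC
SMS_PER_TPC = 2          # 2 Streaming Multiprocessors per TPC
CUDA_CORES_PER_SM = 128  # per the TRM's SM description
TENSOR_UNITS_PER_SM = 4  # four processing blocks, each with a 5th-gen Tensor Core

max_sms = GPCS * TPCS_PER_GPC * SMS_PER_TPC       # 24
max_cuda_cores = max_sms * CUDA_CORES_PER_SM      # 3072
max_tensor_units = max_sms * TENSOR_UNITS_PER_SM  # 96

print(max_sms, max_cuda_cores, max_tensor_units)  # 24 3072 96
```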

Using the CUDA Driver API on the device, I queried the runtime GPU properties and obtained the following:

  • GPU name: NVIDIA Thor

  • SM count: 20

  • L2 cache size: 32 MiB

  • Max shared memory per SM: 228 KiB

Based on the runtime-reported SM count:

  • CUDA cores = 20 SMs × 128 CUDA cores per SM = 2,560 CUDA cores
    (this matches several published specifications, but differs from the maximum configuration described in the TRM)

Tensor Core counting, however, is where ambiguity arises:

  • At the SM (logical / software-visible) level: 20 Tensor Cores (1 per SM)

  • At the micro-architectural level: 20 SMs × 4 Tensor Core units per SM = 80 physical Tensor Core units

  • Some external sources reference “96 Tensor Cores,” which appears to correspond to the maximum 24-SM configuration (24 × 4)
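To make the two counting conventions explicit, here is a small sketch contrasting the logical (one per SM) and micro-architectural (four units per SM) views, using the SM counts discussed above; the helper function is illustrative, not an NVIDIA API:

```python
def tensor_core_counts(sm_count, units_per_sm=4):
    """Return (logical, physical) Tensor Core counts for a given SM count.

    'Logical' counts one Tensor Core per SM (software-visible view);
    'physical' counts the per-processing-block execution units.
    """
    return sm_count, sm_count * units_per_sm

devkit_logical, devkit_physical = tensor_core_counts(20)  # devkit: 20 SMs
max_logical, max_physical = tensor_core_counts(24)        # max config: 24 SMs

print(devkit_logical, devkit_physical)  # 20 80
print(max_physical)                     # 96 -- the figure quoted in some sources
```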

Could you please clarify the following:

  1. What is the recommended and correct way to report Tensor Core counts for the Jetson AGX Thor developer kit (SM-level logical resources vs. internal micro-architectural execution units)?

  2. When NVIDIA documentation or marketing material refers to “96 Tensor Cores,” does this explicitly refer to the maximum SKU and count micro-architectural Tensor Core units?

  3. For technical publications and performance analysis, is it accurate to describe Tensor Core resources as “one logical Tensor Core per SM, implemented internally as four Tensor Core execution units”?

This clarification would be very helpful for ensuring accurate and consistent reporting in academic and technical work.

Thank you for your time and clarification.


Hi,

Could you share which document/spec you are referring to?
Based on our data sheet below, the Thor devkit should have 2560 CUDA cores instead of 3072.

Thanks.

I have the same question.
2560 CUDA cores / 128 cores per SM = 20 SMs.
For Blackwell GPUs, 4 Tensor Cores per SM × 20 SMs = 80 Tensor Cores.
But the data sheet states that the T5000 has 96 Tensor Cores.
Please clarify.

Hi,

Based on our document, Thor has

2560 NVIDIA® CUDA® cores

We are checking the details with our internal team.
Will get back to you with more information.

Thanks.

Hi,

For Jetson products, future documentation versions will not specify internal Tensor Core counts or microarchitectural unit breakdowns.
Instead, the focus is on end‑to‑end AI performance metrics such as TOPS/TFLOPS and workload benchmarks, which better represent real application behavior on the platform.

Thanks.

Thor-Soc-TRM_DP-11881-002_v1.1.pdf (15.8 MB)

Hi,

Thank you for the clarification regarding the CUDA core count on the Jetson AGX Thor developer kit. That resolves the discrepancy between the maximum architectural configuration and the devkit-specific implementation.

To clarify the source of my earlier interpretation, I was referring to the Thor SoC Technical Reference Manual (attached here), which states:

a) “The Thor™ SoC has an NVIDIA Blackwell GPU with up to three Graphics Processing Clusters (GPCs).”
b) “Each GPC features up to four Texture Processing Clusters (TPCs), each consisting of two Streaming Multiprocessors (SMs).”
c) “The Blackwell Streaming Multiprocessor (SM) has 128 CUDA cores and is partitioned into four processing blocks, each containing a 5th-generation Tensor Core.”

Based on this description, my initial understanding was that the maximum architectural configuration corresponds to 3 GPCs, 12 TPCs, and 24 SMs (3,072 CUDA cores), whereas the Jetson AGX Thor devkit implements a reduced configuration with 20 SMs (2,560 CUDA cores), as you confirmed.

One remaining point I would appreciate clarification on concerns references to “96 Tensor Cores” that appear in some Thor-related materials. Given that the Jetson AGX Thor devkit reports 20 SMs at runtime, and that the TRM describes each SM as being internally partitioned into four Tensor Core execution blocks, it is unclear how the figure of 96 Tensor Cores should be interpreted for the devkit configuration.

To ensure accurate documentation, could you please confirm whether:

  • the “96 Tensor Cores” figure refers to the maximum Thor GPU architectural configuration (e.g., 24 SMs × 4 Tensor Core execution units), rather than the Jetson AGX Thor devkit specifically; and

  • for Jetson platforms, Tensor Core resources are not intended to be reported or scaled at the SM level, but instead reflected through aggregate performance metrics (TOPS/TFLOPS).

I understand and appreciate the direction toward emphasizing end-to-end AI performance metrics in future Jetson documentation. For research and system-level analysis use cases, however, a brief clarification distinguishing maximum architectural capability from devkit-specific configuration would be extremely helpful to avoid ambiguity when interpreting the TRM.

Thank you again for your time and clarification.

Best,
Amir