Understanding of Tensor Core, Cuda Core and other cores in Ampere architecture

wmindramalw · December 1, 2022, 12:37pm

I am trying to understand Tensor Cores, CUDA cores, and the other units in the Ampere architecture.

According to the below photo, there is a clearly shown tensor core.

What are the CUDA cores in that photo?
There is no TF32 unit shown. Does that mean there are no CUDA cores?
INT32, FP32 and FP64 are memories. Are those also called CUDA cores?
Does the TensorRT SDK use the Tensor Cores shown in the photo? If so, are there INT8 and FP16 memories inside the Tensor Core?

Is it possible to check the memory available for each data type using nvidia-ml-py3 or nvidia-ml?

rs277 · December 1, 2022, 6:51pm

Most of them. In common use, the term “CUDA core”, as in “the RTX 3080 has 8704 CUDA cores”, refers to the number of FP32 cores present.

If you read this section of the Programming Guide in conjunction with this illustration, the situation should become clearer.
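As a rough illustration of where a figure like 8704 comes from, here is a minimal sketch that multiplies the number of SMs by the FP32 lanes per SM. The values of 68 SMs and 128 FP32 lanes per SM for the RTX 3080 are assumptions taken from NVIDIA's published specifications, not from this thread, and the function name is made up for the example.

```python
# Sketch: "CUDA cores" = number of SMs x FP32 lanes per SM.
# ASSUMPTION: RTX 3080 has 68 SMs with 128 FP32 lanes each
# (figures from public spec sheets, not stated in this thread).

def cuda_cores(num_sms: int, fp32_lanes_per_sm: int) -> int:
    """Return the marketing 'CUDA core' count for a GPU."""
    return num_sms * fp32_lanes_per_sm

rtx3080_cores = cuda_cores(num_sms=68, fp32_lanes_per_sm=128)
print(rtx3080_cores)
```

The INT32, FP64, special-function and Tensor units do not appear anywhere in this count.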

Curefab · December 1, 2022, 11:23pm

None of the boxes are memories except the register file (and the instruction cache in some way).

The INT32 units do integer calculations, the FP32 and FP64 units floating-point calculations. LD/ST are load-store units, the SFU calculates special functions (e.g. sin/cos).

njuffa · December 1, 2022, 11:57pm

To emphasize what @rs277 said: “CUDA core” is a marketing term, not a technical term.

wmindramalw · December 2, 2022, 1:45am

Does that mean INT32, FP32 and FP64 are CUDA cores?

wmindramalw · December 2, 2022, 1:46am

Then which cores does the term “CUDA core” refer to? INT32, FP32 and FP64 + Tensor Core?

njuffa · December 2, 2022, 1:54am

As @rs277 already explained, when people speak of a GPU with n “CUDA cores” they mean a GPU with n FP32 cores, each of which can perform one single-precision fused multiply-add operation (FMA) per cycle. The number of “CUDA cores” does not indicate anything in particular about the number of 32-bit integer ALUs, or FP64 cores, or multi-function units, or “Tensor cores” (which I would also consider a marketing term).

If you want to know the throughput for various classes of operations, consult the CUDA Programming Guide rather than nice-looking block diagrams put out by the (technical) marketing guys.
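To make the "one FMA per cycle per FP32 core" statement concrete, the following sketch estimates a theoretical peak FP32 rate by counting each FMA as two floating-point operations. The 1.71 GHz clock is an assumed boost clock for the RTX 3080, used only for illustration, and the function name is hypothetical. Real kernels will usually land well below this figure.

```python
# Sketch: theoretical peak FP32 throughput from the "CUDA core" count.
# Each FP32 core is taken to issue one FMA per cycle, counted as
# 2 floating-point operations (one multiply, one add).
# ASSUMPTION: clock_ghz = 1.71 is a nominal boost clock, not a measured value.

def peak_fp32_tflops(fp32_cores: int, clock_ghz: float) -> float:
    """Return peak single-precision TFLOP/s for the given core count and clock."""
    flops_per_core_per_cycle = 2  # one FMA = multiply + add
    return fp32_cores * flops_per_core_per_cycle * clock_ghz * 1e9 / 1e12

estimate = peak_fp32_tflops(fp32_cores=8704, clock_ghz=1.71)
print(f"{estimate:.2f} TFLOP/s")
```

Note that this number says nothing about integer, FP64 or Tensor throughput, which is the point made above.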

rs277 · December 2, 2022, 2:18am

Yes, but in isolation that does not mean very much. The number of each of these units dictates, broadly speaking, how fast the SM can process a given instruction.

So if you wish to add INT32 values, each SM can process 4 x 16 = 64 additions per processor cycle. With only half as many FP64 cores (4 x 8), FP64 throughput is halved, to 32 per cycle.

Table 3 gives this information across the different GPU generations.
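The per-SM arithmetic above can be sketched as follows. The unit counts are read off the block diagram discussed in this thread (4 partitions per SM, 16 INT32 units per partition). The FP64 count of 8 per partition is an assumption derived from "half the number of FP64 cores", and the names are made up for the example.

```python
# Sketch: operations per cycle per SM = partitions per SM x units per partition.
# INT32 = 16 per partition comes from the thread; FP64 = 8 is an ASSUMPTION
# that follows from "only half the number of FP64 cores".
PARTITIONS_PER_SM = 4
UNITS_PER_PARTITION = {"INT32": 16, "FP64": 8}

def ops_per_cycle_per_sm(unit: str) -> int:
    """Return how many operations of this type one SM can issue per cycle."""
    return PARTITIONS_PER_SM * UNITS_PER_PARTITION[unit]

int32_rate = ops_per_cycle_per_sm("INT32")
fp64_rate = ops_per_cycle_per_sm("FP64")
print(int32_rate, fp64_rate)
```

Other GPUs have different unit counts per SM, so the dictionary would need to be filled in from the throughput table for the relevant compute capability.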

wmindramalw · December 3, 2022, 4:45am

Thank you very much for the information.
