Understanding Tensor Cores, CUDA Cores, and other cores in the Ampere architecture

I am trying to understand Tensor Cores, CUDA Cores, and the other units in the Ampere architecture.

In the photo below, a Tensor Core is clearly shown.

  1. What are the CUDA Cores in that photo?
  2. There is no TF32 unit shown; does that mean there are no CUDA Cores?
  3. INT32, FP32, and FP64 are memories, right? So are those also called CUDA Cores?
  4. Does the TensorRT SDK use the Tensor Cores shown in the photo? If so, are there INT8 and FP16 memories inside the Tensor Core?

Is it possible to check the available units for each data type using nvidia-ml-py3 or nvidia-ml?

Most of them. In common use, the term “CUDA core”, as in “the RTX 3080 has 8704 CUDA cores”, refers to the number of FP32 cores present.

If you read this section of the Programming Guide in conjunction with this illustration, the situation should become clearer.

None of the boxes are memories except the register file (and, in a sense, the instruction cache).

The INT32 units do integer calculations, and the FP32 and FP64 units do floating-point calculations. LD/ST are load/store units, and the SFU calculates special functions (e.g. sin/cos).

To emphasize what @rs277 said: “CUDA core” is a marketing term, not a technical term.

Does that mean the INT32, FP32, and FP64 units are CUDA cores?

Then which units does “CUDA core” refer to? INT32, FP32, and FP64 units plus Tensor Cores?

As @rs277 already explained, when people speak of a GPU with n “CUDA cores” they mean a GPU with n FP32 cores, each of which can perform one single-precision fused multiply-add operation (FMA) per cycle. The number of “CUDA cores” does not indicate anything in particular about the number of 32-bit integer ALUs, or FP64 cores, or multi-function units, or “Tensor cores” (which I would also consider a marketing term).
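To illustrate what the “CUDA core” count does tell you: since each FP32 core can issue one FMA per cycle, and an FMA counts as two floating-point operations (a multiply and an add), peak FP32 throughput is simply cores × clock × 2. A quick sketch using the RTX 3080 figure quoted above (the 1710 MHz boost clock is an assumption not stated in this thread):

```python
# Peak FP32 throughput implied by the "CUDA core" (FP32 core) count.
# One FMA per core per cycle; an FMA counts as two FP operations.
CUDA_CORES = 8704        # RTX 3080, as quoted above
BOOST_CLOCK_HZ = 1.71e9  # 1710 MHz boost clock (assumed for illustration)

peak_flops = CUDA_CORES * BOOST_CLOCK_HZ * 2
print(f"Peak FP32: {peak_flops / 1e12:.2f} TFLOPS")  # ~29.77 TFLOPS
```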

If you want to know the throughput for various classes of operations, consult the CUDA Programming Guide rather than nice-looking block diagrams put out by the (technical) marketing guys.
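Regarding the nvidia-ml-py3 question: NVML does not expose per-data-type unit counts. It reports device-level properties such as the name, total memory, and the compute capability; from the compute capability you can then look up per-operation throughputs in the Programming Guide tables. A minimal sketch, assuming the pynvml package (nvidia-ml-py3) and an NVIDIA driver are present:

```python
def query_device(index=0):
    """Return basic device properties via NVML, or None if NVML is unavailable."""
    try:
        import pynvml
    except ImportError:
        return None  # pynvml (nvidia-ml-py3) not installed
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # no NVIDIA driver / GPU on this machine
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        name = pynvml.nvmlDeviceGetName(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
        return {"name": name, "total_mem": mem.total, "cc": (major, minor)}
    finally:
        pynvml.nvmlShutdown()

print(query_device() or "NVML not available")
```

Note that `total_mem` here is device memory in bytes, not anything about INT32/FP32/FP64 units.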

Yes, but in isolation that does not mean very much. The number of each of these units dictates, broadly speaking, how fast the SM can process a given instruction.

So if you wish to add INT32 values, each SM can process 4 x 16 = 64 additions per clock cycle. With only half that number of FP64 cores, FP64 throughput halves to 32 per cycle.
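The arithmetic above can be sketched as follows, using the unit counts quoted in this thread (4 SM partitions with 16 INT32 units each, and half as many FP64 units):

```python
# Per-SM, per-cycle instruction throughput is (units per partition) x (partitions).
PARTITIONS = 4                 # processing blocks (partitions) per SM
INT32_UNITS = 16               # INT32 units per partition, as quoted above
FP64_UNITS = INT32_UNITS // 2  # half as many FP64 units in this example

int32_per_cycle = PARTITIONS * INT32_UNITS  # 4 * 16 = 64 INT32 adds/cycle
fp64_per_cycle = PARTITIONS * FP64_UNITS    # 4 * 8  = 32 FP64 ops/cycle
print(int32_per_cycle, fp64_per_cycle)  # 64 32
```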

Table 3 in the Programming Guide gives this information across the different GPU generations.

Thank you very much for the information.