Does an SM have more FP units than its "CUDA cores"?

I have a compute-intensive CUDA kernel that only executes 32-bit FP fmad (FFMA) instructions.
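For context, a minimal sketch of what such a microbenchmark kernel might look like (the kernel name, loop count, and launch parameters here are my own illustration, not taken from the original post):

```cuda
// Hypothetical FMA-throughput microbenchmark (illustration only).
// Each thread runs a long chain of dependent FP32 fmad operations,
// which the compiler emits as FFMA instructions.
__global__ void ffma_kernel(float *out, float a, float b, int iters)
{
    float x = threadIdx.x * 0.001f;
    for (int i = 0; i < iters; ++i) {
        // fmaf compiles to a single FFMA; the dependency on x
        // serializes the chain within each thread.
        x = fmaf(x, a, b);
    }
    // Write the result so the loop is not optimized away.
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}
```

Timing would then be done (e.g. with cudaEvent timers) over varying grid sizes while iters stays fixed.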

RTX 3070 Ti
→ 48 SM
→ 128 CUDA cores / SM
→ Total of 6144 CUDA cores

Interestingly, the result suggests that each SM has more FP units capable of executing CUDA fmad instructions than it has CUDA cores.

I assume each CUDA core executes a different thread's FP instructions.
Since the RTX 3070 Ti has 6144 CUDA cores in total, I expected the execution time to double once the total thread count (threads per block × blocks) exceeds 6144. Instead, the execution time only jumps after 24576 threads. I have another similar kernel that deterministically shows the same number.

(Q) Does each SM contain additional FP execution units that can execute CUDA fmad instructions?

Just looking at the numbers:

(128 + x) × 48 = 24576
x = 384 additional FP execution units per SM?

By the way, I found this in the manual

It sounds like each SM has an additional 168 FP64 units?
Does that translate into an effective 168 × 2 = 336 FP32 execution units?

336 is not quite the 384 I observed, but I'd appreciate it if someone could share what's occurring.

Thanks !

There aren't additional CUDA cores. deviceQuery should accurately identify the number of CUDA cores, i.e. the number of FP32 units. However, if you are issuing dependent work, subsequent FFMA/FMUL/FADD instructions may not issue while previous results are still outstanding. If you then increase the number of threads (which to me means issuing independent work, though I acknowledge you haven't described what you are doing to that extent), you may well see an increase in throughput. GPUs generally like "more threads" for this reason (latency hiding).
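Latency hiding is also consistent with the 24576 figure. As a back-of-the-envelope sketch (the ~4-cycle dependent FFMA latency and one-warp-per-cycle issue per SM partition are my assumptions about Ampere, not stated in this thread):

```python
# Back-of-the-envelope latency-hiding arithmetic (assumed numbers:
# ~4-cycle dependent FFMA latency, 4 scheduler partitions per SM,
# each partition issuing one warp's FFMA per cycle).
sms = 48
partitions_per_sm = 4
warp_size = 32
ffma_latency_cycles = 4  # assumed dependent-issue latency on Ampere

# With a 4-cycle dependent latency and 1 warp issued per cycle,
# ~4 independent warps per partition keep the FP32 pipes full.
warps_to_hide_latency = ffma_latency_cycles
threads = sms * partitions_per_sm * warps_to_hide_latency * warp_size
print(threads)  # 24576
```

If that assumption holds, the step at 24576 threads reflects the point where the FP32 pipelines are first saturated, not extra FP units.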

No.

Please read the manual excerpt again: the 168 FP64 units are for GPUs with 84 SMs, i.e. 2 per SM.

So your 48 SMs yield 3072 FP32, 3072 mixed FP32/INT32 and 96 FP64 cores.

Or 64 FP32, 64 mixed FP32/INT32 and 2 FP64 per SM.
The FP64 units are shared between the SM partitions, whereas the other units are split evenly, a quarter to each of the 4 SM partitions per SM.
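The per-SM and whole-GPU figures above are just the 84-SM numbers from the manual scaled to 48 SMs; a quick sanity check of the arithmetic:

```python
# Sanity-check the unit counts quoted above for a 48-SM GPU.
fp64_total_84sm = 168                  # from the manual, for an 84-SM GPU
fp64_per_sm = fp64_total_84sm // 84    # 2 per SM

sms = 48
fp32_per_sm = 64                       # dedicated FP32 lanes
mixed_per_sm = 64                      # FP32/INT32-capable lanes

print(sms * fp32_per_sm)                    # 3072 FP32
print(sms * mixed_per_sm)                   # 3072 mixed FP32/INT32
print(sms * fp64_per_sm)                    # 96 FP64
print(sms * (fp32_per_sm + mixed_per_sm))   # 6144 "CUDA cores"
```

The 6144 total matches the deviceQuery CUDA-core count, since both the dedicated FP32 lanes and the mixed FP32/INT32 lanes can execute FFMA.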

You also get Tensor Cores, which can execute FMA instructions on types up to TF32 (a 19-bit format: 1 sign, 8 exponent, and 10 mantissa bits).
For INT32 there is an additional half core per SM for the uniform datapath ("core" here means a bandwidth of 1 pipelined instruction per clock; "half core" means 1 math instruction per 2 clock cycles). It is mostly used for calculating loop variables and memory offsets, which are identical across the 32 threads of a warp.

FP64, the uniform datapath, and the Tensor Cores have to be targeted explicitly with different instructions; they do not execute the FP32 or INT32 fmad instructions.