Does an SM have more FP units than its "CUDA cores"?

I have a compute-intensive CUDA kernel that only executes 32-bit FP fmad (FFMA) instructions.
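For context, a minimal sketch of what such a microbenchmark kernel might look like (the kernel name, loop count, and launch parameters here are my own illustration, not taken from the original post):

```cuda
// Hypothetical FMA-throughput microbenchmark (illustration only).
// Each thread runs a long chain of dependent FP32 fmad operations,
// which the compiler emits as FFMA instructions.
__global__ void ffma_kernel(float *out, float a, float b, int iters)
{
    float x = threadIdx.x * 0.001f;
    for (int i = 0; i < iters; ++i) {
        // fmaf compiles to a single FFMA; the dependency on x
        // serializes the chain within each thread.
        x = fmaf(x, a, b);
    }
    // Write the result so the loop is not optimized away.
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}
```

Timing would then be done (e.g. with cudaEvent timers) over varying grid sizes while iters stays fixed.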

RTX 3070 Ti
→ 48 SM
→ 128 CUDA cores / SM
→ Total of 6144 CUDA cores

Interestingly, the result suggests that each SM has more FP units capable of executing CUDA fmad instructions than it has CUDA cores.

I assume each CUDA core executes a different thread's FP instructions.
Since the RTX 3070 Ti has 6144 CUDA cores in total, I expected the execution time to double once the total thread count (threads per block × blocks) exceeds 6144. Instead, the execution time only jumps after 24576 threads. I have another similar kernel that deterministically shows the same number.

(Q) Does each SM contain additional FP execution units that can execute CUDA fmad instructions?

Just looking at the numbers:

(128 + x) × 48 = 24576
x = 384 additional FP execution units per SM?

By the way, I found this in the manual

It sounds like each SM has an additional 168 FP64 units?
Does that translate into an effective 168 × 2 = 336 FP32 execution units?

336 is not quite the 384 I observed, but I'd appreciate it if someone could share what's occurring.

Thanks !

There aren't additional CUDA cores. deviceQuery should accurately identify the number of CUDA cores, i.e. the number of FP32 units. However, if you are issuing dependent work, subsequent FFMA/FMUL/FADD instructions may not issue while previous results are still outstanding. If you then increase the number of threads (which to me means issuing independent work, though I acknowledge you haven't described what you are doing to that extent), you may well see an increase in throughput. GPUs generally like "more threads" for this reason (latency hiding).
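Latency hiding is also consistent with the 24576 figure. As a back-of-the-envelope sketch (the ~4-cycle dependent FFMA latency and one-warp-per-cycle issue per SM partition are my assumptions about Ampere, not stated in this thread):

```python
# Back-of-the-envelope latency-hiding arithmetic (assumed numbers:
# ~4-cycle dependent FFMA latency, 4 scheduler partitions per SM,
# each partition issuing one warp's FFMA per cycle).
sms = 48
partitions_per_sm = 4
warp_size = 32
ffma_latency_cycles = 4  # assumed dependent-issue latency on Ampere

# With a 4-cycle dependent latency and 1 warp issued per cycle,
# ~4 independent warps per partition keep the FP32 pipes full.
warps_to_hide_latency = ffma_latency_cycles
threads = sms * partitions_per_sm * warps_to_hide_latency * warp_size
print(threads)  # 24576
```

If that assumption holds, the step at 24576 threads reflects the point where the FP32 pipelines are first saturated, not extra FP units.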

No.

Please read the manual excerpt again: the 168 FP64 units are for GPUs with 84 SMs, i.e. 2 per SM.

So your 48 SMs yield 3072 FP32, 3072 mixed FP32/INT32 and 96 FP64 cores.

Or 64 FP32, 64 mixed FP32/INT32 and 2 FP64 per SM.
The FP64 units are shared between the SM partitions, whereas the other units are split evenly, a quarter to each of the 4 SM partitions per SM.
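The per-SM and whole-GPU figures above are just the 84-SM numbers from the manual scaled to 48 SMs; a quick sanity check of the arithmetic:

```python
# Sanity-check the unit counts quoted above for a 48-SM GPU.
fp64_total_84sm = 168                  # from the manual, for an 84-SM GPU
fp64_per_sm = fp64_total_84sm // 84    # 2 per SM

sms = 48
fp32_per_sm = 64                       # dedicated FP32 lanes
mixed_per_sm = 64                      # FP32/INT32-capable lanes

print(sms * fp32_per_sm)                    # 3072 FP32
print(sms * mixed_per_sm)                   # 3072 mixed FP32/INT32
print(sms * fp64_per_sm)                    # 96 FP64
print(sms * (fp32_per_sm + mixed_per_sm))   # 6144 "CUDA cores"
```

The 6144 total matches the deviceQuery CUDA-core count, since both the dedicated FP32 lanes and the mixed FP32/INT32 lanes can execute FFMA.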

You also get Tensor Cores, which can execute FMA instructions on types up to TF32 (a 19-bit format: 1 sign, 8 exponent, and 10 mantissa bits).
For INT32 there is an additional half core per SM for the uniform datapath ("core" here means a bandwidth of 1 pipelined instruction per clock; "half core" means 1 math instruction per 2 clock cycles). It is mostly used for calculating loop variables and memory offsets, which are identical across the 32 threads of a warp.

FP64, the uniform datapath, and the Tensor Cores have to be targeted explicitly with different instructions; they do not execute the FP32 or INT32 fmad instructions.