About tensor core's flops/clk and wmma shape?

I read this blog: Programming Tensor Cores in CUDA 9 | NVIDIA Technical Blog

Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock.
The FP16 multiply results in a full-precision result that is accumulated in FP32 operations with the other products in a given dot product for a 4x4x4 matrix multiply
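For context, the "64 FMA per clock" figure follows directly from the 4x4x4 shape: a matrix multiply of that shape contains exactly M*N*K = 64 multiply-accumulates. A minimal host-side counting sketch (my own illustration, not from the blog):

```cuda
// Host-side sketch (plain C++ in a .cu file): count the FMAs in one
// 4x4x4 tile C += A*B. The count is M*N*K = 64, matching the
// "64 FMA per clock" figure quoted above for one Tensor Core.
#include <cstdio>

int main() {
    const int M = 4, N = 4, K = 4;
    float A[M][K] = {}, B[K][N] = {}, C[M][N] = {};
    int fmas = 0;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k) {
                C[i][j] += A[i][k] * B[k][j];  // one fused multiply-add
                ++fmas;
            }
    printf("FMAs per 4x4x4 tile: %d\n", fmas);  // prints 64
    return 0;
}
```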

Q1: Is it the same for every generation of Tensor Core, i.e. do they all perform 64 FMA/clock? If not, is there a link where I can find detailed information on the FMAs per clock and the supported shapes for each Tensor Core generation?


During program execution, multiple Tensor Cores are used concurrently by a full warp of execution. The threads within a warp provide a larger 16x16x16 matrix operation to be processed by the Tensor Cores.
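In code, that warp-level 16x16x16 operation is what the `nvcuda::wmma` API exposes. A minimal single-warp sketch, assuming cc 7.0+, half-precision inputs, and 16x16 matrices stored contiguously (leading dimension 16):

```cuda
// Minimal sketch: one warp computes D = A*B + C for a single
// 16x16x16 tile via the wmma API (requires sm_70 or newer).
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_16x16x16(const half *a, const half *b, float *d) {
    // Fragments describing one 16x16x16 tile, distributed across the warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);    // start with C = 0
    wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // acc = A*B + acc
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```

All 32 threads of the warp must execute these calls together; the fragment layout is opaque and architecture-dependent.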

Q2: From the programmer's view, wmma provides a 16x16x16 matrix operation. Is that the same for all CUDA versions?

No, it's not the same. This post covers some related calculations and links. I don't have any comment on the relationship between throughput and shape, but if I were guessing, I would assume it would be the (inverse) ratio of the FLOPs per op, times the base throughput (Tensor Core FLOPs/clk).
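To make that guess concrete, here is a back-of-envelope sketch. All the specific numbers (64 FMA/clk per Tensor Core, 8 Tensor Cores serving a warp's SM partition) are illustrative assumptions for the arithmetic, not measured or documented figures:

```cuda
// Back-of-envelope sketch of the guess above: issue rate for a
// warp-level op ~= aggregate Tensor Core FMA/clk divided by the
// FMAs contained in one op of the given shape. Numbers are
// illustrative placeholders, not measured values.
#include <cstdio>

int main() {
    const long tc_fma_per_clk = 64;    // per Tensor Core (Volta blog figure)
    const long tc_count       = 8;     // Tensor Cores assumed available (assumption)
    const long m = 16, n = 16, k = 16; // warp-level wmma shape
    const long fma_per_op = m * n * k; // 4096 FMAs in one 16x16x16 op

    double ops_per_clk = double(tc_fma_per_clk * tc_count) / fma_per_op;
    printf("~%.3f wmma ops/clk, i.e. ~%.0f clocks per op\n",
           ops_per_clk, 1.0 / ops_per_clk);
    return 0;
}
```

By this logic, a smaller shape (fewer FLOPs per op) would issue at a proportionally higher op rate against the same base FLOPs/clk.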

If it were me, I wouldn't tie this to "CUDA versions", which to me means things like CUDA 10.x, CUDA 11.x, etc., i.e. software versions. I don't know of any connection between shape support and CUDA software version per se (although there is an indirect connection, because supporting newer GPU architectures requires newer minimum CUDA versions).

The connection is to GPU architecture primarily, AFAIK. So newer architectural generations may offer additional shapes compared to older ones. To get a sense of this, you could take one of the matrix-multiply instructions at the PTX level and see what shapes were "introduced" at each architectural generation. For example, take the mma instruction and scroll down to the "Target ISA Notes". We can see that in general such instructions require cc 7.0 minimum. For that particular instruction, it appears the only shape offered at cc 7.0 was 8x8x4. At cc 7.5 (Turing), a 16x8x8 shape/option was added for f16 work, to pick one example.