Tensor core architecture deep-dive: any whitepaper or blog available?

I learned a bit about tensor core programming through beginner examples using the cuDNN, cuBLAS, and WMMA APIs. However, while NVIDIA's blogs offer plenty of software programming information, hardware information seems few and far between.
While classic CUDA has a plethora of literature available (books, blogs, and online tutorials) covering the architecture in detail (down to how the L1/L2 caches work, etc.), there is not much for tensor cores. A few blogs I read had GIF-style animations claiming that tensor cores can process certain matrices up to 16 times faster.
This leads to the following questions:

  • Are there any materials available online on tensor core hardware architecture?
  • Is the tensor core a completely separate IP block on the GPU die from the CUDA cores, is it meshed together with the CUDA cores, or are CUDA cores somehow reconfigured to become tensor cores?

You will find references to tensor cores in the original blog post that introduced them, in various other blogs, in each of the whitepapers for GPUs with tensor cores (e.g. the V100), and of course in the CUDA programming guide.

Like most things related to CUDA GPU internals, tensor core hardware architecture is not documented to the nth degree.

Both tensor cores and CUDA cores are functional units in a modern CUDA SM, which is the building block of any CUDA GPU. A functional unit is a hardware entity that supports execution of a particular set of instructions. CUDA cores execute single-precision floating-point add, multiply, and multiply-add instructions. All other instructions are handled by other functional units in the SM, not by the CUDA cores. Tensor cores handle the instructions for tensor core ops: the PTX mma, wmma, and wgmma instructions (and their SASS counterparts).
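To make that concrete, here is a minimal sketch (my own illustration, not from any NVIDIA documentation) of a kernel where one warp computes a single 16x16x16 tile D = A*B via the WMMA API; the shape, layouts, and fp16-in/fp32-accumulate combination are just one of several supported configurations:

```
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Hypothetical minimal kernel: one warp computes one 16x16x16 tile D = A*B.
// Fragments are distributed across the registers of the whole warp, and
// mma_sync is issued warp-wide -- the hardware that executes it is the
// tensor core unit, not the CUDA cores.
__global__ void wmma_tile(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);           // start from C = 0
    wmma::load_matrix_sync(a_frag, a, 16);         // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // tensor core op
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```

This must be launched with at least one full warp (e.g. `wmma_tile<<<1, 32>>>(dA, dB, dD);`) and compiled for an architecture with tensor cores (`-arch=sm_70` or later).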

There is no overlap or direct connection between CUDA cores and tensor cores.
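One way to see that separation, assuming the sketch above is compiled for sm_70 or later: dumping the compiled code with `cuobjdump --dump-sass` shows the `mma_sync` call lowered to HMMA instructions (executed by the tensor cores), whereas ordinary scalar float arithmetic in the same kernel would be lowered to FFMA instructions (executed by the CUDA cores).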