Hey there :)

I am just getting into CUDA programming and implemented matrix multiplication for practice. In my naive code I used one thread to calculate one element of the result matrix (4x4). However, many example codes do the calculation for 1024x1024 matrices in the same manner.
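For reference, my naive kernel looks roughly like this (names simplified; row-major `float` matrices are an assumption, and `n` is the matrix dimension):

```cuda
// One thread computes one element of C = A * B (all n x n, row-major).
__global__ void matMulNaive(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {          // guard against out-of-range threads
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}
```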

```
( 6) Multiprocessors, ( 64) CUDA Cores/MP: 384 CUDA Cores
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
```

From my understanding this shouldn't be possible, because I would need 1024x1024 threads but the Xavier NX supports only 6x2048 resident threads. Also, if there is a limit of 1024 threads per block, how can the maximum thread block dimension be 1024x1024 in 2D? Is there even a maximum matrix size for parallel matrix multiplication? Does anybody know how to use the Tensor Cores to speed up the calculation? And is there a difference in speed between variable types (e.g. 64-bit int compared to 64-bit double, or float)?
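For context, this is the launch configuration I assume the 1024x1024 examples use (a 16x16 block is just one common choice; `dA`, `dB`, `dC` are hypothetical device pointers for my kernel above):

```cuda
// 16x16 = 256 threads per block, so a 1024x1024 result needs a
// 64x64 grid of blocks. My understanding is that the hardware
// schedules these 4096 blocks onto the 6 SMs over time, so not
// all threads have to be resident at once.
int n = 1024;
dim3 block(16, 16);
dim3 grid((n + block.x - 1) / block.x,   // ceiling division
          (n + block.y - 1) / block.y);
matMulNaive<<<grid, block>>>(dA, dB, dC, n);
```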

Thanks!