Parallel Matrix Multiplication in CUDA - A Question about Threads/Blocks and Tensor Cores

Hey there :)

I am just getting into CUDA programming and implemented matrix multiplication for practice. In my naive code I used one thread to calculate one element of the result matrix (4x4). However, many example codes do this calculation for 1024x1024 matrices in the same manner.

  ( 6) Multiprocessors, ( 64) CUDA Cores/MP:     384 CUDA Cores
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)

From my understanding this shouldn't be possible, because I would need 1024x1024 threads but the Xavier NX supports only 6x2048 threads. Also, if there is a limit of 1024 threads per block, how can the maximum dimension of a thread block be 1024x1024 in 2D? Is there even a maximum dimension for parallel matrix multiplication? Does anybody know how to use the Tensor Cores in order to speed up the calculation? Is there any difference in speed when using different variable types (e.g. 64-bit int compared to 64-bit double or float)?


Sorry for the late response. May I know if this is still an open question?


It’s recommended to check our matrix multiplication sample first:


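In general, we don’t launch one task with a huge number of simultaneously running threads. The work is split into many blocks with a hardware-friendly thread count, and the GPU schedules those blocks onto the multiprocessors as resources become free.
This means you don’t need 1024x1024 threads resident at once. The 6x2048 figure only limits how many threads can be resident at the same time; a grid may contain far more threads, organized in small blocks (e.g. a multiple of the 32-thread warp). Likewise, (1024, 1024, 64) is a per-dimension limit: the product of the three block dimensions still must not exceed 1024 threads per block.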
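To make the idea of many small blocks concrete, here is a minimal sketch (not the official sample) that multiplies two 1024x1024 matrices with one thread per output element. The kernel name `matmul_naive`, the helper `ceil_div`, the 16x16 block shape, and the constant test data are assumptions chosen for illustration only.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread computes one element of C = A * B (square n x n, row-major).
__global__ void matmul_naive(const float* A, const float* B, float* C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;  // guard for n not divisible by the block size

    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
        acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

// Number of blocks needed to cover n elements with `block` threads per block.
int ceil_div(int n, int block) { return (n + block - 1) / block; }

int main()
{
    const int n = 1024;
    const size_t count = size_t(n) * n;
    const size_t bytes = count * sizeof(float);

    // Constant inputs so the result is easy to check: every C element is 2 * n.
    std::vector<float> hA(count, 1.0f), hB(count, 2.0f), hC(count, 0.0f);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // 256 threads per block, below the 1024-thread limit
    dim3 grid(ceil_div(n, block.x), ceil_div(n, block.y));  // 64 x 64 blocks

    // The grid holds 1024 x 1024 threads in total; the hardware runs the
    // blocks in batches as multiprocessors become available.
    matmul_naive<<<grid, block>>>(dA, dB, dC, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    std::printf("grid = %u x %u blocks, C[0] = %.1f (expected %.1f)\n",
                grid.x, grid.y, hC[0], 2.0f * n);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}
```

The block shape is a tuning choice; any shape whose x*y*z product stays at or below 1024 is valid, and the grid size is simply derived from the matrix size.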

Tensor Cores require reduced-precision input data such as FP16 or INT8; they do not accelerate 64-bit int or double on this GPU. You can also find an example in our CUDA sample folder:
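Independently of that sample, the following is a minimal sketch of how Tensor Cores can be reached from CUDA C++ through the WMMA API in `mma.h`. It multiplies a single 16x16 tile with FP16 inputs and FP32 accumulation using one warp. The names `wmma_tile` and `wmma_demo` and the constant test data are assumptions for illustration, and the code assumes a GPU of compute capability 7.0 or higher (e.g. compiled with `-arch=sm_72` for Xavier NX).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_fp16.h>
#include <mma.h>
#include <cuda_runtime.h>

using namespace nvcuda;

// One warp computes a single 16x16 tile: C = A * B (FP16 in, FP32 out).
__global__ void wmma_tile(const half* A, const half* B, float* C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);          // start from a zero accumulator
    wmma::load_matrix_sync(a, A, 16);      // leading dimension is 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);            // c = a * b + c on Tensor Cores
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}

// Runs the tile multiply and returns C[0], or -1.0f if a CUDA call failed.
float wmma_demo()
{
    const int elems = 16 * 16;
    std::vector<half> hA(elems), hB(elems);
    for (int i = 0; i < elems; ++i) {
        hA[i] = __float2half(1.0f);
        hB[i] = __float2half(2.0f);
    }
    std::vector<float> hC(elems, 0.0f);

    half *dA = nullptr, *dB = nullptr;
    float* dC = nullptr;
    cudaMalloc(&dA, elems * sizeof(half));
    cudaMalloc(&dB, elems * sizeof(half));
    cudaMalloc(&dC, elems * sizeof(float));
    cudaMemcpy(dA, hA.data(), elems * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), elems * sizeof(half), cudaMemcpyHostToDevice);

    wmma_tile<<<1, 32>>>(dA, dB, dC);      // WMMA operations are warp-wide
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        cudaMemcpy(hC.data(), dC, elems * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err == cudaSuccess ? hC[0] : -1.0f;
}

int main()
{
    // With A filled with 1 and B with 2, each element is 16 * 1 * 2 = 32.
    std::printf("C[0] = %.1f (expected 32.0)\n", wmma_demo());
    return 0;
}
```

A full matrix multiply would assign one such tile (or several) to each warp and loop over the K dimension; the sketch only shows the basic load, multiply-accumulate, and store sequence.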