Need help implementing matrix multiplication using shared memory in Numba

Hi All,
The Numba documentation (Examples — Numba 0.52.0.dev0+274.g626b40e-py3.7-linux-x86_64.egg documentation) gives code for a blocked matrix multiplication algorithm that uses shared memory. There seems to be a discrepancy in the computed result when the matrices have unequal dimensions, for example (1024, 512) x (512, 1024). Has anyone run into a similar issue?
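
For comparison, here is a minimal CPU sketch of what a blocked algorithm should compute for non-square inputs. It is plain Python, not the Numba kernel from the documentation, and the names `blocked_matmul`, `naive_matmul`, and the tile width `TPB = 4` are assumptions made for illustration. It rests on two assumptions about a correct tiled kernel: the tile loop runs over the shared dimension (here `k`) rather than over the number of blocks covering the output, and every tile load and the final write are bounds-checked. Whether either point explains the discrepancy in the documented example is not confirmed here.

```python
import math

TPB = 4  # tile width; stands in for the threads-per-block value of a CUDA kernel


def blocked_matmul(A, B, tpb=TPB):
    """Emulate a tiled (shared-memory style) product of A (m x k) and B (k x n)."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    assert len(B) == k, "inner dimensions must match"
    C = [[0.0] * n for _ in range(m)]
    # Number of tiles along the shared dimension k. This does not depend on
    # how many blocks cover C, which matters once the matrices are not square.
    tiles_k = math.ceil(k / tpb)
    for by in range(math.ceil(m / tpb)):      # block index along the rows of C
        for bx in range(math.ceil(n / tpb)):  # block index along the columns of C
            acc = [[0.0] * tpb for _ in range(tpb)]  # one accumulator per emulated thread
            for t in range(tiles_k):
                # Stand-ins for the per-block shared arrays, zero-padded
                # wherever the tile reaches past the edge of A or B.
                sA = [[0.0] * tpb for _ in range(tpb)]
                sB = [[0.0] * tpb for _ in range(tpb)]
                for ty in range(tpb):
                    for tx in range(tpb):
                        r, c = by * tpb + ty, t * tpb + tx
                        if r < m and c < k:
                            sA[ty][tx] = A[r][c]
                        r, c = t * tpb + ty, bx * tpb + tx
                        if r < k and c < n:
                            sB[ty][tx] = B[r][c]
                # Each emulated thread (ty, tx) adds its partial dot product.
                for ty in range(tpb):
                    for tx in range(tpb):
                        for j in range(tpb):
                            acc[ty][tx] += sA[ty][j] * sB[j][tx]
            # Guarded write-back: only positions inside C are stored.
            for ty in range(tpb):
                for tx in range(tpb):
                    r, c = by * tpb + ty, bx * tpb + tx
                    if r < m and c < n:
                        C[r][c] = acc[ty][tx]
    return C


def naive_matmul(A, B):
    """Straightforward triple-loop reference product."""
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


# Scaled-down analogue of (1024, 512) x (512, 1024): shapes (8, 4) x (4, 8).
A = [[float(i + j) for j in range(4)] for i in range(8)]
B = [[float(i * j + 1) for j in range(8)] for i in range(4)]
print(blocked_matmul(A, B) == naive_matmul(A, B))
```

Comparing a kernel's output against a reference like `naive_matmul` on small non-square shapes, including shapes that are not multiples of the tile width, may help narrow down whether the mismatch comes from the tile count or from missing bounds checks.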

Thanks