CUDA tiling in 3D grids and 3D blocks with shared memory

Hi, I am a beginner with CUDA and want to use it for a matrix-matrix multiplication. The algorithm to be optimized is as follows:
    for (int a = 0; a < N; a++)
        for (int b = 0; b < N; b++)
            for (int c = 0; c < N; c++)
                for (int d = 0; d < N; d++)
                    sum[a][b][c] = sum[a][b][c] + A[a][b][d] * C[d][c];
I am not very comfortable with it yet. I tried to use tiling and shared memory in the code; below is the CUDA code that I have written:

I am not sure whether it is correct. Can someone help me?
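
One way to check a tiled kernel for this loop nest is to compare it against a plain CPU version. The following is only a minimal sketch in C, not the CUDA code from the question. It assumes flattened row-major arrays, a `sum` that starts at zero, and made-up names (`N`, `TILE`, `reference`, `tiled`, `tiled_vs_reference_max_diff`). It relies on the observation that `a` and `b` can be fused into a single row index `r = a * N + b`, so the loop is an ordinary (N*N x N) times (N x N) product, and it walks that product tile by tile the way a shared-memory kernel would.

```c
#include <assert.h>
#include <stdlib.h>

#define N 6     /* assumed problem size; deliberately not a multiple of TILE */
#define TILE 4  /* assumed tile width; would be the thread-block edge on the GPU */

/* The loop nest from the question on flattened arrays (sum assumed to start at zero). */
static void reference(const float *A, const float *C, float *sum)
{
    for (int a = 0; a < N; a++)
        for (int b = 0; b < N; b++)
            for (int c = 0; c < N; c++) {
                float acc = 0.0f;
                for (int d = 0; d < N; d++)
                    acc += A[(a * N + b) * N + d] * C[d * N + c];
                sum[(a * N + b) * N + c] = acc;
            }
}

/* Same result, computed one TILE x TILE output tile at a time. */
static void tiled(const float *A, const float *C, float *sum)
{
    const int rows = N * N; /* (a, b) fused into one row index r = a * N + b */
    for (int r0 = 0; r0 < rows; r0 += TILE)
        for (int c0 = 0; c0 < N; c0 += TILE) {
            float acc[TILE][TILE] = {{0.0f}};
            for (int d0 = 0; d0 < N; d0 += TILE) {
                /* Load phase: stands in for the shared-memory tiles of a kernel.
                   Out-of-range elements are padded with zero. */
                float tA[TILE][TILE], tC[TILE][TILE];
                for (int i = 0; i < TILE; i++)
                    for (int j = 0; j < TILE; j++) {
                        tA[i][j] = (r0 + i < rows && d0 + j < N)
                                       ? A[(r0 + i) * N + d0 + j] : 0.0f;
                        tC[i][j] = (d0 + i < N && c0 + j < N)
                                       ? C[(d0 + i) * N + c0 + j] : 0.0f;
                    }
                /* Compute phase: multiply the two loaded tiles. */
                for (int i = 0; i < TILE; i++)
                    for (int j = 0; j < TILE; j++)
                        for (int k = 0; k < TILE; k++)
                            acc[i][j] += tA[i][k] * tC[k][j];
            }
            /* Store phase: write back only the in-range part of the tile. */
            for (int i = 0; i < TILE; i++)
                for (int j = 0; j < TILE; j++)
                    if (r0 + i < rows && c0 + j < N)
                        sum[(r0 + i) * N + c0 + j] = acc[i][j];
        }
}

/* Runs both versions on small integer-valued inputs and returns the largest difference. */
float tiled_vs_reference_max_diff(void)
{
    float *A = malloc(sizeof(float) * N * N * N);
    float *C = malloc(sizeof(float) * N * N);
    float *s1 = malloc(sizeof(float) * N * N * N);
    float *s2 = malloc(sizeof(float) * N * N * N);
    assert(A && C && s1 && s2);
    for (int i = 0; i < N * N * N; i++) A[i] = (float)(i % 7 - 3);
    for (int i = 0; i < N * N; i++) C[i] = (float)(i % 5 - 2);
    reference(A, C, s1);
    tiled(A, C, s2);
    float max_diff = 0.0f;
    for (int i = 0; i < N * N * N; i++) {
        float diff = s1[i] - s2[i];
        if (diff < 0.0f) diff = -diff;
        if (diff > max_diff) max_diff = diff;
    }
    free(A); free(C); free(s1); free(s2);
    return max_diff;
}
```

In a CUDA port of this sketch, each `(r0, c0)` tile would typically map to one thread block, `tA` and `tC` would become `__shared__` arrays, and a `__syncthreads()` would separate the load phase from the compute phase. A 2D grid over `(r, c)` is enough under this fused-index view, although a 3D grid over `(a, b, c)` can be made to work as well.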

Hello, this forum is dedicated to discussions related to using the sanitizer tools and API.
Questions related to CUDA programming can be raised at the CUDA - NVIDIA Developer Forums.