Need for __syncthreads() when threads access disjoint sets of addresses

Hi Guys,
I’m writing a matmul kernel and trying different approaches to measure their performance.
Input and kernel config:
Both input matrices A and B are 2048 x 2048. The grid dimension is (32, 32, 1) and the block dimension is (8, 8, 1).
My kernel setup is as follows:
Each block is responsible for computing a 64 x 64 sub-matrix of the result matrix. Since each block has 8 x 8 threads, each thread computes an 8 x 8 sub-matrix of the result.
The kernel also uses shared memory for two 64 x 64 tiles, one from matrix A and one from matrix B.
For instance, since the tile size is 8, over the whole computation each thread reads an 8 x 2048 slice of matrix A and a 2048 x 8 slice of matrix B through shared memory to produce its 8 x 8 output sub-matrix. Because of the 48 KB shared-memory limit per thread block, I split the shared-memory loading into multiple time steps, where in each time step a single thread loads strides of size 64.
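To sanity-check the sizes, here is my own back-of-the-envelope arithmetic based on the figures above (not part of the kernel, just bookkeeping):

  // grid  = 32 x 32 blocks, block = 8 x 8 threads
  // output tile per block  : 64 x 64  (8 x 8 threads, each computing 8 x 8 elements)
  // shared memory per block: 2 tiles * 64 * 64 * sizeof(int) = 2 * 16 KB = 32 KB  (< 48 KB)
  // time steps (chunks)    : 2048 / 64 = 32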

__global__ void matmul_2048(const int *A, const int *B, int nrows, 
                     int ncols, int tile_sz, int *result) {
    int global_tidx   = blockIdx.x * blockDim.x + threadIdx.x;
    int global_tidy   = blockIdx.y * blockDim.y + threadIdx.y;
    int thr_mat_range;
    int startRowIdx2D = global_tidx * tile_sz;
    int startColIdx2D = global_tidy * tile_sz;
    
    const int chunk_size = 64;
    __shared__ int matA[chunk_size][chunk_size], matB[chunk_size][chunk_size];
    int chunks = nrows / chunk_size;

    /*
    0. Split rows into chunk_size chunks (aka Timesteps)
    1. Load 64 x 64 submatrix of A into SM matA
    2. Load 64 x 64 submatrix of B into SM matB
    3. Perform 64 x 64 matmul operation and write to global memory.
    */
    for (int i=0; i < chunks; i++) {
      //corresponding start row index of 2D matrix A, the
      //thread is responsible for.
      startRowIdx2D = global_tidx * tile_sz;
      for (int t=0; t < tile_sz; t++) {
        int rowIdx1D  = startRowIdx2D * ncols + i * chunk_size;
        thr_mat_range = threadIdx.x * tile_sz + t;
        for (int j=0; j < chunk_size; j++) {
          matA[thr_mat_range][j] = A[rowIdx1D + j];
        }
        startRowIdx2D++;
      }

      //corresponding start column index of 2D matrix B, the
      //thread is responsible for.      
      startColIdx2D = global_tidy * tile_sz;
      int startRow = i * chunk_size;
      for (int t=0; t < tile_sz; t++) {
        thr_mat_range = threadIdx.y * tile_sz + t;
        for (int j=0; j < chunk_size; j++) {
          int colIdx1D = (startRow + j) * ncols + startColIdx2D;
          matB[j][thr_mat_range] = B[colIdx1D];
        }
        startColIdx2D++;
      }
      //why we need syncthreads for correctness??
      __syncthreads();

      //start target row of the output 2D matrix
      int outrow = global_tidx * tile_sz;
      //perform matmul of 64x64 chunks from both matrix A and B.
      for (int rowt=0; rowt<tile_sz; rowt++) {
        int row = threadIdx.x * tile_sz + rowt;
        //start target column of the output 2D matrix.
        int outcol = global_tidy * tile_sz;
        for (int colt=0; colt<tile_sz; colt++) {
          int col  = threadIdx.y * tile_sz + colt;
          int temp = 0;
          for (int k=0; k < chunk_size; k++) {
            temp += matA[row][k] * matB[k][col];
          }
          result[outrow * ncols + outcol] += temp;
          outcol++;
        }
        outrow++;
      }
    }
    return;
}

Each thread in the entire grid accesses an independent set of addresses, with no overlap between any two threads. I’m not sure why I need __syncthreads() after loading the sub-matrices into shared memory and before performing the matmul. Since each thread accesses different addresses, I don’t think there is a data race. I have also run the kernel under the compute-sanitizer racecheck tool, and it didn’t report any errors.

Interestingly, if I replace temp with a direct write to the result array, I get more mismatches between the expected (CPU) and produced (GPU) values. Delaying the write with an additional register (temp) produces fewer mismatches.

Could you please help me understand why __syncthreads() is needed here for correctness?
Please let me know if I’m making any mistakes.
Thank you very much,

Presumably that claim (that no two threads ever access the same address) must be false.

test case:

# cat t272.cu
#include <cstdio>

__global__ void matmul_2048(const int *A, const int *B, int nrows,
                     int ncols, int tile_sz, int *result) {
    int global_tidx   = blockIdx.x * blockDim.x + threadIdx.x;
    int global_tidy   = blockIdx.y * blockDim.y + threadIdx.y;
    int thr_mat_range;
    int startRowIdx2D = global_tidx * tile_sz;
    int startColIdx2D = global_tidy * tile_sz;

    const int chunk_size = 64;
    __shared__ int matA[chunk_size][chunk_size], matB[chunk_size][chunk_size];
    int chunks = nrows / chunk_size;

    /*
    0. Split rows into chunk_size chunks (aka Timesteps)
    1. Load 64 x 64 submatrix of A into SM matA
    2. Load 64 x 64 submatrix of B into SM matB
    3. Perform 64 x 64 matmul operation and write to global memory.
    */
    for (int i=0; i < chunks; i++) {
      //corresponding start row index of 2D matrix A, the
      //thread is responsible for.
      startRowIdx2D = global_tidx * tile_sz;
      for (int t=0; t < tile_sz; t++) {
        int rowIdx1D  = startRowIdx2D * ncols + i * chunk_size;
        thr_mat_range = threadIdx.x * tile_sz + t;
        for (int j=0; j < chunk_size; j++) {
          if ((thr_mat_range == 7) && (j == 63) && !i && !blockIdx.x && !blockIdx.y) printf("thread %d,%d is loading matA 7,63\n", threadIdx.x, threadIdx.y);
          if (!i && !threadIdx.x && !threadIdx.y && !blockIdx.x && !blockIdx.y) printf("LA: %d,%d\n", thr_mat_range, j);
          matA[thr_mat_range][j] = A[rowIdx1D + j];
          if (!i && !threadIdx.x && !threadIdx.y && !blockIdx.x && !blockIdx.y) matA[thr_mat_range][j] = 1234;
        }
        startRowIdx2D++;
      }

      //corresponding start column index of 2D matrix B, the
      //thread is responsible for.
      startColIdx2D = global_tidy * tile_sz;
      int startRow = i * chunk_size;
      for (int t=0; t < tile_sz; t++) {
        thr_mat_range = threadIdx.y * tile_sz + t;
        for (int j=0; j < chunk_size; j++) {
          int colIdx1D = (startRow + j) * ncols + startColIdx2D;
          if (!i && !threadIdx.x && !threadIdx.y && !blockIdx.x && !blockIdx.y) printf("LB: %d,%d\n", j, thr_mat_range);
          matB[j][thr_mat_range] = B[colIdx1D];
          if (!i && !threadIdx.x && !threadIdx.y && !blockIdx.x && !blockIdx.y) matB[j][thr_mat_range] = 4321;
        }
        startColIdx2D++;
      }
      //why we need syncthreads for correctness??
      __syncthreads();

      //start target row of the output 2D matrix
      int outrow = global_tidx * tile_sz;
      //perform matmul of 64x64 chunks from both matrix A and B.
      for (int rowt=0; rowt<tile_sz; rowt++) {
        int row = threadIdx.x * tile_sz + rowt;
        //start target column of the output 2D matrix.
        int outcol = global_tidy * tile_sz;
        for (int colt=0; colt<tile_sz; colt++) {
          int col  = threadIdx.y * tile_sz + colt;
          int temp = 0;
          for (int k=0; k < chunk_size; k++) {
            if (!i && !threadIdx.x && !threadIdx.y && !blockIdx.x && !blockIdx.y) printf("UA: %d,%d\n", row, k);
            if (!i && !threadIdx.x && !threadIdx.y && !blockIdx.x && !blockIdx.y) printf("UB: %d,%d\n", k, col);
            temp += matA[row][k] * matB[k][col];
            if (!i && !threadIdx.x && !threadIdx.y && !blockIdx.x && !blockIdx.y) {
                    if (matA[row][k] != 1234) printf("oopsA %d,%d\n", row, k);
                    if (matB[k][col] != 4321) printf("oopsB %d,%d\n", k, col);}
          }
          result[outrow * ncols + outcol] += temp;
          outcol++;
        }
        outrow++;
      }
    }
    return;
}

int main(){
  int *A, *B, *result;
  const int ms = 2048;
  const int gs = 32;
  const int bs = 8;
  cudaMalloc(&A, sizeof(int)*ms*ms);
  cudaMalloc(&B, sizeof(int)*ms*ms);
  cudaMalloc(&result, sizeof(int)*ms*ms);
  cudaMemset(A, 0, sizeof(int)*ms*ms);
  cudaMemset(B, 0, sizeof(int)*ms*ms);
  matmul_2048<<<dim3(gs,gs), dim3(bs,bs)>>>(A, B, ms, ms, bs, result);
  cudaDeviceSynchronize();
}

# nvcc -o t272 t272.cu
# ./t272 |grep "7,63"
thread 0,4 is loading matA 7,63
thread 0,5 is loading matA 7,63
thread 0,6 is loading matA 7,63
thread 0,7 is loading matA 7,63
thread 0,0 is loading matA 7,63
thread 0,1 is loading matA 7,63
thread 0,2 is loading matA 7,63
thread 0,3 is loading matA 7,63
LA: 7,63
UA: 7,63
oopsA 7,63
UA: 7,63
oopsA 7,63
UA: 7,63
oopsA 7,63
UA: 7,63
oopsA 7,63
UA: 7,63
oopsA 7,63
UA: 7,63
oopsA 7,63
UA: 7,63
oopsA 7,63
UA: 7,63
oopsA 7,63
#

So in a particular threadblock (0,0), for a particular chunk (0), you have multiple threads loading the same location 7,63 in the matA shared chunk. That can’t be right and doesn’t make sense, and it also gives rise to the situation where a particular thread ends up using a value that it didn’t load, which is a violation of your claim.

You probably need to work on your indexing; I doubt it is doing what you think/claim.
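To make the diagnosis concrete, here is a rough sketch (my own, not necessarily the only fix) of a cooperative load for the matA tile. In the original kernel the shared row index thr_mat_range depends only on threadIdx.x, so all 8 threads that share a threadIdx.x (but differ in threadIdx.y) write the same matA rows. A common alternative is to let the 64 threads of the block each own a distinct set of rows; the names (matA, A, ncols, chunk_size, i) follow the kernel above, and the thread-to-row mapping is an assumption of this sketch:

  // Cooperative load of one 64 x 64 tile of A, each thread owning distinct rows.
  int tid = threadIdx.y * blockDim.x + threadIdx.x;   // 0 .. 63, unique within the block
  int nthreads = blockDim.x * blockDim.y;             // 64
  for (int r = tid; r < chunk_size; r += nthreads) {  // disjoint rows per thread
    int globalRow = blockIdx.x * chunk_size + r;      // row of A this block needs
    for (int j = 0; j < chunk_size; j++) {
      matA[r][j] = A[globalRow * ncols + i * chunk_size + j];
    }
  }
  __syncthreads();  // required: threads will read rows loaded by other threads

With a load like this, each thread writes a disjoint part of the tile but reads parts written by its neighbours, which is exactly why the barrier is needed.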

Hi @Robert_Crovella, thanks a lot for your help. Now I understand what the issue is: it is a result of bad indexing when loading the tiles into shared memory. Your script is handy for testing candidate fixes. In my code, __syncthreads() just hides the fact that different threads write the same value to the same shared-memory address in each time step. In other words, it just hides the real issue with the code.
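For completeness, my current understanding of the original question (a sketch only, not verified against a fully corrected kernel): once the load is rewritten so each thread fills only a distinct part of matA/matB, every thread consumes tile elements written by other threads, so __syncthreads() after the load is genuinely required; a second barrier at the end of the chunk loop is also needed so that no thread starts overwriting the tiles for chunk i+1 while another thread is still reading them for chunk i. Roughly:

  for (int i = 0; i < chunks; i++) {
    // ... cooperative load of matA and matB for chunk i ...
    __syncthreads();   // make all tile elements visible before anyone reads them

    // ... per-thread 8 x 8 accumulation using matA and matB ...

    __syncthreads();   // keep threads from reloading the tiles for chunk i+1
                       // while slower threads are still reading chunk i
  }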