Need for __syncthreads() when threads access disjoint sets of addresses

Hi Guys,
I’m writing a matmul kernel and trying different approaches to measure their performance.
Input and kernel config:
Both input matrices A and B are 2048 x 2048. The grid dimension is (32, 32, 1) and the block dimension is (8, 8, 1).
My kernel setup is as follows:
Each block is responsible for computing a 64 x 64 sub-matrix of the result matrix. Since each block has 8 x 8 threads, each thread computes an 8 x 8 sub-matrix of the result.
The kernel also uses shared memory for two 64 x 64 tiles, one from matrix A and one from matrix B.
For instance, since the tile size is 8, over the whole computation each thread reads an 8 x 2048 slice of matrix A and a 2048 x 8 slice of matrix B through shared memory to produce its 8 x 8 output sub-matrix. Because of the 48 KB shared-memory limit per thread block, I split the shared-memory loading into multiple time steps, where in each time step a single thread loads strides of size 64.
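To sanity-check the sizes, here is my own back-of-the-envelope arithmetic based on the figures above (not part of the kernel, just bookkeeping):

  // grid  = 32 x 32 blocks, block = 8 x 8 threads
  // output tile per block  : 64 x 64  (8 x 8 threads, each computing 8 x 8 elements)
  // shared memory per block: 2 tiles * 64 * 64 * sizeof(int) = 2 * 16 KB = 32 KB  (< 48 KB)
  // time steps (chunks)    : 2048 / 64 = 32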

__global__ void matmul_2048(const int *A, const int *B, int nrows, 
                     int ncols, int tile_sz, int *result) {
    int global_tidx   = blockIdx.x * blockDim.x + threadIdx.x;
    int global_tidy   = blockIdx.y * blockDim.y + threadIdx.y;
    int thr_mat_range;
    int startRowIdx2D = global_tidx * tile_sz;
    int startColIdx2D = global_tidy * tile_sz;
    
    const int chunk_size = 64;
    __shared__ int matA[chunk_size][chunk_size], matB[chunk_size][chunk_size];
    int chunks = nrows / chunk_size;

    /*
    0. Split rows into chunk_size chunks (aka Timesteps)
    1. Load 64 x 64 submatrix of A into SM matA
    2. Load 64 x 64 submatrix of B into SM matB
    3. Perform 64 x 64 matmul operation and write to global memory.
    */
    for (int i=0; i < chunks; i++) {
      //corresponding start row index of 2D matrix A, the
      //thread is responsible for.
      startRowIdx2D = global_tidx * tile_sz;
      for (int t=0; t < tile_sz; t++) {
        int rowIdx1D  = startRowIdx2D * ncols + i * chunk_size;
        thr_mat_range = threadIdx.x * tile_sz + t;
        for (int j=0; j < chunk_size; j++) {
          matA[thr_mat_range][j] = A[rowIdx1D + j];
        }
        startRowIdx2D++;
      }

      //corresponding start column index of 2D matrix B, the
      //thread is responsible for.      
      startColIdx2D = global_tidy * tile_sz;
      int startRow = i * chunk_size;
      for (int t=0; t < tile_sz; t++) {
        thr_mat_range = threadIdx.y * tile_sz + t;
        for (int j=0; j < chunk_size; j++) {
          int colIdx1D = (startRow + j) * ncols + startColIdx2D;
          matB[j][thr_mat_range] = B[colIdx1D];
        }
        startColIdx2D++;
      }
      //why we need syncthreads for correctness??
      __syncthreads();

      //start target row of the output 2D matrix
      int outrow = global_tidx * tile_sz;
      //perform matmul of 64x64 chunks from both matrix A and B.
      for (int rowt=0; rowt<tile_sz; rowt++) {
        int row = threadIdx.x * tile_sz + rowt;
        //start target column of the output 2D matrix.
        int outcol = global_tidy * tile_sz;
        for (int colt=0; colt<tile_sz; colt++) {
          int col  = threadIdx.y * tile_sz + colt;
          int temp = 0;
          for (int k=0; k < chunk_size; k++) {
            temp += matA[row][k] * matB[k][col];
          }
          result[outrow * ncols + outcol] += temp;
          outcol++;
        }
        outrow++;
      }
    }
    return;
}

Each thread in the entire grid accesses an independent set of addresses, with no overlap between any two threads. I’m not sure why I need __syncthreads() after loading the sub-matrices into shared memory and before performing the matmul. Since each thread accesses different addresses, I don’t think there is a data race. I have also run the kernel under the compute-sanitizer racecheck tool, and it didn’t report any errors.

Interestingly, if I replace temp with a direct write to the result array, I get more mismatches between the expected (CPU) and produced (GPU) values. Delaying the write with an additional register (temp) produces fewer mismatches.

Could you please help me understand why __syncthreads() is needed here for correctness?
Please let me know if I’m making any mistakes.
Thank you very much,

Presumably that claim (that no two threads ever access the same address) must be false.

test case:

# cat t272.cu
#include <cstdio>

__global__ void matmul_2048(const int *A, const int *B, int nrows,
                     int ncols, int tile_sz, int *result) {
    int global_tidx   = blockIdx.x * blockDim.x + threadIdx.x;
    int global_tidy   = blockIdx.y * blockDim.y + threadIdx.y;
    int thr_mat_range;
    int startRowIdx2D = global_tidx * tile_sz;
    int startColIdx2D = global_tidy * tile_sz;

    const int chunk_size = 64;
    __shared__ int matA[chunk_size][chunk_size], matB[chunk_size][chunk_size];
    int chunks = nrows / chunk_size;

    /*
    0. Split rows into chunk_size chunks (aka Timesteps)
    1. Load 64 x 64 submatrix of A into SM matA
    2. Load 64 x 64 submatrix of B into SM matB
    3. Perform 64 x 64 matmul operation and write to global memory.
    */
    for (int i=0; i < chunks; i++) {
      //corresponding start row index of 2D matrix A, the
      //thread is responsible for.
      startRowIdx2D = global_tidx * tile_sz;
      for (int t=0; t < tile_sz; t++) {
        int rowIdx1D  = startRowIdx2D * ncols + i * chunk_size;
        thr_mat_range = threadIdx.x * tile_sz + t;
        for (int j=0; j < chunk_size; j++) {
          if ((thr_mat_range == 7) && (j == 63) && !i && !blockIdx.x && !blockIdx.y) printf("thread %d,%d is loading matA 7,63\n", threadIdx.x, threadIdx.y);
          if (!i && !threadIdx.x && !threadIdx.y && !blockIdx.x && !blockIdx.y) printf("LA: %d,%d\n", thr_mat_range, j);
          matA[thr_mat_range][j] = A[rowIdx1D + j];
          if (!i && !threadIdx.x && !threadIdx.y && !blockIdx.x && !blockIdx.y) matA[thr_mat_range][j] = 1234;
        }
        startRowIdx2D++;
      }

      //corresponding start column index of 2D matrix B, the
      //thread is responsible for.
      startColIdx2D = global_tidy * tile_sz;
      int startRow = i * chunk_size;
      for (int t=0; t < tile_sz; t++) {
        thr_mat_range = threadIdx.y * tile_sz + t;
        for (int j=0; j < chunk_size; j++) {
          int colIdx1D = (startRow + j) * ncols + startColIdx2D;
          if (!i && !threadIdx.x && !threadIdx.y && !blockIdx.x && !blockIdx.y) printf("LB: %d,%d\n", j, thr_mat_range);
          matB[j][thr_mat_range] = B[colIdx1D];
          if (!i && !threadIdx.x && !threadIdx.y && !blockIdx.x && !blockIdx.y) matB[j][thr_mat_range] = 4321;
        }
        startColIdx2D++;
      }
      //why we need syncthreads for correctness??
      __syncthreads();

      //start target row of the output 2D matrix
      int outrow = global_tidx * tile_sz;
      //perform matmul of 64x64 chunks from both matrix A and B.
      for (int rowt=0; rowt<tile_sz; rowt++) {
        int row = threadIdx.x * tile_sz + rowt;
        //start target column of the output 2D matrix.
        int outcol = global_tidy * tile_sz;
        for (int colt=0; colt<tile_sz; colt++) {
          int col  = threadIdx.y * tile_sz + colt;
          int temp = 0;
          for (int k=0; k < chunk_size; k++) {
            if (!i && !threadIdx.x && !threadIdx.y && !blockIdx.x && !blockIdx.y) printf("UA: %d,%d\n", row, k);
            if (!i && !threadIdx.x && !threadIdx.y && !blockIdx.x && !blockIdx.y) printf("UB: %d,%d\n", k, col);
            temp += matA[row][k] * matB[k][col];
            if (!i && !threadIdx.x && !threadIdx.y && !blockIdx.x && !blockIdx.y) {
                    if (matA[row][k] != 1234) printf("oopsA %d,%d\n", row, k);
                    if (matB[k][col] != 4321) printf("oopsB %d,%d\n", k, col);}
          }
          result[outrow * ncols + outcol] += temp;
          outcol++;
        }
        outrow++;
      }
    }
    return;
}

int main(){
  int *A, *B, *result;
  const int ms = 2048;
  const int gs = 32;
  const int bs = 8;
  cudaMalloc(&A, sizeof(int)*ms*ms);
  cudaMalloc(&B, sizeof(int)*ms*ms);
  cudaMalloc(&result, sizeof(int)*ms*ms);
  cudaMemset(A, 0, sizeof(int)*ms*ms);
  cudaMemset(B, 0, sizeof(int)*ms*ms);
  matmul_2048<<<dim3(gs,gs), dim3(bs,bs)>>>(A, B, ms, ms, bs, result);
  cudaDeviceSynchronize();
}

# nvcc -o t272 t272.cu
# ./t272 |grep "7,63"
thread 0,4 is loading matA 7,63
thread 0,5 is loading matA 7,63
thread 0,6 is loading matA 7,63
thread 0,7 is loading matA 7,63
thread 0,0 is loading matA 7,63
thread 0,1 is loading matA 7,63
thread 0,2 is loading matA 7,63
thread 0,3 is loading matA 7,63
LA: 7,63
UA: 7,63
oopsA 7,63
UA: 7,63
oopsA 7,63
UA: 7,63
oopsA 7,63
UA: 7,63
oopsA 7,63
UA: 7,63
oopsA 7,63
UA: 7,63
oopsA 7,63
UA: 7,63
oopsA 7,63
UA: 7,63
oopsA 7,63
#

So in a particular threadblock (0,0), for a particular chunk (0), you have multiple threads loading the same location 7,63 in the matA shared chunk. That can’t be right and doesn’t make sense, and it also gives rise to the situation where a particular thread ends up using a value that it didn’t load, which is a violation of your claim.

You probably need to work on your indexing; I doubt it is doing what you think/claim.
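To make the diagnosis concrete, here is a rough sketch (my own, not necessarily the only fix) of a cooperative load for the matA tile. In the original kernel the shared row index thr_mat_range depends only on threadIdx.x, so all 8 threads that share a threadIdx.x (but differ in threadIdx.y) write the same matA rows. A common alternative is to let the 64 threads of the block each own a distinct set of rows; the names (matA, A, ncols, chunk_size, i) follow the kernel above, and the thread-to-row mapping is an assumption of this sketch:

  // Cooperative load of one 64 x 64 tile of A, each thread owning distinct rows.
  int tid = threadIdx.y * blockDim.x + threadIdx.x;   // 0 .. 63, unique within the block
  int nthreads = blockDim.x * blockDim.y;             // 64
  for (int r = tid; r < chunk_size; r += nthreads) {  // disjoint rows per thread
    int globalRow = blockIdx.x * chunk_size + r;      // row of A this block needs
    for (int j = 0; j < chunk_size; j++) {
      matA[r][j] = A[globalRow * ncols + i * chunk_size + j];
    }
  }
  __syncthreads();  // required: threads will read rows loaded by other threads

With a load like this, each thread writes a disjoint part of the tile but reads parts written by its neighbours, which is exactly why the barrier is needed.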

Hi @Robert_Crovella, thanks a lot for your help. Now I understand what the issue is: it is a result of bad indexing when loading the tiles into shared memory. Your script is handy for testing candidate fixes. In my code, __syncthreads() just hides the fact that different threads write the same value to the same shared-memory address in each time step. In other words, it just hides the real issue with the code.
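For completeness, my current understanding of the original question (a sketch only, not verified against a fully corrected kernel): once the load is rewritten so each thread fills only a distinct part of matA/matB, every thread consumes tile elements written by other threads, so __syncthreads() after the load is genuinely required; a second barrier at the end of the chunk loop is also needed so that no thread starts overwriting the tiles for chunk i+1 while another thread is still reading them for chunk i. Roughly:

  for (int i = 0; i < chunks; i++) {
    // ... cooperative load of matA and matB for chunk i ...
    __syncthreads();   // make all tile elements visible before anyone reads them

    // ... per-thread 8 x 8 accumulation using matA and matB ...

    __syncthreads();   // keep threads from reloading the tiles for chunk i+1
                       // while slower threads are still reading chunk i
  }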