Subtle problem with matrix-matrix multiplication

Hi all,

I am writing a general matrix-matrix multiplication kernel and have run into a subtle problem.
I have checked the code many times and it seems correct to me.

[A_Rows,A_Cols] x [B_Rows,B_Cols] = [C_Rows,C_Cols]
So I know that for matrix multiplication A_Cols == B_Rows, and the result has C_Rows = A_Rows and C_Cols = B_Cols.
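For example, with the sizes used in the code below, [500,250] x [250,550] gives a [500,550] result.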

Here is my code

#include <iostream>
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

const int NUM_REPS = 100;

using namespace std;

// Convenience function for checking CUDA runtime API results
// can be wrapped around any runtime API call. No-op in release builds.
inline cudaError_t checkCuda(cudaError_t result)
{
  if (result != cudaSuccess)
  {
    fprintf(stderr, "File %s, Line %d: CUDA Runtime Error: %s\n", __FILE__, __LINE__, cudaGetErrorString(result));
    exit(EXIT_FAILURE);
  }
  return result;
}
__global__ void matrixMultiplicationKernel(const float *__restrict__ A, const int A_Rows, const int A_Cols, const float *__restrict__ B, const int B_Rows, const int B_Cols, float *__restrict__ C, const int C_Rows, const int C_Cols)
{
  const int ROW = blockIdx.y * blockDim.y + threadIdx.y;
  const int COL = blockIdx.x * blockDim.x + threadIdx.x;

  float tmpSum = 0;

  //printf("(%d * %d) + %d = X:%d, (%d * %d) + %d = Y:%d\n",  blockIdx.y, blockDim.y, threadIdx.y, ROW,blockIdx.x, blockDim.x, threadIdx.x, COL );

  //printf("(%d,%d)\n", ROW, COL);

  if (ROW < C_Rows && COL < C_Cols)
  {
    // each thread computes one element of the block sub-matrix
    for (int i = 0; i < A_Cols; i++)
    {
      tmpSum += A[ROW * A_Cols + i] * B[i * B_Cols + COL];
    }
    C[ROW * C_Cols + COL] = tmpSum;
  }
}

void matrixMultiplication(const float *__restrict__ A, const int A_Rows, const int A_Cols, const float *__restrict__ B, const int B_Rows, const int B_Cols, float *__restrict__ C)
{
  // Declare the number of blocks per grid and the number of threads per block.
  // These 1x1 defaults are overwritten below when the matrix dimensions are compatible.
  dim3 threadsPerBlock(1, 1);
  dim3 blocksPerGrid(1, 1);

  if (A_Cols == B_Rows)
  {
    const int R_C_Rows = A_Rows;
    const int R_C_Cols = B_Cols;

    // Maximum number of threads per block is 1024 (32 x 32 = 1024 here)
    threadsPerBlock.x = 32;
    threadsPerBlock.y = 32;

    blocksPerGrid.x = (R_C_Rows + 31) / 32;
    blocksPerGrid.y = (R_C_Cols + 31) / 32;

    printf("M_C[%d x %d]. grid config = (%d,%d) for block config = (32x32)\n",R_C_Rows, R_C_Cols, blocksPerGrid.x, blocksPerGrid.y);

    matrixMultiplicationKernel<<<blocksPerGrid, threadsPerBlock>>>(A, A_Rows, A_Cols, B, B_Rows, B_Cols, C, R_C_Rows, R_C_Cols);
    checkCuda(cudaDeviceSynchronize());
  }
}

void initializeMatrix(float *M, const int M_Rows, const int M_Cols, float value)
{
  for (int i = 0; i < M_Rows; ++i)
    for (int j = 0; j < M_Cols; ++j)
      M[i * M_Cols + j] = value;
}

int verifyResult(float *M, const int M_Rows, const int M_Cols, float checkValue)
{
  for (int i = 0; i < M_Rows; ++i)
    for (int j = 0; j < M_Cols; ++j)
      if (fabs(M[i * M_Cols + j] - checkValue) > 0.0001f) {
        std::cerr << "Error greater than tolerance at " << i << "," << j << ": " << M[i * M_Cols + j] << std::endl;
        return 0;
      }
  return 1;    
}

int main(void)
{
  const int A_Rows = 500;
  const int A_Cols = 250;

  const int B_Rows = 250;
  const int B_Cols = 550;

  float *X;
  float *Y;
  float *Z;

  const int C_Rows = A_Rows;
  const int C_Cols = B_Cols;

  checkCuda(cudaMallocManaged(&X, sizeof(float) * A_Rows * A_Cols));
  checkCuda(cudaMallocManaged(&Y, sizeof(float) * B_Rows * B_Cols));
  checkCuda(cudaMallocManaged(&Z, sizeof(float) * C_Rows * C_Cols));

  initializeMatrix(X, A_Rows, A_Cols, 1.0f);
  initializeMatrix(Y, B_Rows, B_Cols, 2.0f);
  initializeMatrix(Z, C_Rows, C_Cols, 3.0f);

  matrixMultiplication(X, A_Rows, A_Cols, Y, B_Rows, B_Cols, Z);

  

  if(verifyResult(Z, C_Rows, C_Cols, A_Cols * 2.0f) == 1)
  {
    //  for (int i = 0; i < NUM_REPS; i++)
    //     matrixMultiplication(X, A_Rows, A_Cols, Y, B_Rows, B_Cols, Z);
  }
  
  checkCuda(cudaFree(X));
  checkCuda(cudaFree(Y));
  checkCuda(cudaFree(Z));

  checkCuda(cudaDeviceReset());

  return 0;
}

Now the code works when A_Rows == B_Cols (e.g. both 500) but does not work when they differ!

Solved it.

const int ROW = blockIdx.x * blockDim.x + threadIdx.x;
const int COL = blockIdx.y * blockDim.y + threadIdx.y;

But I simply do not understand this, because in CUDA aren't the indices interchanged compared to C, i.e. (0,0), (1,0), (2,0), (3,0), etc.?

That is a problem. You have reversed the sense of x (should be COLS) and y (should be ROWS) when creating your grid.

FWIW there is a matrix multiply example in the CUDA Programming Guide that you can compare against.
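In other words (a sketch based on the 32x32 block size from the code above), the grid's x dimension should cover the columns of C and its y dimension the rows, with the matching index calculation in the kernel:

dim3 threadsPerBlock(32, 32);
dim3 blocksPerGrid((C_Cols + 31) / 32,   // x covers columns
                   (C_Rows + 31) / 32);  // y covers rows

// Inside the kernel, x then gives the column and y the row:
const int ROW = blockIdx.y * blockDim.y + threadIdx.y;
const int COL = blockIdx.x * blockDim.x + threadIdx.x;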

Not sure what you mean. CUDA is a language in the C++ family, and the row-major storage convention applies to multi-dimensional arrays in this language family, same as in C.

That does not mean one cannot use these languages to write software that follows the column-major storage convention. CUBLAS, for example, follows a column-major storage convention to guarantee interoperability with other BLAS implementations.
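As a concrete illustration of the two conventions (hypothetical helpers, not code from the thread), element (i, j) of an M x N matrix stored in a flat array is addressed as follows:

// Hypothetical helpers for illustration only.
inline float rowMajorAt(const float *A, int N, int i, int j) { return A[i * N + j]; } // C/C++ row-major, as in the kernel above
inline float colMajorAt(const float *A, int M, int i, int j) { return A[j * M + i]; } // Fortran/BLAS/CUBLAS column-major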

As I understood it, CUDA's thread and block enumeration works like this.
For example, with blocksPerGrid.x = 2 and blocksPerGrid.y = 3, the blocks are enumerated in (X,Y) as

(0,0) (1,0) (2,0)
(0,1) (1,1) (2,1)

So x provides the column index (dim = 3) and y provides the row index (dim = 2). The same enumeration applies to threads within a block. This seems to be the opposite of C's row-major order.
It's a bit confusing for me…
With blocksPerGrid.x == 2 and blocksPerGrid.y == 3, does blockIdx.x then iterate over 3 values and blockIdx.y over 2 values within the kernel?
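One way to see what actually happens (a small sketch for experimentation, not from the thread) is to print the built-in indices from a tiny launch; with <<<dim3(2, 3), 1>>>, blockIdx.x only ever takes the values 0 and 1 while blockIdx.y takes 0, 1 and 2, i.e. nothing is swapped:

__global__ void printBlockIdx()
{
  // One thread per block: print this block's coordinates and the grid shape.
  printf("blockIdx = (%d, %d) of gridDim = (%d, %d)\n",
         blockIdx.x, blockIdx.y, gridDim.x, gridDim.y);
}

// Usage:
//   printBlockIdx<<<dim3(2, 3), 1>>>();
//   checkCuda(cudaDeviceSynchronize());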

From programming guide:

[image: excerpt from the CUDA Programming Guide]

Thanks. So in the example shown in the Programming Guide, the indices are interchanged in both places. I did the same later.

I mean:

// Invoke kernel
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

So Grid.x is set from the width (columns) and Grid.y from the height (rows).

And within the kernel, too, x and y are interchanged:

// Each thread computes one element of C
// by accumulating results into Cvalue
float Cvalue = 0;
int row = blockIdx.y * blockDim.y + threadIdx.y; // Rows use y and
int col = blockIdx.x * blockDim.x + threadIdx.x; // Columns use x

The thread block enumeration is a hardware thing. The storage convention for multi-dimensional arrays is a software thing. They are orthogonal mechanisms.

CUDA allows for arbitrary mappings between threads and thread blocks and data. Some mappings may be more conducive to performance than others.
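For example (a sketch, not the thread's code; matMul1D is a made-up name), the same one-element-per-thread computation could just as well use a purely 1D launch, with each thread deriving its row and column from a flat index; consecutive threads still touch consecutive columns of C, so the stores stay coalesced:

__global__ void matMul1D(const float *A, const float *B, float *C,
                         const int A_Cols, const int C_Rows, const int C_Cols)
{
  const int idx = blockIdx.x * blockDim.x + threadIdx.x; // flat index into C
  if (idx < C_Rows * C_Cols)
  {
    const int row = idx / C_Cols;
    const int col = idx % C_Cols;
    float sum = 0.0f;
    for (int i = 0; i < A_Cols; i++)
      sum += A[row * A_Cols + i] * B[i * C_Cols + col];
    C[idx] = sum;
  }
}

// Possible launch:
//   const int n = C_Rows * C_Cols;
//   matMul1D<<<(n + 255) / 256, 256>>>(A, B, C, A_Cols, C_Rows, C_Cols);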

I understand this can be a bit unfamiliar and confusing in the beginning, and as one of the first five people to program in CUDA I assure you I struggled with this myself, especially since I have relatively weak spatial reasoning. I drew a lot of diagrams in those early days.


For sure. Thanks. I got my head around this by drawing diagrams, as suggested.

In summary:

  1. Column-wise
    blocksPerGrid.x = (C_Rows + 31) / 32; // Use Rows for the grid's X dimension
    blocksPerGrid.y = (C_Cols + 31) / 32; // Use Cols for the grid's Y dimension
    // In the kernel, x maps to ROW and y maps to COL
    const int ROW = blockIdx.x * blockDim.x + threadIdx.x;
    const int COL = blockIdx.y * blockDim.y + threadIdx.y;

This results in a column-wise iteration over the C matrix (poor performance), effectively "transposing" the CUDA mapping relative to the C row-major layout.

  2. Row-wise

    blocksPerGrid.y = (C_Rows + 31) / 32; // Use Rows for the grid's Y dimension
    blocksPerGrid.x = (C_Cols + 31) / 32; // Use Cols for the grid's X dimension
    // In the kernel, x maps to COL and y maps to ROW
    const int COL = blockIdx.x * blockDim.x + threadIdx.x;
    const int ROW = blockIdx.y * blockDim.y + threadIdx.y;

Reversing both fully gives a 1:1 mapping to the C layout and results in a row-wise iteration over the C matrix (good performance due to global memory coalescing).

FWIW, I made a copy of what I wrote in a document. Attaching a screenshot since I can't upload an .ods file.

Also, one more thing: I think CUDA has this indexing approach either for hardware performance reasons or for compatibility with FORTRAN (lots of HPC code).

Performance results

Case 1: Column-wise computation time for 1000x1000 matrices

Average time: 1.16 s

rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof --metrics gst_efficiency,gld_efficiency,gld_throughput,gst_throughput ./matrix_mul_gen_cuda
==16289== NVPROF is profiling process 16289, command: ./matrix_mul_gen_cuda
M_C[1000 x 1000]. grid config = (32,32) for block config = (32x32)
...
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "NVIDIA Tegra X1 (0)"
    Kernel: matrixMultiplicationKernel(float const *, int, int, float const *, int, int, float*, int, int)
          1                            gst_efficiency            Global Memory Store Efficiency      12.50%      12.50%      12.50%
          1                            gld_efficiency             Global Memory Load Efficiency      12.50%      12.50%      12.50%
          1                            gld_throughput                    Global Load Throughput  23.819GB/s  23.819GB/s  23.819GB/s
          1                            gst_throughput                   Global Store Throughput  23.634MB/s  23.634MB/s  23.634MB/s

Case 2: Row-wise computation time for 1000x1000 matrices

Average time: 0.164 ms

rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof --metrics gst_efficiency,gld_efficiency,gld_throughput,gst_throughput ./matrix_mul_gen_cuda
==16521== NVPROF is profiling process 16521, command: ./matrix_mul_gen_cuda
M_C[1000 x 1000]. grid config = (32,32) for block config = (32x32)
...
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "NVIDIA Tegra X1 (0)"
    Kernel: matrixMultiplicationKernel(float const *, int, int, float const *, int, int, float*, int, int)
          1                            gst_efficiency            Global Memory Store Efficiency     100.00%     100.00%     100.00%
          1                            gld_efficiency             Global Memory Load Efficiency      82.17%      82.17%      82.17%
          1                            gld_throughput                    Global Load Throughput  10.490GB/s  10.490GB/s  10.490GB/s
          1                            gst_throughput                   Global Store Throughput  8.5520MB/s  8.5520MB/s  8.5520MB/s
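For reference, the averages above could be measured with CUDA events around the NUM_REPS loop that is commented out in main(); this is just a minimal sketch, not necessarily how the numbers above were produced:

// Timing NUM_REPS launches with CUDA events (result in milliseconds).
cudaEvent_t start, stop;
checkCuda(cudaEventCreate(&start));
checkCuda(cudaEventCreate(&stop));

checkCuda(cudaEventRecord(start));
for (int i = 0; i < NUM_REPS; i++)
  matrixMultiplication(X, A_Rows, A_Cols, Y, B_Rows, B_Cols, Z);
checkCuda(cudaEventRecord(stop));
checkCuda(cudaEventSynchronize(stop));

float ms = 0.0f;
checkCuda(cudaEventElapsedTime(&ms, start, stop));
printf("Average time: %f ms\n", ms / NUM_REPS);

checkCuda(cudaEventDestroy(start));
checkCuda(cudaEventDestroy(stop));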

When I checked out the matrix multiplication in CUBLAS, I saw that the execution configuration is quite different from what I expected.
The internal kernel is sgemm_128x128x8_NN_vec:

2.96980s 58.708ms Grid Size = (4 4 4) Block Size = (256 1 1) 128 16.516KB

It seems to all depend on the algorithm, which uses whatever execution configuration it needs to get the result…
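FWIW, if anyone wants to compare against CUBLAS while keeping the row-major layout used above, a common trick (a sketch only; matMulCublas is a made-up wrapper) is to ask cublasSgemm for C^T = B^T * A^T, since CUBLAS assumes column-major storage; in row-major terms that is exactly C = A * B:

#include <cublas_v2.h>

// A is MxK, B is KxN, C is MxN, all stored row-major. Link with -lcublas.
void matMulCublas(cublasHandle_t handle,
                  const float *A, const float *B, float *C,
                  int M, int K, int N)
{
  const float alpha = 1.0f;
  const float beta  = 0.0f;
  // Column-major view: compute C^T (N x M) = B^T (N x K) * A^T (K x M).
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              N, M, K,
              &alpha,
              B, N,   // first operand  = B^T, leading dimension N
              A, K,   // second operand = A^T, leading dimension K
              &beta,
              C, N);  // result         = C^T, leading dimension N
}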