Subtle problem with matrix-matrix multiplication

Hi all,

I am writing a general matrix-matrix multiplication kernel and have run into a subtle problem.
I have checked the code many times and it seems correct to me.

[A_Rows,A_Cols] x [B_Rows,B_Cols] = [C_Rows,C_Cols]
So I know that for matrix multiplication A_Cols == B_Rows, and the result has C_Rows = A_Rows and C_Cols = B_Cols.
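For example, with the sizes used in the code below, [500,250] x [250,550] gives a [500,550] result.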

Here is my code

#include <iostream>
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

const int NUM_REPS = 100;

using namespace std;

// Convenience function for checking CUDA runtime API results
// can be wrapped around any runtime API call. No-op in release builds.
inline cudaError_t checkCuda(cudaError_t result)
{
  if (result != cudaSuccess)
  {
    fprintf(stderr, "File %s, Line %d: CUDA Runtime Error: %s\n", __FILE__, __LINE__, cudaGetErrorString(result));
    exit(EXIT_FAILURE);
  }
  return result;
}
__global__ void matrixMultiplicationKernel(const float *__restrict__ A, const int A_Rows, const int A_Cols, const float *__restrict__ B, const int B_Rows, const int B_Cols, float *__restrict__ C, const int C_Rows, const int C_Cols)
{
  const int ROW = blockIdx.y * blockDim.y + threadIdx.y;
  const int COL = blockIdx.x * blockDim.x + threadIdx.x;

  float tmpSum = 0;

  //printf("(%d * %d) + %d = X:%d, (%d * %d) + %d = Y:%d\n",  blockIdx.y, blockDim.y, threadIdx.y, ROW,blockIdx.x, blockDim.x, threadIdx.x, COL );

  //printf("(%d,%d)\n", ROW, COL);

  if (ROW < C_Rows && COL < C_Cols)
  {
    // each thread computes one element of the block sub-matrix
    for (int i = 0; i < A_Cols; i++)
    {
      tmpSum += A[ROW * A_Cols + i] * B[i * B_Cols + COL];
    }
    C[ROW * C_Cols + COL] = tmpSum;
  }
}

void matrixMultiplication(const float *__restrict__ A, const int A_Rows, const int A_Cols, const float *__restrict__ B, const int B_Rows, const int B_Cols, float *__restrict__ C)
{
  // Declare the number of blocks per grid and the number of threads per block.
  // These 1x1 defaults are overwritten below when the matrix dimensions are compatible.
  dim3 threadsPerBlock(1, 1);
  dim3 blocksPerGrid(1, 1);

  if (A_Cols == B_Rows)
  {
    const int R_C_Rows = A_Rows;
    const int R_C_Cols = B_Cols;

    // Maximum number of threads per block is 1024 (32 x 32 = 1024 here)
    threadsPerBlock.x = 32;
    threadsPerBlock.y = 32;

    blocksPerGrid.x = (R_C_Rows + 31) / 32;
    blocksPerGrid.y = (R_C_Cols + 31) / 32;

    printf("M_C[%d x %d]. grid config = (%d,%d) for block config = (32x32)\n",R_C_Rows, R_C_Cols, blocksPerGrid.x, blocksPerGrid.y);

    matrixMultiplicationKernel<<<blocksPerGrid, threadsPerBlock>>>(A, A_Rows, A_Cols, B, B_Rows, B_Cols, C, R_C_Rows, R_C_Cols);
    checkCuda(cudaDeviceSynchronize());
  }
}

void initializeMatrix(float *M, const int M_Rows, const int M_Cols, float value)
{
  for (int i = 0; i < M_Rows; ++i)
    for (int j = 0; j < M_Cols; ++j)
      M[i * M_Cols + j] = value;
}

int verifyResult(float *M, const int M_Rows, const int M_Cols, float checkValue)
{
  for (int i = 0; i < M_Rows; ++i)
    for (int j = 0; j < M_Cols; ++j)
      if (fabs(M[i * M_Cols + j] - checkValue) > 0.0001f) {
        std::cerr << "Error greater than tolerance at " << i << "," << j << ": " << M[i * M_Cols + j] << std::endl;
        return 0;
      }
  return 1;    
}

int main(void)
{
  const int A_Rows = 500;
  const int A_Cols = 250;

  const int B_Rows = 250;
  const int B_Cols = 550;

  float *X;
  float *Y;
  float *Z;

  const int C_Rows = A_Rows;
  const int C_Cols = B_Cols;

  checkCuda(cudaMallocManaged(&X, sizeof(float) * A_Rows * A_Cols));
  checkCuda(cudaMallocManaged(&Y, sizeof(float) * B_Rows * B_Cols));
  checkCuda(cudaMallocManaged(&Z, sizeof(float) * C_Rows * C_Cols));

  initializeMatrix(X, A_Rows, A_Cols, 1.0f);
  initializeMatrix(Y, B_Rows, B_Cols, 2.0f);
  initializeMatrix(Z, C_Rows, C_Cols, 3.0f);

  matrixMultiplication(X, A_Rows, A_Cols, Y, B_Rows, B_Cols, Z);

  

  if(verifyResult(Z, C_Rows, C_Cols, A_Cols * 2.0f) == 1)
  {
    //  for (int i = 0; i < NUM_REPS; i++)
    //     matrixMultiplication(X, A_Rows, A_Cols, Y, B_Rows, B_Cols, Z);
  }
  
  checkCuda(cudaFree(X));
  checkCuda(cudaFree(Y));
  checkCuda(cudaFree(Z));

  checkCuda(cudaDeviceReset());

  return 0;
}

Now the code works when A_Rows == B_Cols (e.g. both 500) but does not work when they differ!

Solved it.

const int ROW = blockIdx.x * blockDim.x + threadIdx.x;
const int COL = blockIdx.y * blockDim.y + threadIdx.y;

But I simply do not understand this, because in CUDA aren't the indices interchanged compared to C, i.e. (0,0), (1,0), (2,0), (3,0), etc.?

That is a problem. You have reversed the sense of x (should be COLS) and y (should be ROWS) when creating your grid.

FWIW there is a matrix multiply example in the CUDA Programming Guide that you can compare against.
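In other words (a sketch based on the 32x32 block size from the code above), the grid's x dimension should cover the columns of C and its y dimension the rows, with the matching index calculation in the kernel:

dim3 threadsPerBlock(32, 32);
dim3 blocksPerGrid((C_Cols + 31) / 32,   // x covers columns
                   (C_Rows + 31) / 32);  // y covers rows

// Inside the kernel, x then gives the column and y the row:
const int ROW = blockIdx.y * blockDim.y + threadIdx.y;
const int COL = blockIdx.x * blockDim.x + threadIdx.x;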

Not sure what you mean. CUDA is a language in the C++ family, and the row-major storage convention applies to multi-dimensional arrays in this language family, same as in C.

That does not mean one cannot use these languages to write software that follows the column-major storage convention. CUBLAS, for example, follows a column-major storage convention to guarantee interoperability with other BLAS implementations.
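As a concrete illustration of the two conventions (hypothetical helpers, not code from the thread), element (i, j) of an M x N matrix stored in a flat array is addressed as follows:

// Hypothetical helpers for illustration only.
inline float rowMajorAt(const float *A, int N, int i, int j) { return A[i * N + j]; } // C/C++ row-major, as in the kernel above
inline float colMajorAt(const float *A, int M, int i, int j) { return A[j * M + i]; } // Fortran/BLAS/CUBLAS column-major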

As I understood it, CUDA's thread and block enumeration works like this.
For example, with blocksPerGrid.x = 2 and blocksPerGrid.y = 3, the blocks are enumerated in (X,Y) as

(0,0) (1,0) (2,0)
(0,1) (1,1) (2,1)

So x provides the column index (dim = 3) and y provides the row index (dim = 2). The same enumeration applies to threads within a block. This seems to be the opposite of C's row-major order.
It's a bit confusing for me…
With blocksPerGrid.x == 2 and blocksPerGrid.y == 3, does blockIdx.x then iterate over 3 values and blockIdx.y over 2 values within the kernel?
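One way to see what actually happens (a small sketch for experimentation, not from the thread) is to print the built-in indices from a tiny launch; with <<<dim3(2, 3), 1>>>, blockIdx.x only ever takes the values 0 and 1 while blockIdx.y takes 0, 1 and 2, i.e. nothing is swapped:

__global__ void printBlockIdx()
{
  // One thread per block: print this block's coordinates and the grid shape.
  printf("blockIdx = (%d, %d) of gridDim = (%d, %d)\n",
         blockIdx.x, blockIdx.y, gridDim.x, gridDim.y);
}

// Usage:
//   printBlockIdx<<<dim3(2, 3), 1>>>();
//   checkCuda(cudaDeviceSynchronize());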

From programming guide:

[image: excerpt from the CUDA Programming Guide]

Thanks. So in the example shown in the Programming Guide, the indices are interchanged in both places. I did the same later.

I mean:

// Invoke kernel
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

So Grid.x is set from the width (columns) and Grid.y from the height (rows).

And within the kernel, too, x and y are interchanged:

// Each thread computes one element of C
// by accumulating results into Cvalue
float Cvalue = 0;
int row = blockIdx.y * blockDim.y + threadIdx.y; // Rows use y and
int col = blockIdx.x * blockDim.x + threadIdx.x; // Columns use x

The thread block enumeration is a hardware thing. The storage convention for multi-dimensional arrays is a software thing. They are orthogonal mechanisms.

CUDA allows for arbitrary mappings between threads and thread blocks and data. Some mappings may be more conducive to performance than others.
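For example (a sketch, not the thread's code; matMul1D is a made-up name), the same one-element-per-thread computation could just as well use a purely 1D launch, with each thread deriving its row and column from a flat index; consecutive threads still touch consecutive columns of C, so the stores stay coalesced:

__global__ void matMul1D(const float *A, const float *B, float *C,
                         const int A_Cols, const int C_Rows, const int C_Cols)
{
  const int idx = blockIdx.x * blockDim.x + threadIdx.x; // flat index into C
  if (idx < C_Rows * C_Cols)
  {
    const int row = idx / C_Cols;
    const int col = idx % C_Cols;
    float sum = 0.0f;
    for (int i = 0; i < A_Cols; i++)
      sum += A[row * A_Cols + i] * B[i * C_Cols + col];
    C[idx] = sum;
  }
}

// Possible launch:
//   const int n = C_Rows * C_Cols;
//   matMul1D<<<(n + 255) / 256, 256>>>(A, B, C, A_Cols, C_Rows, C_Cols);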

I understand this can be a bit unfamiliar and confusing in the beginning, and as one of the first five people to program in CUDA I assure you I struggled with this myself, especially since I have relatively weak spatial reasoning. I drew a lot of diagrams in those early days.


For sure. Thanks. I got my head around this by drawing diagrams, as suggested.

In summary:

  1. Column-wise
    blocksPerGrid.x = (C_Rows + 31) / 32; // Use Rows for the grid's X dimension
    blocksPerGrid.y = (C_Cols + 31) / 32; // Use Cols for the grid's Y dimension
    // In the kernel, x maps to ROW and y maps to COL
    const int ROW = blockIdx.x * blockDim.x + threadIdx.x;
    const int COL = blockIdx.y * blockDim.y + threadIdx.y;

This results in a column-wise iteration over the C matrix (poor performance), effectively "transposing" the CUDA mapping relative to the C row-major layout.

  2. Row-wise

    blocksPerGrid.y = (C_Rows + 31) / 32; // Use Rows for the grid's Y dimension
    blocksPerGrid.x = (C_Cols + 31) / 32; // Use Cols for the grid's X dimension
    // In the kernel, x maps to COL and y maps to ROW
    const int COL = blockIdx.x * blockDim.x + threadIdx.x;
    const int ROW = blockIdx.y * blockDim.y + threadIdx.y;

Reversing both fully gives a 1:1 mapping to the C layout and results in a row-wise iteration over the C matrix (good performance due to global memory coalescing).

FWIW, I made a copy of what I wrote in a document. Attaching a screenshot since I can't upload an .ods file.

Also, one more thing: I think CUDA has this indexing approach either for hardware performance reasons or for compatibility with FORTRAN (lots of HPC code).

Performance results

Case 1: Column-wise computation time for 1000x1000 matrices

Average time: 1.16 s

rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof --metrics gst_efficiency,gld_efficiency,gld_throughput,gst_throughput ./matrix_mul_gen_cuda
==16289== NVPROF is profiling process 16289, command: ./matrix_mul_gen_cuda
M_C[1000 x 1000]. grid config = (32,32) for block config = (32x32)
...
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "NVIDIA Tegra X1 (0)"
    Kernel: matrixMultiplicationKernel(float const *, int, int, float const *, int, int, float*, int, int)
          1                            gst_efficiency            Global Memory Store Efficiency      12.50%      12.50%      12.50%
          1                            gld_efficiency             Global Memory Load Efficiency      12.50%      12.50%      12.50%
          1                            gld_throughput                    Global Load Throughput  23.819GB/s  23.819GB/s  23.819GB/s
          1                            gst_throughput                   Global Store Throughput  23.634MB/s  23.634MB/s  23.634MB/s

Case 2: Row-wise computation time for 1000x1000 matrices

Average time: 0.164 ms

rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof --metrics gst_efficiency,gld_efficiency,gld_throughput,gst_throughput ./matrix_mul_gen_cuda
==16521== NVPROF is profiling process 16521, command: ./matrix_mul_gen_cuda
M_C[1000 x 1000]. grid config = (32,32) for block config = (32x32)
...
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "NVIDIA Tegra X1 (0)"
    Kernel: matrixMultiplicationKernel(float const *, int, int, float const *, int, int, float*, int, int)
          1                            gst_efficiency            Global Memory Store Efficiency     100.00%     100.00%     100.00%
          1                            gld_efficiency             Global Memory Load Efficiency      82.17%      82.17%      82.17%
          1                            gld_throughput                    Global Load Throughput  10.490GB/s  10.490GB/s  10.490GB/s
          1                            gst_throughput                   Global Store Throughput  8.5520MB/s  8.5520MB/s  8.5520MB/s
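For reference, the averages above could be measured with CUDA events around the NUM_REPS loop that is commented out in main(); this is just a minimal sketch, not necessarily how the numbers above were produced:

// Timing NUM_REPS launches with CUDA events (result in milliseconds).
cudaEvent_t start, stop;
checkCuda(cudaEventCreate(&start));
checkCuda(cudaEventCreate(&stop));

checkCuda(cudaEventRecord(start));
for (int i = 0; i < NUM_REPS; i++)
  matrixMultiplication(X, A_Rows, A_Cols, Y, B_Rows, B_Cols, Z);
checkCuda(cudaEventRecord(stop));
checkCuda(cudaEventSynchronize(stop));

float ms = 0.0f;
checkCuda(cudaEventElapsedTime(&ms, start, stop));
printf("Average time: %f ms\n", ms / NUM_REPS);

checkCuda(cudaEventDestroy(start));
checkCuda(cudaEventDestroy(stop));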

When I checked out the matrix multiplication in CUBLAS, I saw that the execution configuration is quite different from what I expected.
The internal kernel is sgemm_128x128x8_NN_vec:

2.96980s 58.708ms Grid Size = (4 4 4) Block Size = (256 1 1) 128 16.516KB

It seems to all depend on the algorithm, which uses whatever execution configuration it needs to get the result…
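FWIW, if anyone wants to compare against CUBLAS while keeping the row-major layout used above, a common trick (a sketch only; matMulCublas is a made-up wrapper) is to ask cublasSgemm for C^T = B^T * A^T, since CUBLAS assumes column-major storage; in row-major terms that is exactly C = A * B:

#include <cublas_v2.h>

// A is MxK, B is KxN, C is MxN, all stored row-major. Link with -lcublas.
void matMulCublas(cublasHandle_t handle,
                  const float *A, const float *B, float *C,
                  int M, int K, int N)
{
  const float alpha = 1.0f;
  const float beta  = 0.0f;
  // Column-major view: compute C^T (N x M) = B^T (N x K) * A^T (K x M).
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              N, M, K,
              &alpha,
              B, N,   // first operand  = B^T, leading dimension N
              A, K,   // second operand = A^T, leading dimension K
              &beta,
              C, N);  // result         = C^T, leading dimension N
}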