Matrix multiplication in CUDA

I am trying to learn CUDA. I started with matrix multiplication on the GPU, with the help of this article. My main problem is that I am unable to understand how to access a 2D array in the kernel, since accessing a 2D array is a bit different from the conventional method (matrix[i][j]). This is the part where I am stuck:

for (int i = 0; i < N; i++) {
    tmpSum += A[ROW * N + i] * B[i * N + COL];
}
C[ROW * N + COL] = tmpSum;

I could understand how ROW and COL are derived:

int ROW = blockIdx.y*blockDim.y+threadIdx.y;
int COL = blockIdx.x*blockDim.x+threadIdx.x;

Any explanation with an example is highly appreciated. Thanks!

What is “the conventional method (matrix[i][j])”? Since CUDA is a subset of C++, all array accesses work just the same as they do in C++.

I suspect what you mean by the “conventional” method is really the low-performance representation of a 2D matrix as a vector of pointers to separately allocated row vectors. That is actually not a conventional method (by historical standards) but a fairly useless modern way of representing a 2D matrix (hint: try performing matrix operations on an arbitrary sub-matrix of a matrix represented in this way).

The conventional way to represent a 2D matrix, going back to the 1950s, is to use a single allocation in which rows, or columns, are stored contiguously, depending on whether row-major storage layout is used (C, C++) or column-major layout (Fortran). Note: CUBLAS uses column-major storage layout for compatibility with other BLAS libraries and ease of interfacing with programs written in Fortran. So if you intend to use CUBLAS later on, you are better off using column-major storage layout from the start.

Your example seems to be using row-major storage, with each row comprising N elements. So row 0 comprises elements 0 … N-1 of the allocation, row 1 comprises elements N … 2N-1, etc. Therefore, row ROW starts at element ROW * N, and the individual columns in that row are at ROW * N + 0, ROW * N + 1, or generally, ROW * N + i for column i.

How data elements are mapped to CUDA threads is entirely up to the programmer. In general, a good default strategy is to assign one thread to each element of the result matrix (or a tile of the result matrix). If the matrix is a 2D matrix, it may be a good idea to use a 2D thread block, since that makes it easier for a programmer to keep track of which thread operates on which element of that matrix or tile.