Can you please explain these formulas?

What do the following formulas stand for in the given CUDA kernel?

A[row*N + (tile*TILE_DIM + threadIdx.x)];
B[(tile*TILE_DIM + threadIdx.y)*N + col];

Can you simplify them to make them more understandable?

As far as I understand,

mat_element_global_x = tile*TILE_DIM + threadIdx.x
mat_element_global_y = tile*TILE_DIM + threadIdx.y

So, the formula becomes:

A[row*N + mat_element_global_x];
B[mat_element_global_y*N + col];

We know that if a matrix is in row-major order, then the linear index of an element would be

linear_index = mat_element_global_y * COL_WIDTH + mat_element_global_x
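
For example, with N = 4 the element at (mat_element_global_y, mat_element_global_x) = (1, 2) has linear index 1*4 + 2 = 6 in row-major order.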

That means A is a row-major matrix, but B is a column-major matrix.

Is my analysis correct?


Kernel source code:

#include "common.h" 
#include "timer.h"

#define TILE_DIM 32

__global__ void mm_tiled_kernel (float* A, float* B, float* C, unsigned int N) 
{
    unsigned int row = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int col = blockIdx.x*blockDim.x + threadIdx.x;

    __shared__ float A S[TILE_DIM] [TILE_DIM]; 
    __shared__ float B_S[TILE_DIM] [TILE_DIM];

    float sum = 0.0f;

    for (unsigned int tile = 0; tile < N/TILE_DIM; ++ tile) 
    {
        A_s [threadIdx.y] [threadIdx.x] = A[row*N+tile*TILE_DIM + threadIdx.x];
        B_s[threadIdx.y] [threadIdx.x] = B[(tile*TILE_DIM + threadIdx.y) *N + col];
        _syncthreads();

        for (unsigned int i = 0; i < TILE_DIM; ++i) 
        {
            sum += As[threadIdx.y] [i]*B_s[i] [threadIdx.x];
        }
        _syncthreads();
    }
    C[row*N+ col] = sum;
}

The kernel you have posted isn’t valid CUDA code.

Check the following video at 1:22:00:

Try compiling the code you posted.

No, that’s not correct. In each case (for A and for B) we are loading a square tile into a shared memory array. The loading of this tile does not determine the memory storage order; the tile is simply a “snapshot” or copy of a particular section of the underlying matrix, in the same order. There is some indication of the expected memory storage order in the actual multiplication loop:

The reference into A_s there is effectively selecting a row of A_s (the column varies with the loop index, but the row is constant across the loop), and the reference into B_s is effectively selecting a column of B_s. So we can expect that this is a normal row × column vector dot product, which is consistent with both matrices A and B being stored in row-major order.
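
To make that concrete, here is the multiplication loop again with the role of each index annotated (same names as in the kernel):

for (unsigned int i = 0; i < TILE_DIM; ++i)
{
    // A_s[threadIdx.y][i]: row fixed at threadIdx.y, column i varies -> walks a row of A_s
    // B_s[i][threadIdx.x]: column fixed at threadIdx.x, row i varies -> walks a column of B_s
    sum += A_s[threadIdx.y][i]*B_s[i][threadIdx.x];
}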

The reason there is a variation in the indexing when loading A_s and B_s is that the selection of tiles to be loaded moves horizontally across A and vertically down B. This reflects the fact that, to compute a complete vector dot product of one row of A by one column of B, we need the whole row of A and the whole column of B. Therefore the tile for A_s is chosen moving horizontally across A (following the row direction), and the tile for B_s moving vertically down B (following the column direction), as the sketch below illustrates.
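
As an illustration, here is a minimal host-side sketch (assuming N = 4 and TILE_DIM = 2; it is not part of the kernel above) that prints which global elements of A and B thread block (0,0) copies for each value of tile:

#include <cstdio>

int main() {
    const unsigned int N = 4, TILE_DIM = 2;
    const unsigned int by = 0, bx = 0;  // blockIdx.y, blockIdx.x for block (0,0)
    for (unsigned int tile = 0; tile < N/TILE_DIM; ++tile)
        for (unsigned int ty = 0; ty < TILE_DIM; ++ty)
            for (unsigned int tx = 0; tx < TILE_DIM; ++tx) {
                unsigned int row = by*TILE_DIM + ty;  // as computed in the kernel
                unsigned int col = bx*TILE_DIM + tx;
                // A tile: row fixed per thread, column advances with tile (horizontal movement)
                // B tile: row advances with tile (vertical movement), column fixed per thread
                printf("tile %u: A(%u,%u) -> A_s[%u][%u]   B(%u,%u) -> B_s[%u][%u]\n",
                       tile, row, tile*TILE_DIM + tx, ty, tx,
                       tile*TILE_DIM + ty, col, ty, tx);
            }
    return 0;
}

For tile = 0 the A tile covers columns 0-1 of A and the B tile rows 0-1 of B; for tile = 1 they advance to columns 2-3 and rows 2-3, respectively.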

Why does it give incorrect results if I set matrix width = 2 and tile width = 1?

I don’t know. You haven’t provided complete code for me to test, and the code you have provided won’t compile. Good luck!

I get correct results for this case where matrix width is 2 and tile width is 1:

# cat t16.cu
#include <iostream>
const int my_N=2;
const int TILE_DIM=1;
__global__ void mm_tiled_kernel (float* A, float* B, float* C, unsigned int N)
{
    unsigned int row = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int col = blockIdx.x*blockDim.x + threadIdx.x;

    __shared__ float A_s[TILE_DIM][TILE_DIM];
    __shared__ float B_s[TILE_DIM][TILE_DIM];

    float sum = 0.0f;

    for (unsigned int tile = 0; tile < N/TILE_DIM; ++tile)
    {
        A_s[threadIdx.y][threadIdx.x] = A[row*N + tile*TILE_DIM + threadIdx.x];
        B_s[threadIdx.y][threadIdx.x] = B[(tile*TILE_DIM + threadIdx.y)*N + col];
        __syncthreads();

        for (unsigned int i = 0; i < TILE_DIM; ++i)
        {
            sum += A_s[threadIdx.y][i]*B_s[i][threadIdx.x];
        }
        __syncthreads();
    }
    C[row*N + col] = sum;
}


int main(){
  float *A, *B, *C;
  cudaMallocManaged(&A, my_N*my_N*sizeof(A[0]));
  cudaMallocManaged(&B, my_N*my_N*sizeof(A[0]));
  cudaMallocManaged(&C, my_N*my_N*sizeof(A[0]));
  for (int i = 0; i < my_N*my_N; i++) A[i] = B[i] = i+1;
  mm_tiled_kernel<<<dim3(my_N/TILE_DIM,my_N/TILE_DIM), dim3(TILE_DIM,TILE_DIM)>>>(A, B, C, my_N);
  cudaDeviceSynchronize();
  for (int i = 0; i < my_N*my_N; i++) std::cout << C[i] << std::endl;
}
# nvcc -o t16 t16.cu
# ./t16
7
10
15
22
#
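
(As a check on that output: the code initializes A = B = [[1, 2], [3, 4]] in row-major order, and [[1, 2], [3, 4]] × [[1, 2], [3, 4]] = [[1·1+2·3, 1·2+2·4], [3·1+4·3, 3·2+4·4]] = [[7, 10], [15, 22]], matching the four values printed.)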

Thanks a lot!
