# Can you please explain these formulas?

What do the following formulas stand for in the given CUDA kernel?

``````
A[row*N + (tile*TILE_DIM + threadIdx.x)];
B[(tile*TILE_DIM + threadIdx.y)*N + col];
``````

Can you simplify them to make them more understandable?

As far as I understand,

`mat_element_global_x = tile*TILE_DIM + threadIdx.x`
`mat_element_global_y = tile*TILE_DIM + threadIdx.y`

So, the formula becomes:

``````
A[row*N + mat_element_global_x];
B[mat_element_global_y*N + col];
``````

We know that if a matrix is in row-major order, then the linear index of an element would be

``````
linear_index = mat_element_global_y * COL_WIDTH + mat_element_global_x
``````

That means `A` is a row-major matrix, but `B` is a column-major matrix.

Is my analysis correct?

Kernel source code:

``````
#include "common.h"
#include "timer.h"

#define TILE_DIM 32

__global__ void mm_tiled_kernel (float* A, float* B, float* C, unsigned int N)
{
unsigned int row = blockIdx.y*blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x*blockDim.x + threadIdx.x;

__shared__ float A S[TILE_DIM] [TILE_DIM];
__shared__ float B_S[TILE_DIM] [TILE_DIM];

float sum = 0.0f;

for (unsigned int tile = 0; tile < N/TILE_DIM; ++ tile)
{

for (unsigned int i = 0; i < TILE_DIM; ++i)
{
}
}
C[row*N+ col] = sum;
}
``````

The kernel you have posted isn’t valid CUDA code.

Check the following video at the 1:22:00 mark:

Try compiling the code you posted.

No, that’s not correct. In each case (for A and for B) we are loading a square tile into a shared memory array. The loading of this tile does not determine the memory storage order. The tile is simply a “snapshot” or copy of a particular section of the underlying matrix, in the same order. The expected memory storage order is indicated in the actual multiplication loop:

Since the reference into `A_s` there effectively selects a row of `A_s` (the column varies with the loop index, while the row is constant across the loop) and the reference into `B_s` effectively selects a column of `B_s`, we can expect that this is the normal row-times-column vector dot product, and therefore it is consistent with both matrices A and B being stored in row-major order.

The reason the indexing differs between the loading of `A_s` and the loading of `B_s` is that the selection of tiles to be loaded moves horizontally across A and vertically down B. To compute a complete vector dot product of one row of A by one column of B, we need the whole row of A and the whole column of B. Therefore the tiles for `A_s` are chosen horizontally across A (following the row direction), and the tiles for `B_s` vertically down B (following the column direction).

Why does it give incorrect results if I set matrix width = 2 and tile width = 1?

I don’t know. You haven’t provided a complete code for me to test, and the code you have provided won’t compile. Good luck!

I get correct results for this case where matrix width is 2 and tile width is 1:

``````
# cat t16.cu
#include <iostream>
const int my_N=2;
const int TILE_DIM=1;
__global__ void mm_tiled_kernel (float* A, float* B, float* C, unsigned int N)
{
unsigned int row = blockIdx.y*blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x*blockDim.x + threadIdx.x;

__shared__ float A_s[TILE_DIM] [TILE_DIM];
__shared__ float B_s[TILE_DIM] [TILE_DIM];

float sum = 0.0f;

for (unsigned int tile = 0; tile < N/TILE_DIM; ++ tile)
{

for (unsigned int i = 0; i < TILE_DIM; ++i)
{
}
}
C[row*N+ col] = sum;
}

int main(){
float *A, *B, *C;
cudaMallocManaged(&A, my_N*my_N*sizeof(A));
cudaMallocManaged(&B, my_N*my_N*sizeof(A));
cudaMallocManaged(&C, my_N*my_N*sizeof(A));
for (int i = 0; i < my_N*my_N; i++) A[i] = B[i] = i+1;
mm_tiled_kernel<<<dim3(my_N/TILE_DIM,my_N/TILE_DIM), dim3(TILE_DIM,TILE_DIM)>>>(A, B, C, my_N);
for (int i = 0; i < my_N*my_N; i++) std::cout << C[i] << std::endl;
}
# nvcc -o t16 t16.cu
# ./t16
7
10
15
22
#
``````

Thanks a lot!
