Matrix Multiplication Bank conflicts problem

I am trying to implement a fast matrix multiplication on GPU using shared memory and I use the common example used in CUDA best practice guidelines which tiles both A&B matrices and load these tiles into shared memory, I am comparing this implementation with CUBLAS and it is much slower than CUBLAS, when profiling the code and looking into its SASS, I can see there are execution dependencies for FFMA instructions (probably due to bank conflicts), and when comparing it to CUBLAS sass, I can see it is resolved in its SASS using .reuse for registers which is register cache I suppose. My question is how can I resolve this problem on C-level and if there is any other techniques to resolve it. Thanks

This is my code

__global__ void gpu_square_matrix_mult(float *d_a, float *d_b, float *d_result, int n) 
{
    __shared__ float tile_a[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float tile_b[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float tmp = 0;
    int idx;
    int sub;
    for ( sub = 0; sub < gridDim.x; ++sub) 
    {
        idx = row * n + sub * BLOCK_SIZE + threadIdx.x;
        if(idx >= n*n)
        {
            tile_a[threadIdx.y][threadIdx.x] = 0;
        }
        else
        {
            tile_a[threadIdx.y][threadIdx.x] = d_a[idx];
        }

        idx = (sub * BLOCK_SIZE + threadIdx.y) * n + col;
        if(idx >= n*n)
        {
            tile_b[threadIdx.y][threadIdx.x] = 0;
        }  
        else
        {
            tile_b[threadIdx.y][threadIdx.x] = d_b[idx];
        }
        __syncthreads();
        
        for (int k = 0; k < BLOCK_SIZE; ++k) 
        {
            tmp += tile_a[threadIdx.y][k] * tile_b[k][threadIdx.x];
        }
        __syncthreads();
    }
    if(row < n && col < n)
    {
        d_result[row * n + col] = tmp;
    }

}

@scottgray I have seen your walk through article example of SGEMM on Maxwell Architecture, it is very helpful but unfortunately I am not using Maxwell arch, do you have an idea on how to resolve bank conflicts using Nvidia’s assembler and if the operand collector is taking care of it?

It’s hard to imagine what you’ve done with the code you posted. The shared tile variable types are int. That would create lots of weirdness at the SASS level, not to mention being functionally incorrect.

By the way, to make your posted code easier to read, you can select it and then click on the </> button at the top of the edit window. This will put your code in a more nicely formatted box.

__global__ void gpu_square_matrix_mult(float *d_a, float *d_b, float *d_result, int n) 
{
__shared__ float tile_a[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float tile_b[BLOCK_SIZE][BLOCK_SIZE];

int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
float tmp = 0;
int idx;
int sub;
for ( sub = 0; sub < N/BLOCK_SIZE; ++sub) {
   idx = row * n + sub * BLOCK_SIZE + threadIdx.x;
   if(idx >= n*n){
      tile_a[threadIdx.y][threadIdx.x] = 0;
   }
   else{
       tile_a[threadIdx.y][threadIdx.x] = d_a[idx];
   }

   idx = (sub * BLOCK_SIZE + threadIdx.y) * n + col;
   if(idx >= n*n){
      tile_b[threadIdx.y][threadIdx.x] = 0;
   } 
   else{
   tile_b[threadIdx.y][threadIdx.x] = d_b[idx];
   }
__syncthreads();

for (int k = 0; k < BLOCK_SIZE; ++k) {
    tmp += tile_a[threadIdx.y][k] * tile_b[k][threadIdx.x];
}
__syncthreads();
}
if(row < n && col < n){
   d_result[row * n + col] = tmp;
}

}

@txbob you are right it should be float, however using int
still works but creates lots of instructions that converts int to float.

not sure what “works” means. it would not give the correct answer for most float inputs

It’s also hard to imagine that you witnessed a FFMA instruction, but, whatever. I would have expected an integer multiply, not a floating-point multiply.

Hopefully Scott Gray will come by and address your questions. I suspect, however, that if he had found a way to resolve this problem on C-level or any other technique, he would not have resorted to the herculean effort of creating his own assembler and crafting his assembly code.

Also, NVIDIA provides no assembler, that I am aware of.

@txbob any techniques you would suggest to resolve bank conflicts?

are you talking about shared memory bank conflicts? or are you referring to some aspect of register usage?

@txbob Shared memory bank conflicts

what are you using for BLOCK_SIZE ?

32 and 16.
But 32 is faster.
And this is my config.
dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE/SIZE);
dim3 dimGrid(n/BLOCK_SIZE, n/BLOCK_SIZE);

@txbob
I also notice that my kernel is as fast as CUBLAS when n is equal or smaller than 256, but for bigger matrices, it is slower.

For BLOCK_SIZE of 32, there should be no shared memory bank conflicts.

However this looks wrong to me:

dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE/SIZE);

it should be:

dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);

Yes that’s a typo, It is dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);
But I get a lot of execution dependencies for FFMA instructions when I profile it on NVVP as seen in the attachment.

an execution dependency and a shared memory bank conflict are quite different concepts

you indicated you were asking about shared memory bank conflicts

if you want to find out whether there are shared memory bank conflicts, the profiler can tell you that

So my new question is why do you think this code has execution dependency?

https://github.com/NervanaSystems/maxas/wiki/SGEMM

Thank you, that was very helpful