Matrix Multiplication Bank conflicts problem

momran · August 30, 2018, 3:38am

I am trying to implement a fast matrix multiplication on GPU using shared memory and I use the common example used in CUDA best practice guidelines which tiles both A&B matrices and load these tiles into shared memory, I am comparing this implementation with CUBLAS and it is much slower than CUBLAS, when profiling the code and looking into its SASS, I can see there are execution dependencies for FFMA instructions (probably due to bank conflicts), and when comparing it to CUBLAS sass, I can see it is resolved in its SASS using .reuse for registers which is register cache I suppose. My question is how can I resolve this problem on C-level and if there is any other techniques to resolve it. Thanks

This is my code

__global__ void gpu_square_matrix_mult(float *d_a, float *d_b, float *d_result, int n) 
{
    __shared__ float tile_a[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float tile_b[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float tmp = 0;
    int idx;
    int sub;
    for ( sub = 0; sub < gridDim.x; ++sub) 
    {
        idx = row * n + sub * BLOCK_SIZE + threadIdx.x;
        if(idx >= n*n)
        {
            tile_a[threadIdx.y][threadIdx.x] = 0;
        }
        else
        {
            tile_a[threadIdx.y][threadIdx.x] = d_a[idx];
        }

        idx = (sub * BLOCK_SIZE + threadIdx.y) * n + col;
        if(idx >= n*n)
        {
            tile_b[threadIdx.y][threadIdx.x] = 0;
        }  
        else
        {
            tile_b[threadIdx.y][threadIdx.x] = d_b[idx];
        }
        __syncthreads();
        
        for (int k = 0; k < BLOCK_SIZE; ++k) 
        {
            tmp += tile_a[threadIdx.y][k] * tile_b[k][threadIdx.x];
        }
        __syncthreads();
    }
    if(row < n && col < n)
    {
        d_result[row * n + col] = tmp;
    }

}

momran · August 30, 2018, 3:40am

@scottgray I have seen your walk through article example of SGEMM on Maxwell Architecture, it is very helpful but unfortunately I am not using Maxwell arch, do you have an idea on how to resolve bank conflicts using Nvidia’s assembler and if the operand collector is taking care of it?

Robert_Crovella · August 30, 2018, 4:01am

It’s hard to imagine what you’ve done with the code you posted. The shared tile variable types are int. That would create lots of weirdness at the SASS level, not to mention being functionally incorrect.

By the way, to make your posted code easier to read, you can select it and then click on the </> button at the top of the edit window. This will put your code in a more nicely formatted box.

momran · August 30, 2018, 4:09am

__global__ void gpu_square_matrix_mult(float *d_a, float *d_b, float *d_result, int n) 
{
__shared__ float tile_a[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float tile_b[BLOCK_SIZE][BLOCK_SIZE];

int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
float tmp = 0;
int idx;
int sub;
for ( sub = 0; sub < N/BLOCK_SIZE; ++sub) {
   idx = row * n + sub * BLOCK_SIZE + threadIdx.x;
   if(idx >= n*n){
      tile_a[threadIdx.y][threadIdx.x] = 0;
   }
   else{
       tile_a[threadIdx.y][threadIdx.x] = d_a[idx];
   }

   idx = (sub * BLOCK_SIZE + threadIdx.y) * n + col;
   if(idx >= n*n){
      tile_b[threadIdx.y][threadIdx.x] = 0;
   } 
   else{
   tile_b[threadIdx.y][threadIdx.x] = d_b[idx];
   }
__syncthreads();

for (int k = 0; k < BLOCK_SIZE; ++k) {
    tmp += tile_a[threadIdx.y][k] * tile_b[k][threadIdx.x];
}
__syncthreads();
}
if(row < n && col < n){
   d_result[row * n + col] = tmp;
}

}

@txbob you are right it should be float, however using int
still works but creates lots of instructions that converts int to float.

Robert_Crovella · August 30, 2018, 4:57am

not sure what “works” means. it would not give the correct answer for most float inputs

It’s also hard to imagine that you witnessed a FFMA instruction, but, whatever. I would have expected an integer multiply, not a floating-point multiply.

Robert_Crovella · August 30, 2018, 5:04am

Hopefully Scott Gray will come by and address your questions. I suspect, however, that if he had found a way to resolve this problem on C-level or any other technique, he would not have resorted to the herculean effort of creating his own assembler and crafting his assembly code.

Also, NVIDIA provides no assembler, that I am aware of.

momran · August 30, 2018, 5:12am

@txbob any techniques you would suggest to resolve bank conflicts?

Robert_Crovella · August 30, 2018, 5:15am

are you talking about shared memory bank conflicts? or are you referring to some aspect of register usage?

momran · August 30, 2018, 5:29am

@txbob Shared memory bank conflicts

Robert_Crovella · August 30, 2018, 7:01am

what are you using for BLOCK_SIZE ?

momran · August 30, 2018, 8:24pm

32 and 16.
But 32 is faster.
And this is my config.
dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE/SIZE);
dim3 dimGrid(n/BLOCK_SIZE, n/BLOCK_SIZE);

momran · August 30, 2018, 8:26pm

@txbob
I also notice that my kernel is as fast as CUBLAS when n is equal or smaller than 256, but for bigger matrices, it is slower.

Robert_Crovella · August 31, 2018, 2:15am

For BLOCK_SIZE of 32, there should be no shared memory bank conflicts.

However this looks wrong to me:

dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE/SIZE);

it should be:

dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);

momran · August 31, 2018, 2:21am

Yes that’s a typo, It is dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);
But I get a lot of execution dependencies for FFMA instructions when I profile it on NVVP as seen in the attachment.

Robert_Crovella · August 31, 2018, 4:17am

an execution dependency and a shared memory bank conflict are quite different concepts

you indicated you were asking about shared memory bank conflicts

if you want to find out whether there are shared memory bank conflicts, the profiler can tell you that

momran · August 31, 2018, 4:34am

So my new question is why do you think this code has execution dependency?

Robert_Crovella · August 31, 2018, 4:45am

[url]https://github.com/NervanaSystems/maxas/wiki/SGEMM[/url]

momran · August 31, 2018, 4:51am

Thank you, that was very helpful

Topic		Replies	Views
Shared memory matrix multiplication not working CUDA Programming and Performance	6	68	October 11, 2024
Matrix Multiplication and Bank conflicts code included CUDA Programming and Performance	3	1354	April 16, 2012
Bank conflicts confusion for tiled matrix multiplication CUDA Programming and Performance	1	272	February 27, 2024
Faster MatrixMult than CUBLAS! CUDA Programming and Performance	4	2801	September 4, 2009
Matrix multiplication ERRORS & few thoughts on CUDA Basic programming errors need correction CUDA Programming and Performance	14	13307	January 24, 2009
Resolve 1D shared memory bank conflict with paddling CUDA Programming and Performance cuda , kernel	9	101	September 1, 2024
Matrix multiplication woes large inner, small outer dimensions CUDA Programming and Performance	21	10061	March 24, 2009
matrix multiplication CUDA Programming and Performance	10	3836	March 7, 2010
How to understand the bank conflict of shared_mem CUDA Programming and Performance	12	10845	January 16, 2025
Matrix-Matrix Multiplication Accuracy and Performance Questions CUDA Programming and Performance	13	6588	April 16, 2007

Matrix Multiplication Bank conflicts problem

Related topics