Increasing the shared memory size leads to bank conflicts (from 9 KB of shared memory)

My code shows that once the shared memory size exceeds a threshold (8 KB), Nsight Compute reports bank conflicts.

Quick question: can bank conflicts be influenced by the size of the shared memory allocation?

See the code below:

__global__ 
void loadmatrix(float* mat);


int main()
{
    float* mat;
    int row = 160;
    int col = 160;


    // three-dimensional matrix (x and y are both 160, z is 10)
    // the 10 can be replaced by 7, 8 or 9 for further tests

    cudaMallocManaged(&mat, row*col*10*sizeof(float), cudaMemAttachGlobal);

    // initialize the 3d matrix randomly
    for(int l=0; l<10; ++l)
    {
        for (int i = 0; i < row; i++)
        {
            for(int j=0; j< col; ++j)
            {
                mat[l*row*col + i*col + j] = i*col + j*l;
            }
        }
    }
    
    dim3 numblock(10,10);
    dim3 numthread(16, 16);

    loadmatrix<<<numblock, numthread>>>(mat);

    cudaDeviceSynchronize();

    cudaFree(mat);

}
__global__ 
void loadmatrix(float* mat)
{
    // dimension of the sub-matrix will be 16 x 16 x n (n from 7 - 10 is tested)
    __shared__ float sharedmat[16*16*10];

    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int idy = blockIdx.y*blockDim.y + threadIdx.y;

    int globalid = idy         * 160 + idx;
    int localid  = threadIdx.y * 16  + threadIdx.x;

    // read from global to shared
    // unroll manually for n = 10

    sharedmat[          localid] = mat[            globalid];
    sharedmat[  16*16 + localid] = mat[  160*160 + globalid];
    sharedmat[2*16*16 + localid] = mat[2*160*160 + globalid];
    sharedmat[3*16*16 + localid] = mat[3*160*160 + globalid];
    sharedmat[4*16*16 + localid] = mat[4*160*160 + globalid];
    sharedmat[5*16*16 + localid] = mat[5*160*160 + globalid];
    sharedmat[6*16*16 + localid] = mat[6*160*160 + globalid];
    sharedmat[7*16*16 + localid] = mat[7*160*160 + globalid];
    sharedmat[8*16*16 + localid] = mat[8*160*160 + globalid];
    sharedmat[9*16*16 + localid] = mat[9*160*160 + globalid];

    __syncthreads();

    // Each element reads its left and right neighbours and sums them with scale parameters

    int leftx  = (threadIdx.x == 0  ? 0  : threadIdx.x - 1);
    int rightx = (threadIdx.x == 15 ? 15 : threadIdx.x + 1);

    int localidxp = threadIdx.y*16 + leftx;
    int localidxm = threadIdx.y*16 + rightx;

    // Write the results directly back to global memory
    // Unrolled manually for n = 10

    mat[            globalid] = sharedmat[          localid] + 0.8f * sharedmat[          localidxp]
                                                              + 0.2f * sharedmat[          localidxm];
    mat[1*160*160 + globalid] = sharedmat[1*16*16 + localid] + 0.8f * sharedmat[1*16*16 + localidxp]
                                                              + 0.2f * sharedmat[1*16*16 + localidxm];
    mat[2*160*160 + globalid] = sharedmat[2*16*16 + localid] + 0.8f * sharedmat[2*16*16 + localidxp]
                                                              + 0.2f * sharedmat[2*16*16 + localidxm];
    mat[3*160*160 + globalid] = sharedmat[3*16*16 + localid] + 0.8f * sharedmat[3*16*16 + localidxp]
                                                              + 0.2f * sharedmat[3*16*16 + localidxm];
    mat[4*160*160 + globalid] = sharedmat[4*16*16 + localid] + 0.8f * sharedmat[4*16*16 + localidxp]
                                                              + 0.2f * sharedmat[4*16*16 + localidxm];
    mat[5*160*160 + globalid] = sharedmat[5*16*16 + localid] + 0.8f * sharedmat[5*16*16 + localidxp]
                                                              + 0.2f * sharedmat[5*16*16 + localidxm];
    mat[6*160*160 + globalid] = sharedmat[6*16*16 + localid] + 0.8f * sharedmat[6*16*16 + localidxp]
                                                              + 0.2f * sharedmat[6*16*16 + localidxm];
    mat[7*160*160 + globalid] = sharedmat[7*16*16 + localid] + 0.8f * sharedmat[7*16*16 + localidxp]
                                                              + 0.2f * sharedmat[7*16*16 + localidxm];
    mat[8*160*160 + globalid] = sharedmat[8*16*16 + localid] + 0.8f * sharedmat[8*16*16 + localidxp]
                                                              + 0.2f * sharedmat[8*16*16 + localidxm];
    mat[9*160*160 + globalid] = sharedmat[9*16*16 + localid] + 0.8f * sharedmat[9*16*16 + localidxp]
                                                              + 0.2f * sharedmat[9*16*16 + localidxm];

}
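As a sanity check on the indexing itself (my own side calculation, not part of the test above): with 32 banks of 4 bytes each, which should be the configuration on sm_75, the bank of a float element is simply its word index modulo 32. The following standalone sketch derives the bank touched by each lane of a warp for the load sharedmat[layer*16*16 + localid]:

#include <cstdio>
#include <set>

// Standalone sketch (illustration only): compute which shared memory bank each
// lane of one warp touches for the access sharedmat[layer*16*16 + localid],
// assuming 32 banks with a 4-byte bank width.
int main()
{
    const int banks = 32;
    for (int layer = 0; layer < 10; ++layer)
    {
        std::set<int> used;
        for (int lane = 0; lane < 32; ++lane)
        {
            int ty = lane / 16;                                // one warp spans two rows of the 16x16 tile
            int tx = lane % 16;
            int localid = ty * 16 + tx;                        // same index as in the kernel
            int byteOffset = (layer * 16 * 16 + localid) * 4;  // float = 4 bytes
            used.insert((byteOffset / 4) % banks);             // bank = word index mod 32
        }
        printf("layer %2d: %zu distinct banks for 32 lanes\n", layer, used.size());
    }
    return 0;
}

It reports 32 distinct banks for every layer, so this access pattern by itself should be conflict-free no matter how large sharedmat is declared, which is why the reported conflicts surprise me.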

Test environment

Ubuntu 18, sm_75 (if I understand correctly, the shared memory limit is 48 KB per block, with 32 banks)

nvcc -lineinfo -rdc=true -o matrixbank bank.cu
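The reports were collected with Nsight Compute; a command along these lines should reproduce them (the output name is my own choice, and --set full simply collects all sections, including Memory Workload Analysis and Source Counters):

ncu --set full -o matrixbank_report ./matrixbank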

Memory

When the vertical (z) dimension n is 7 or 8, Nsight Compute's Memory Workload Analysis reports 0 bank conflicts.

n    Instructions   Wavefronts   % Peak   Bank conflicts    L1 hit rate (%)   L2 hit rate (%)
7    22400          24100        14.08    0                 40.21             49.66
8    25600          27300        13.85    0                 37.34             49.82
9    28800          31064        11.53    377/162/69/608    18.59             49.85
10   32000          34568        12.73    663/314/8/985     15.68             49.72

Note: the bank conflicts column lists Shared Load / Shared Store / Other / Total.

Stall state

For n = 10, the Source Counters report the highest stall_long_sb values on two lines:

sharedmat[          localid] = mat[            globalid];    stall_long_sb = 17
sharedmat[3*16*16 + localid] = mat[3*160*160 + globalid];    stall_long_sb = 20

For n = 9:

sharedmat[          localid] = mat[            globalid];    stall_long_sb = 29
sharedmat[3*16*16 + localid] = mat[3*160*160 + globalid];    stall_long_sb = 11

For n = 8:

sharedmat[          localid] = mat[            globalid];    stall_long_sb = 18
sharedmat[3*16*16 + localid] = mat[3*160*160 + globalid];    stall_long_sb = 30

For n = 7:

sharedmat[          localid] = mat[            globalid];    stall_long_sb = 18
sharedmat[3*16*16 + localid] = mat[3*160*160 + globalid];    stall_long_sb = 30

The stall_mio values are much smaller than stall_long_sb, ranging from 1 to 4. (As far as I understand, stall_long_sb indicates warps waiting on data from L1TEX operations such as the global loads, while shared-memory dependencies would rather show up as short scoreboard or MIO stalls.)

Instruction statistics

The Instruction Statistics section reports the ordering

LDS > FFMA > STS = STG = LDG.

FFMA is about 66.7 % of LDS, while STS, STG and LDG are each about 33.3 % of LDS. This matches the kernel: per z-layer, each thread issues 1 LDG and 1 STS for the copy into shared memory, then 3 LDS, 2 FFMA and 1 STG for the stencil, i.e. a 3 : 2 : 1 : 1 : 1 ratio.

My conclusion

I believe the lower L1 cache hit rate may cause higher (long) scoreboard stalls, and the bank conflicts then appear.

The question is:

since n = 7 and n = 8 show the same level of bad stall behaviour as n = 9 and n = 10, why do the bank conflicts disappear? Is this related to the shared memory size per block (8 KB vs. 9 KB)?
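(For reference, the shared memory per block follows directly from the tile size: 16 * 16 * n * sizeof(float) = 1024 * n bytes = n KB, i.e. 7 KB, 8 KB, 9 KB and 10 KB for n = 7 to 10, which is where the 8 KB vs. 9 KB threshold above comes from.)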

My card has 48 KB of shared memory per block and 64 KB per multiprocessor, and it has 16 SMs.

There are 8 warps per block. Maybe the current configuration exceeds some limit of the warp scheduler and degrades the SM's performance (just a guess).
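(A rough occupancy calculation, assuming the sm_75 limit of 1024 threads / 32 warps per SM: at most 4 blocks of 256 threads fit per SM, which needs 4 * n KB of shared memory, i.e. 28 to 40 KB for n = 7 to 10, well below the 64 KB per multiprocessor, so shared memory capacity alone should not be reducing occupancy here.)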

Thank you very much in advance for any corrections or suggestions!

Thanks for submitting this. There is a lot to unpack here. In order to get a better idea of where these bank conflicts are coming from, can you check the metrics ‘L1 Wavefronts Shared Excessive’ or ‘L1 Conflicts Shared N-Way’ from the Source Page when the report is collected with SourceCounters enabled?

I noticed that my previous Nsight Compute version was quite old (2020.xx), so I installed Nsight Compute 2023.1.0.

Moreover, here are the L1 Conflicts Shared N-Way results.

The Source Counters section provided no data for L1 Wavefronts Shared Excessive.

Assume Part 1 as: (first attached screenshot)

and Part 2 as: (second attached screenshot)

New profiling data

n    L1 Conflicts Shared N-Way (%)   L1 Conflicts Shared N-Way (value)   Bank conflicts         L1 hit rate (%)   L2 hit rate (%)
7    3.57 / 10.71                    1 / 3                               0                      40.24             49.91
8    3.13 / 9.38                     1 / 3                               0                      37.63             50.05
9    2.78 / 8.33                     1 / 3                               309/289/0/105 → 703    18.13             50.14
10   2.50 / 7.50                     1 / 3                               561/208/0/91 → 860     15.68             49.90

Note that all source lines report the same L1 Conflicts Shared N-Way.
Entries such as 3.57 / 10.71 or 1 / 3 denote the values for Part 1 / Part 2.

Part 1 reports clearly higher stall_long_sb and stall_mio than Part 2 (roughly 20 vs. 2), for both n = 10 and n = 7.

The engineering team is asking for the reports to try and triage this. Are you able to attach the collected Nsight Compute reports?

Thank you for your reply!

The binary reports are attached; see n_7.ncu-rep and n_10.ncu-rep.

Please contact me if you cannot open these files.

n_10.ncu-rep (154.8 KB)
n_7.ncu-rep (147.4 KB)

Thanks. I have filed a ticket with the engineering team to investigate this. I will let you know when I have more information. It may take some time, as the investigation has to be scheduled in parallel with the other development work.