My code shows that once the size of the shared memory allocation reaches a threshold (8 KB), Nsight Compute reports bank conflicts.
Quick question: can bank conflicts be influenced by the size of the shared memory allocation?
Here is the code:
__global__
void loadmatrix(float* mat);

int main()
{
    float* mat;
    int row = 160;
    int col = 160;
    // three-dimensional matrix (x and y both 160, z = 10;
    // the 10 can be replaced by 7, 8 and 9 for further tests)
    cudaMallocManaged(&mat, row*col*10*sizeof(float), cudaMemAttachGlobal);
    // initialize the 3D matrix with test values
    for (int l = 0; l < 10; ++l)
    {
        for (int i = 0; i < row; ++i)
        {
            for (int j = 0; j < col; ++j)
            {
                mat[l*row*col + i*col + j] = i*col + j*l;
            }
        }
    }
    dim3 numblock(10, 10);
    dim3 numthread(16, 16);
    loadmatrix<<<numblock, numthread>>>(mat);
    cudaDeviceSynchronize();
    cudaFree(mat);
}
__global__
void loadmatrix(float* mat)
{
    // dimension of the sub-matrix is 16 x 16 x n (n from 7 to 10 is tested)
    __shared__ float sharedmat[16*16*10];
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int idy = blockIdx.y*blockDim.y + threadIdx.y;
    int globalid = idy * 160 + idx;
    int localid  = threadIdx.y * 16 + threadIdx.x;
    // read from global to shared, unrolled manually for n = 10
    sharedmat[          localid] = mat[            globalid];
    sharedmat[1*16*16 + localid] = mat[1*160*160 + globalid];
    sharedmat[2*16*16 + localid] = mat[2*160*160 + globalid];
    sharedmat[3*16*16 + localid] = mat[3*160*160 + globalid];
    sharedmat[4*16*16 + localid] = mat[4*160*160 + globalid];
    sharedmat[5*16*16 + localid] = mat[5*160*160 + globalid];
    sharedmat[6*16*16 + localid] = mat[6*160*160 + globalid];
    sharedmat[7*16*16 + localid] = mat[7*160*160 + globalid];
    sharedmat[8*16*16 + localid] = mat[8*160*160 + globalid];
    sharedmat[9*16*16 + localid] = mat[9*160*160 + globalid];
    __syncthreads();
    // each element reads its left and right neighbours and sums them up with scale parameters
    int leftx  = (threadIdx.x == 0  ? 0  : threadIdx.x - 1);
    int rightx = (threadIdx.x == 15 ? 15 : threadIdx.x + 1);
    int localidxp = threadIdx.y*16 + leftx;
    int localidxm = threadIdx.y*16 + rightx;
    // write the results to global memory directly, unrolled manually for n = 10
    mat[            globalid] = sharedmat[          localid] + 0.8f * sharedmat[          localidxp]
                              + 0.2f * sharedmat[          localidxm];
    mat[1*160*160 + globalid] = sharedmat[1*16*16 + localid] + 0.8f * sharedmat[1*16*16 + localidxp]
                              + 0.2f * sharedmat[1*16*16 + localidxm];
    mat[2*160*160 + globalid] = sharedmat[2*16*16 + localid] + 0.8f * sharedmat[2*16*16 + localidxp]
                              + 0.2f * sharedmat[2*16*16 + localidxm];
    mat[3*160*160 + globalid] = sharedmat[3*16*16 + localid] + 0.8f * sharedmat[3*16*16 + localidxp]
                              + 0.2f * sharedmat[3*16*16 + localidxm];
    mat[4*160*160 + globalid] = sharedmat[4*16*16 + localid] + 0.8f * sharedmat[4*16*16 + localidxp]
                              + 0.2f * sharedmat[4*16*16 + localidxm];
    mat[5*160*160 + globalid] = sharedmat[5*16*16 + localid] + 0.8f * sharedmat[5*16*16 + localidxp]
                              + 0.2f * sharedmat[5*16*16 + localidxm];
    mat[6*160*160 + globalid] = sharedmat[6*16*16 + localid] + 0.8f * sharedmat[6*16*16 + localidxp]
                              + 0.2f * sharedmat[6*16*16 + localidxm];
    mat[7*160*160 + globalid] = sharedmat[7*16*16 + localid] + 0.8f * sharedmat[7*16*16 + localidxp]
                              + 0.2f * sharedmat[7*16*16 + localidxm];
    mat[8*160*160 + globalid] = sharedmat[8*16*16 + localid] + 0.8f * sharedmat[8*16*16 + localidxp]
                              + 0.2f * sharedmat[8*16*16 + localidxm];
    mat[9*160*160 + globalid] = sharedmat[9*16*16 + localid] + 0.8f * sharedmat[9*16*16 + localidxp]
                              + 0.2f * sharedmat[9*16*16 + localidxm];
}
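On paper, I do not see where a conflict could come from. Below is a small host-side sanity check I wrote (a sketch under my assumptions: 32 banks of 4-byte words, and one warp covering two consecutive 16-wide rows of the tile); it prints the bank each lane of warp 0 touches for the three shared loads of the write phase:

#include <cstdio>

int main()
{
    for (int lane = 0; lane < 32; ++lane)
    {
        int tx = lane % 16;                        // threadIdx.x of this lane
        int ty = lane / 16;                        // threadIdx.y of this lane
        int leftx  = (tx == 0  ? 0  : tx - 1);
        int rightx = (tx == 15 ? 15 : tx + 1);
        // bank = word index % 32; the slice offset k*16*16 is a multiple of 32,
        // so the bank is the same for every slice k
        printf("lane %2d: self bank %2d, left bank %2d, right bank %2d\n",
               lane, (ty*16 + tx) % 32, (ty*16 + leftx) % 32, (ty*16 + rightx) % 32);
    }
    return 0;
}

Every lane hits a distinct bank for the self load, and the two clamped edge lanes of each row read the same word (which should be a broadcast, not a conflict), so in theory none of these patterns should conflict for any n. That is exactly what puzzles me about the numbers below.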
Test environment
Ubuntu 18.04, sm_75 (the shared memory limit should be 48 KB per block, with 32 banks)
nvcc -lineinfo -rdc=true -o matrixbank bank.cu
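For completeness, the numbers below come from the Nsight Compute CLI; a command along these lines should reproduce the full report (options as I understand them, adjust to your ncu version):

ncu --set full -o matrixbank_report ./matrixbank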
Memory
When the vertical (z) dimension n is 7 or 8, Nsight Compute's Memory Workload Analysis section reports 0 bank conflicts.

| n | Instructions | Wavefronts | Peak (%) | Bank conflicts | L1 hit rate (%) | L2 hit rate (%) |
|---|---|---|---|---|---|---|
| 7 | 22400 | 24100 | 14.08 | 0 | 40.21 | 49.66 |
| 8 | 25600 | 27300 | 13.85 | 0 | 37.34 | 49.82 |
| 9 | 28800 | 31064 | 11.53 | 377/162/69/608 | 18.59 | 49.85 |
| 10 | 32000 | 34568 | 12.73 | 663/314/8/985 | 15.68 | 49.72 |

Note: the bank conflicts column lists Shared Load / Shared Store / Other / Total.
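For reference, the static shared memory per block is 16*16*n*sizeof(float), assuming the array bound in the kernel is changed along with n:

n = 7  -> 16*16*7*4 B  = 7168 B  = 7 KB
n = 8  -> 16*16*8*4 B  = 8192 B  = 8 KB
n = 9  -> 16*16*9*4 B  = 9216 B  = 9 KB
n = 10 -> 16*16*10*4 B = 10240 B = 10 KB

The boundary between the conflict-free and the conflicting runs falls exactly between 8 KB and 9 KB.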
Stall state
For every n, the Source Counters section reports the highest stall_long_sb values on the same two load-phase lines:

| n | sharedmat[localid] = mat[globalid]; | sharedmat[3*16*16 + localid] = mat[3*160*160 + globalid]; |
|---|---|---|
| 10 | 17 | 20 |
| 9 | 29 | 11 |
| 8 | 18 | 30 |
| 7 | 18 | 30 |

The stall_mio values are much smaller than stall_long_sb, ranging from 1 to 4.
Instruction statistics
The instruction statistics report the following ordering of instruction counts:
LDS > FFMA > STS = STG = LDG.
FFMA is around 66.66 % of LDS, while STS, STG and LDG are each about 33.33 % of LDS.
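For what it's worth, those ratios match a hand count of the kernel above, per thread and per z-slice:

load phase : 1 LDG + 1 STS
write phase: 3 LDS (self, left, right) + 2 FFMA + 1 STG
ratio      : LDS : FFMA : STS : STG : LDG = 3 : 2 : 1 : 1 : 1

i.e. FFMA = 2/3 of LDS and each of the other three = 1/3 of LDS.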
My conclusion
I believe the lower L1 hit rate may cause higher long-scoreboard stalls, and then the bank conflicts appear.
The question is: since n = 7 and n = 8 show stall_long_sb values at the same (bad) level as n = 9 and n = 10, why do the bank conflicts disappear there? Is this related to the shared memory size per block (8 KB vs 9 KB)?
My card has 48 KB of shared memory per block, 64 KB per multiprocessor, and 16 SMs.
Each block contains 8 warps (16 x 16 = 256 threads). Maybe the current configuration exceeds a limit of the warp scheduler and degrades the SM's performance (this is just my guess).
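To check this guess, one possibility is to ask the runtime directly. A minimal sketch, assuming it is pasted into bank.cu after the kernel definition and called from main; as far as I know, cudaOccupancyMaxActiveBlocksPerMultiprocessor accounts for the kernel's static __shared__ allocation, so rebuilding with n = 7..10 should show whether the resident blocks per SM actually drop:

#include <cstdio>  // for printf (place at the top of the file)

// hypothetical helper, not part of the original code
void printOccupancy()
{
    int blocksPerSM = 0;
    // 16*16 = 256 threads per block, 0 bytes of dynamic shared memory
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, loadmatrix, 16*16, 0);
    printf("resident blocks per SM: %d (%d warps)\n", blocksPerSM, blocksPerSM * 8);
}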
Thank you so much if you can correct me or post your suggestions!