My code shows that once the size of the shared memory allocation reaches a threshold (8 KB), Nsight Compute reports bank conflicts.
Quick question: can bank conflicts be influenced by the size of the shared memory allocation?
Here is the code:
__global__
void loadmatrix(float* mat);

int main()
{
    float* mat;
    int row = 160;
    int col = 160;
    // three-dimensional matrix (x and y both 160, z = 10;
    // the 10 can be replaced by 7, 8 and 9 for further tests)
    cudaMallocManaged(&mat, row*col*10*sizeof(float), cudaMemAttachGlobal);
    // initialize the 3D matrix with test values
    for (int l = 0; l < 10; ++l)
    {
        for (int i = 0; i < row; ++i)
        {
            for (int j = 0; j < col; ++j)
            {
                mat[l*row*col + i*col + j] = i*col + j*l;
            }
        }
    }
    dim3 numblock(10, 10);
    dim3 numthread(16, 16);
    loadmatrix<<<numblock, numthread>>>(mat);
    cudaDeviceSynchronize();
    cudaFree(mat);
}
__global__
void loadmatrix(float* mat)
{
    // dimension of the sub-matrix is 16 x 16 x n (n from 7 to 10 is tested)
    __shared__ float sharedmat[16*16*10];
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int idy = blockIdx.y*blockDim.y + threadIdx.y;
    int globalid = idy * 160 + idx;
    int localid  = threadIdx.y * 16 + threadIdx.x;
    // read from global to shared, unrolled manually for n = 10
    sharedmat[          localid] = mat[            globalid];
    sharedmat[1*16*16 + localid] = mat[1*160*160 + globalid];
    sharedmat[2*16*16 + localid] = mat[2*160*160 + globalid];
    sharedmat[3*16*16 + localid] = mat[3*160*160 + globalid];
    sharedmat[4*16*16 + localid] = mat[4*160*160 + globalid];
    sharedmat[5*16*16 + localid] = mat[5*160*160 + globalid];
    sharedmat[6*16*16 + localid] = mat[6*160*160 + globalid];
    sharedmat[7*16*16 + localid] = mat[7*160*160 + globalid];
    sharedmat[8*16*16 + localid] = mat[8*160*160 + globalid];
    sharedmat[9*16*16 + localid] = mat[9*160*160 + globalid];
    __syncthreads();
    // each element reads its left and right neighbours and sums them up with scale parameters
    int leftx  = (threadIdx.x == 0  ? 0  : threadIdx.x - 1);
    int rightx = (threadIdx.x == 15 ? 15 : threadIdx.x + 1);
    int localidxp = threadIdx.y*16 + leftx;
    int localidxm = threadIdx.y*16 + rightx;
    // write the results to global memory directly, unrolled manually for n = 10
    mat[            globalid] = sharedmat[          localid] + 0.8f * sharedmat[          localidxp]
                              + 0.2f * sharedmat[          localidxm];
    mat[1*160*160 + globalid] = sharedmat[1*16*16 + localid] + 0.8f * sharedmat[1*16*16 + localidxp]
                              + 0.2f * sharedmat[1*16*16 + localidxm];
    mat[2*160*160 + globalid] = sharedmat[2*16*16 + localid] + 0.8f * sharedmat[2*16*16 + localidxp]
                              + 0.2f * sharedmat[2*16*16 + localidxm];
    mat[3*160*160 + globalid] = sharedmat[3*16*16 + localid] + 0.8f * sharedmat[3*16*16 + localidxp]
                              + 0.2f * sharedmat[3*16*16 + localidxm];
    mat[4*160*160 + globalid] = sharedmat[4*16*16 + localid] + 0.8f * sharedmat[4*16*16 + localidxp]
                              + 0.2f * sharedmat[4*16*16 + localidxm];
    mat[5*160*160 + globalid] = sharedmat[5*16*16 + localid] + 0.8f * sharedmat[5*16*16 + localidxp]
                              + 0.2f * sharedmat[5*16*16 + localidxm];
    mat[6*160*160 + globalid] = sharedmat[6*16*16 + localid] + 0.8f * sharedmat[6*16*16 + localidxp]
                              + 0.2f * sharedmat[6*16*16 + localidxm];
    mat[7*160*160 + globalid] = sharedmat[7*16*16 + localid] + 0.8f * sharedmat[7*16*16 + localidxp]
                              + 0.2f * sharedmat[7*16*16 + localidxm];
    mat[8*160*160 + globalid] = sharedmat[8*16*16 + localid] + 0.8f * sharedmat[8*16*16 + localidxp]
                              + 0.2f * sharedmat[8*16*16 + localidxm];
    mat[9*160*160 + globalid] = sharedmat[9*16*16 + localid] + 0.8f * sharedmat[9*16*16 + localidxp]
                              + 0.2f * sharedmat[9*16*16 + localidxm];
}
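On paper, I do not see where a conflict could come from. Below is a small host-side sanity check I wrote (a sketch under my assumptions: 32 banks of 4-byte words, and one warp covering two consecutive 16-wide rows of the tile); it prints the bank each lane of warp 0 touches for the three shared loads of the write phase:

#include <cstdio>

int main()
{
    for (int lane = 0; lane < 32; ++lane)
    {
        int tx = lane % 16;                        // threadIdx.x of this lane
        int ty = lane / 16;                        // threadIdx.y of this lane
        int leftx  = (tx == 0  ? 0  : tx - 1);
        int rightx = (tx == 15 ? 15 : tx + 1);
        // bank = word index % 32; the slice offset k*16*16 is a multiple of 32,
        // so the bank is the same for every slice k
        printf("lane %2d: self bank %2d, left bank %2d, right bank %2d\n",
               lane, (ty*16 + tx) % 32, (ty*16 + leftx) % 32, (ty*16 + rightx) % 32);
    }
    return 0;
}

Every lane hits a distinct bank for the self load, and the two clamped edge lanes of each row read the same word (which should be a broadcast, not a conflict), so in theory none of these patterns should conflict for any n. That is exactly what puzzles me about the numbers below.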
Test environment
Ubuntu 18.04, sm_75 (the shared memory limit should be 48 KB per block, with 32 banks)
nvcc -lineinfo -rdc=true -o matrixbank bank.cu
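For completeness, the numbers below come from the Nsight Compute CLI; a command along these lines should reproduce the full report (options as I understand them, adjust to your ncu version):

ncu --set full -o matrixbank_report ./matrixbank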
Memory
When the vertical (z) dimension n is 7 or 8, Nsight Compute's Memory Workload Analysis section reports 0 bank conflicts.

| n | Instructions | Wavefronts | Peak (%) | Bank conflicts | L1 hit rate (%) | L2 hit rate (%) |
|---|---|---|---|---|---|---|
| 7 | 22400 | 24100 | 14.08 | 0 | 40.21 | 49.66 |
| 8 | 25600 | 27300 | 13.85 | 0 | 37.34 | 49.82 |
| 9 | 28800 | 31064 | 11.53 | 377/162/69/608 | 18.59 | 49.85 |
| 10 | 32000 | 34568 | 12.73 | 663/314/8/985 | 15.68 | 49.72 |

Note: the bank conflicts column lists Shared Load / Shared Store / Other / Total.
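For reference, the static shared memory per block is 16*16*n*sizeof(float), assuming the array bound in the kernel is changed along with n:

n = 7  -> 16*16*7*4 B  = 7168 B  = 7 KB
n = 8  -> 16*16*8*4 B  = 8192 B  = 8 KB
n = 9  -> 16*16*9*4 B  = 9216 B  = 9 KB
n = 10 -> 16*16*10*4 B = 10240 B = 10 KB

The boundary between the conflict-free and the conflicting runs falls exactly between 8 KB and 9 KB.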
Stall state
For every n, the Source Counters section reports the highest stall_long_sb values on the same two load-phase lines:

| n | sharedmat[localid] = mat[globalid]; | sharedmat[3*16*16 + localid] = mat[3*160*160 + globalid]; |
|---|---|---|
| 10 | 17 | 20 |
| 9 | 29 | 11 |
| 8 | 18 | 30 |
| 7 | 18 | 30 |

The stall_mio values are much smaller than stall_long_sb, ranging from 1 to 4.
Instruction statistics
The instruction statistics report the following ordering of instruction counts:
LDS > FFMA > STS = STG = LDG.
FFMA is around 66.66 % of LDS, while STS, STG and LDG are each about 33.33 % of LDS.
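For what it's worth, those ratios match a hand count of the kernel above, per thread and per z-slice:

load phase : 1 LDG + 1 STS
write phase: 3 LDS (self, left, right) + 2 FFMA + 1 STG
ratio      : LDS : FFMA : STS : STG : LDG = 3 : 2 : 1 : 1 : 1

i.e. FFMA = 2/3 of LDS and each of the other three = 1/3 of LDS.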
My conclusion
I believe the lower L1 hit rate may cause higher long-scoreboard stalls, and then the bank conflicts appear.
The question is: since n = 7 and n = 8 show stall_long_sb values at the same (bad) level as n = 9 and n = 10, why do the bank conflicts disappear there? Is this related to the shared memory size per block (8 KB vs 9 KB)?
My card has 48 KB of shared memory per block, 64 KB per multiprocessor, and 16 SMs.
Each block contains 8 warps (16 x 16 = 256 threads). Maybe the current configuration exceeds a limit of the warp scheduler and degrades the SM's performance (this is just my guess).
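To check this guess, one possibility is to ask the runtime directly. A minimal sketch, assuming it is pasted into bank.cu after the kernel definition and called from main; as far as I know, cudaOccupancyMaxActiveBlocksPerMultiprocessor accounts for the kernel's static __shared__ allocation, so rebuilding with n = 7..10 should show whether the resident blocks per SM actually drop:

#include <cstdio>  // for printf (place at the top of the file)

// hypothetical helper, not part of the original code
void printOccupancy()
{
    int blocksPerSM = 0;
    // 16*16 = 256 threads per block, 0 bytes of dynamic shared memory
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, loadmatrix, 16*16, 0);
    printf("resident blocks per SM: %d (%d warps)\n", blocksPerSM, blocksPerSM * 8);
}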
Thank you so much if you can correct me or post your suggestions!