I am testing shared-memory bank conflicts on my A800 GPU with the following kernel:
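__global__ void setRowReadRow(int *out)
{
    // Linear thread index within the block.
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
    __shared__ int shmem[32][32];
    // Each thread reads column m of its own row.
    // shmem is never written; only the access pattern matters for this test.
    for (unsigned int m = 0; m < 1; m++) {
        out[idx] *= shmem[idx][m];
    }
}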
The block size is 32 and the grid size is 1. I profiled the kernel with ncu using the following command:
ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared,\
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum \
./bank_conflict 1
Here is the result:
==PROF== Connected to process 449606 (/home/xxx/bank_conflict )
==PROF== Profiling "setRowReadRow(int *)" - 0: 0%....50%....100% - 1 pass
==PROF== Disconnected from process 449606
==WARNING== Found outstanding GPU clock reset, trying to revert...Success.
[449606] shared_mem_config@127.0.0.1
setRowReadRow(int *) (1, 1, 1)x(32, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
Section: Command line profiler metrics
-------------------------------------------------------- ----------- ------------
Metric Name                                              Metric Unit Metric Value
-------------------------------------------------------- ----------- ------------
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum                       31
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.avg                           0.29
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.max                             31
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.min                              0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum                             31
-------------------------------------------------------- ----------- ------------
This is what I expected: shared memory has 32 banks, all 32 threads load from the same bank, and the loop runs once, so the bank conflict count is 31.
But when I changed the loop count to 2, the result was as follows:
==PROF== Connected to process 454331 (/home/heyuanhong/shared_mem_config)
==PROF== Profiling "setRowReadRow(int *)" - 0: 0%....50%....100% - 1 pass
==PROF== Disconnected from process 454331
==WARNING== Found outstanding GPU clock reset, trying to revert...Success.
[454331] shared_mem_config@127.0.0.1
setRowReadRow(int *) (1, 1, 1)x(32, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
Section: Command line profiler metrics
-------------------------------------------------------- ----------- ------------
Metric Name                                              Metric Unit Metric Value
-------------------------------------------------------- ----------- ------------
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum                       30
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.avg                           0.28
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.max                             30
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.min                              0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum                             30
-------------------------------------------------------- ----------- ------------
The bank conflict count is now 30, which puzzles me. When m=0, all 32 threads load from bank 0, which leads to 31 bank conflicts; when m=1, all 32 threads load from bank 1, which also leads to 31 bank conflicts.
By my understanding there should be 62 bank conflicts in total, so why does ncu report only 30?
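To make my expectation concrete, here is a minimal host-side sketch of the arithmetic behind it. It assumes 32 banks that are each one 4-byte word wide, and that every loop iteration is issued as a separate 32-bit load per thread. The names bank_of, extra_passes and expected_total are hypothetical helpers for illustration only. The model describes my mental picture of the access pattern, not what the compiler actually emits for the kernel.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kBanks = 32;  // assumed: 32 banks, each 4 bytes wide
constexpr int kWarp  = 32;  // threads per warp (block size in the test)
constexpr int kCols  = 32;  // row length of int shmem[32][32]

// Bank that holds the 4-byte word shmem[row][col].
int bank_of(int row, int col) {
    assert(row >= 0 && col >= 0 && col < kCols);
    return (row * kCols + col) % kBanks;
}

// Extra passes for one warp-wide load of shmem[idx][m], idx = 0..31.
// Every thread reads a different word, so the busiest bank needs one
// pass per thread that maps to it; the first pass is not a conflict.
int extra_passes(int m) {
    std::array<int, kBanks> words_per_bank{};
    for (int idx = 0; idx < kWarp; ++idx) {
        ++words_per_bank[bank_of(idx, m)];
    }
    return *std::max_element(words_per_bank.begin(), words_per_bank.end()) - 1;
}

// Conflicts I would expect if each iteration were an independent load.
int expected_total(int loop_count) {
    int total = 0;
    for (int m = 0; m < loop_count; ++m) {
        total += extra_passes(m);
    }
    return total;
}
```

Under these assumptions expected_total(1) gives 31, which matches the first ncu run, and expected_total(2) gives 62, which is the number I expected but did not get in the second run.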