Bank conflicts occurs in accessing shared memory during load/store. When I use code shown as follows:
__global__ void
bank_conf(const int* dev_a, int size) {
extern __shared__ int cache;
int tidx = blockIdx.x * blockDim.x + threadIdx.x;
cache[tidx * 2] += dev_a[tidx];
}
The results of some metrics tested in my program shown as follows:
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
l1tex__data_bank_conflicts_pipe_lsu_mem_global.avg 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global.max 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global.min 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global.sum 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st.avg 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st.max 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st.min 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st.sum 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.avg 0.05
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.max 2
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.min 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum 2
---------------------------------------------------------------------- --------------- ------------------------------
Can anyone help me to explain why the metrics called l1tex__data_bank_conflicts_pipe_lsu_mem_global and l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st do not work, and how to understand metrics like them?
l1tex__data_bank_conflicts_pipe_lsu_mem_global measures the number of data bank conflicts generated by global operations.
l1tex__data_bank_conflicts_pipe_lsu_mem_shared measures the number of shared memory data bank conflicts generated by LDS, LD, 3D attribute loads, LDSM, STS, ST, ATOMS, ATOM, 3D attribute stores, LDGSTS and Misc.
l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st measures the number of data bank conflicts generated by global stores.
I see no reason that your output indicates that the former mem_global metric does not work. There appear to be conflicts for your writes to shared memory (cache), which are tracked in the mem_shared metric. There aren’t any conflicts for global loads (dev_a), since the accessed are coalesced. Therefore, the mem_global metric is 0. Your code does not have any stores to global memory, so the global_op_st metric is 0, too.
For a more intuitive display of these values, I recommend you collect the SourceCounters section and inspect it on the UI’s Details and Source pages.
Hi, felix.
I just saw this post and tried some tests for bank conflicts of global memory. Here is my source code.
#include <cuda_runtime.h>
#include <iostream>
__global__ void metrics_kernel(float *src, float *dst, int count, int N) {
int id_x = blockIdx.x * blockDim.x + threadIdx.x;
if (id_x >= count) {
return;
}
dst[id_x * N] = src[id_x * N];
}
int main(int argc, char *argv[]) {
int N = 1;
if (argc == 2) {
N = atoi(argv[1]);
}
const int kCount = 32;
// alloc N times the memory space for stride load
int size = kCount * sizeof(float) * N;
float *src = static_cast<float *>(malloc(size));
float *dst = static_cast<float *>(malloc(size));
for (int i = 0; i < kCount * N; ++i) {
src[i] = i;
}
float *src_dev = nullptr;
float *dst_dev = nullptr;
cudaMalloc(&src_dev, size);
cudaMalloc(&dst_dev, size);
cudaMemcpy(src_dev, src, size, cudaMemcpyHostToDevice);
dim3 block(32, 1);
dim3 grid(1, 1);
metrics_kernel<<<grid, block>>>(src_dev, dst_dev, kCount, N);
cudaStreamSynchronize(0);
cudaFree(src_dev);
cudaFree(dst_dev);
free(src);
free(dst);
return 0;
}
Compile command is nvcc -arch=sm_86
. Then test it on 3080 card
When i run the command ./a.out 1
which has a nice coalesced access, ncu report zero of metrics value.
---------------------------------------------------------------------- --------------- ------------------------------
l1tex__data_bank_conflicts_pipe_lsu_mem_global.sum 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_ld.sum 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st.sum 0
---------------------------------------------------------------------- --------------- ------------------------------
Then i run the command ./a.out 2
which make stride step is 2, and ncu report non-zero value as follow.
./a.out 2
---------------------------------------------------------------------- --------------- ------------------------------
l1tex__data_bank_conflicts_pipe_lsu_mem_global.sum 2
l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_ld.sum 1
l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st.sum 1
---------------------------------------------------------------------- --------------- ------------------------------
./a.out 3
---------------------------------------------------------------------- --------------- ------------------------------
l1tex__data_bank_conflicts_pipe_lsu_mem_global.sum 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_ld.sum 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st.sum 0
---------------------------------------------------------------------- --------------- ------------------------------
But i found stride step of 3 has no bank conflict of global memory.
Then i test 4, 5, 6 …
-
I found even correspond to bank conflict, but odd not.
-
Next, I test metrics
l1tex__data_pipe_lsu_wavefronts_mem_lg_cmd_read
,
i found even correspond to multi wavefronts, and odd correspond to only one wavefront.
Why I can get results as 1. ?
And is there a correlation between 1. and 2. ?
Thank you very much.