How to profile such metrics like l1tex__data_bank_conflicts_pipe_lsu_mem_global?

Bank conflicts occurs in accessing shared memory during load/store. When I use code shown as follows:
__global__ void
bank_conf(const int* dev_a, int size) {
extern __shared__ int cache;
int tidx = blockIdx.x * blockDim.x + threadIdx.x;
cache[tidx * 2] += dev_a[tidx];
}
The results of some metrics tested in my program shown as follows:
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
l1tex__data_bank_conflicts_pipe_lsu_mem_global.avg 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global.max 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global.min 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global.sum 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st.avg 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st.max 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st.min 0
l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st.sum 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.avg 0.05
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.max 2
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.min 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum 2
---------------------------------------------------------------------- --------------- ------------------------------
Can anyone help me to explain why the metrics called l1tex__data_bank_conflicts_pipe_lsu_mem_global and l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st do not work, and how to understand metrics like them?

l1tex__data_bank_conflicts_pipe_lsu_mem_global measures the number of data bank conflicts generated by global operations.

l1tex__data_bank_conflicts_pipe_lsu_mem_shared measures the number of shared memory data bank conflicts generated by LDS, LD, 3D attribute loads, LDSM, STS, ST, ATOMS, ATOM, 3D attribute stores, LDGSTS and Misc.

l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st measures the number of data bank conflicts generated by global stores.

I see no reason that your output indicates that the former mem_global metric does not work. There appear to be conflicts for your writes to shared memory (cache), which are tracked in the mem_shared metric. There aren’t any conflicts for global loads (dev_a), since the accessed are coalesced. Therefore, the mem_global metric is 0. Your code does not have any stores to global memory, so the global_op_st metric is 0, too.

For a more intuitive display of these values, I recommend you collect the SourceCounters section and inspect it on the UI’s Details and Source pages.

Hi, felix.
I just saw this post and tried some tests for bank conflicts of global memory. Here is my source code.

#include <cuda_runtime.h>
#include <iostream>

__global__ void metrics_kernel(float *src, float *dst, int count, int N) {
  int id_x = blockIdx.x * blockDim.x + threadIdx.x;

  if (id_x >= count) {
    return;
  }

  dst[id_x * N] = src[id_x * N];
}

int main(int argc, char *argv[]) {
  int N = 1;

  if (argc == 2) {
    N = atoi(argv[1]);
  }

  const int kCount = 32;

  // alloc N times the memory space for stride load
  int size = kCount * sizeof(float) * N;

  float *src = static_cast<float *>(malloc(size));
  float *dst = static_cast<float *>(malloc(size));

  for (int i = 0; i < kCount * N; ++i) {
    src[i] = i;
  }

  float *src_dev = nullptr;
  float *dst_dev = nullptr;

  cudaMalloc(&src_dev, size);
  cudaMalloc(&dst_dev, size);

  cudaMemcpy(src_dev, src, size, cudaMemcpyHostToDevice);

  dim3 block(32, 1);
  dim3 grid(1, 1);

  metrics_kernel<<<grid, block>>>(src_dev, dst_dev, kCount, N);

  cudaStreamSynchronize(0);

  cudaFree(src_dev);
  cudaFree(dst_dev);

  free(src);
  free(dst);

  return 0;
}

Compile command is nvcc -arch=sm_86. Then test it on 3080 card
When i run the command ./a.out 1 which has a nice coalesced access, ncu report zero of metrics value.

    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__data_bank_conflicts_pipe_lsu_mem_global.sum                                                                  0
    l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_ld.sum                                                            0
    l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st.sum                                                            0
    ---------------------------------------------------------------------- --------------- ------------------------------

Then i run the command ./a.out 2which make stride step is 2, and ncu report non-zero value as follow.

./a.out 2
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__data_bank_conflicts_pipe_lsu_mem_global.sum                                                                  2
    l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_ld.sum                                                            1
    l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st.sum                                                            1
    ---------------------------------------------------------------------- --------------- ------------------------------

./a.out 3
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__data_bank_conflicts_pipe_lsu_mem_global.sum                                                                  0
    l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_ld.sum                                                            0
    l1tex__data_bank_conflicts_pipe_lsu_mem_global_op_st.sum                                                            0
    ---------------------------------------------------------------------- --------------- ------------------------------

But i found stride step of 3 has no bank conflict of global memory.
Then i test 4, 5, 6 …

  1. I found even correspond to bank conflict, but odd not.

  2. Next, I test metrics l1tex__data_pipe_lsu_wavefronts_mem_lg_cmd_read,
    i found even correspond to multi wavefronts, and odd correspond to only one wavefront.

Why I can get results as 1. ?
And is there a correlation between 1. and 2. ?

Thank you very much.