Ncu detects bank conflicts in matrix transposition after padding

In matrix transposition, I first load data from global memory into shared memory. To avoid bank conflicts when reading from shared memory, I pad each row of the shared memory tile with an extra element. The transpose kernel is as follows:

constexpr int offset = 1;
template<int TX, int TY>
__global__ void transpose_kernel(float *in, float *out, int in_height, int in_width)
{
    __shared__ float shmm[TY * (TX + offset)]; // TY * (TX + offset) floats, e.g. 16 * 33 * 4 B = 2112 B for TX=32, TY=16
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int tx = threadIdx.y * blockDim.x + threadIdx.x; // linear thread id within the block
    // Load one tile from global memory into the padded shared memory buffer
    if (row < in_height && col < in_width) {
        shmm[threadIdx.y * (TX + offset) + threadIdx.x] = in[row * in_width + col];
    }
    __syncthreads();

    // Remap threads so that consecutive threads write consecutive output elements
    int row_i = tx / blockDim.y;
    int col_i = tx % blockDim.y;
    col = blockIdx.y * blockDim.y + col_i;
    row = blockIdx.x * blockDim.x + row_i;
    if (row < in_width && col < in_height) {
        out[row * in_height + col] = shmm[col_i * (TX + offset) + row_i];
        // if (tx / 32 != 0 && tx % 32 == (col_i * (TX + offset) + row_i) % 32)
        //     printf("tx[%d] (%d %d) bank %d\n", tx, tx / 32, tx % 32, (col_i * (TX + offset) + row_i) % 32);
    }
}

With a bank size of 4 bytes and 32 banks accessible per warp, padding each row by one element (offset = 1) should avoid bank conflicts. However, as the ncu profile below shows, with offset = 1 the bank conflicts are reduced, but ncu still detects some.
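For reference, a matching host-side launch looks roughly like this (a minimal sketch compiled in the same file as the kernel above; the matrix size is a placeholder, and TX = 32, TY = 16 follow the block layout used in the profile):

#include <cuda_runtime.h>

int main()
{
    const int in_height = 4096, in_width = 4096; // placeholder size
    const int TX = 32, TY = 16;                  // block layout used in the profile
    float *d_in, *d_out;
    cudaMalloc(&d_in,  in_height * in_width * sizeof(float));
    cudaMalloc(&d_out, in_height * in_width * sizeof(float));
    dim3 block(TX, TY);
    dim3 grid((in_width + TX - 1) / TX, (in_height + TY - 1) / TY);
    transpose_kernel<TX, TY><<<grid, block>>>(d_in, d_out, in_height, in_width);
    cudaDeviceSynchronize();
    cudaFree(d_in); cudaFree(d_out);
}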

Here is the source code and ncu profile file:
save.tar.gz (58.5 KB)

Thanks for reaching out. I’m not sure what your question is. Are you expecting to see zero bank conflicts with your padded implementation?

yes!

Hi zhaopeng_eng,

your understanding of the organization of the shared memory banks seems correct to me. However, the issue you are seeing is caused by how threads are actually grouped into warps. For your block layout of TX=32 and TY=16, each warp executes one row of the thread block.
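To make that grouping explicit, here is a minimal standalone sketch (not part of your sample) that prints the starting thread of each warp for this block layout:

#include <cstdio>

__global__ void show_warp_layout()
{
    // Threads are linearized in x-major order before being split into warps:
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    if (tid % 32 == 0) // one print per warp
        printf("warp %2d starts at threadIdx = (%d, %2d)\n",
               tid / 32, threadIdx.x, threadIdx.y);
}

int main()
{
    show_warp_layout<<<1, dim3(32, 16)>>>();
    cudaDeviceSynchronize(); // flush device-side printf
}

Each of the 16 warps reports threadIdx.y == warp id, i.e. one warp per row of the thread block.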

To print the thread indices, banks, and bank offsets for the first warp of the first block in your code sample, I changed your existing commented debug output to:

if (blockIdx.x == 0 && blockIdx.y == 0 && blockIdx.z == 0 && threadIdx.x < 32 && threadIdx.y == 0) {
    const auto index = col_i * (TX + offset) + row_i;
    printf("tx[%2d]: %3d bank:%2d offset:%d\n", threadIdx.x, index, index%32, index/32);
}

For the first variant of your code with offset = 0, the printed output is shown below:

tx[ 0]:   0 bank: 0 offset:0
tx[ 1]:  32 bank: 0 offset:1
tx[ 2]:  64 bank: 0 offset:2
tx[ 3]:  96 bank: 0 offset:3
tx[ 4]: 128 bank: 0 offset:4
tx[ 5]: 160 bank: 0 offset:5
tx[ 6]: 192 bank: 0 offset:6
tx[ 7]: 224 bank: 0 offset:7
tx[ 8]: 256 bank: 0 offset:8
tx[ 9]: 288 bank: 0 offset:9
tx[10]: 320 bank: 0 offset:10
tx[11]: 352 bank: 0 offset:11
tx[12]: 384 bank: 0 offset:12
tx[13]: 416 bank: 0 offset:13
tx[14]: 448 bank: 0 offset:14
tx[15]: 480 bank: 0 offset:15
tx[16]:   1 bank: 1 offset:0
tx[17]:  33 bank: 1 offset:1
tx[18]:  65 bank: 1 offset:2
tx[19]:  97 bank: 1 offset:3
tx[20]: 129 bank: 1 offset:4
tx[21]: 161 bank: 1 offset:5
tx[22]: 193 bank: 1 offset:6
tx[23]: 225 bank: 1 offset:7
tx[24]: 257 bank: 1 offset:8
tx[25]: 289 bank: 1 offset:9
tx[26]: 321 bank: 1 offset:10
tx[27]: 353 bank: 1 offset:11
tx[28]: 385 bank: 1 offset:12
tx[29]: 417 bank: 1 offset:13
tx[30]: 449 bank: 1 offset:14
tx[31]: 481 bank: 1 offset:15

As you also stated in your comment, this pattern leads to a lot of bank conflicts. More specifically, it is a 16-way bank conflict for each warp: with offset = 0 the index col_i * 32 + row_i maps to bank row_i, and row_i = tx / 16 only takes the values 0 and 1 within a warp. We therefore access only 2 banks and have to read 16 unique bank offsets from each of them. If you look at the Source Page, the kernel executes this shared memory load for 655,360 warps. For a 16-way bank conflict, that ends up being 16 * 655,360 = 10,485,760 L1 Wavefronts Shared. We classify 15 of the 16 wavefronts as excessive, since an ideal access pattern for this load would take only a single wavefront. Consequently, the metric L1 Wavefronts Shared Excessive ends up being 15 * 655,360 = 9,830,400.
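The same arithmetic as a small host-side sketch (the warp and way counts are taken from the report and the analysis above):

#include <cstdio>

int main()
{
    const long long warps = 655360; // executed shared loads, from the report
    const long long ways  = 16;     // 16-way conflict with offset = 0
    printf("L1 Wavefronts Shared:           %lld\n", ways * warps);        // 10,485,760
    printf("L1 Wavefronts Shared Excessive: %lld\n", (ways - 1) * warps);  // 9,830,400
}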

Repeating the same analysis for your code with offset = 1 leads to the following output:

tx[ 0]:   0 bank: 0 offset:0
tx[ 1]:  33 bank: 1 offset:1
tx[ 2]:  66 bank: 2 offset:2
tx[ 3]:  99 bank: 3 offset:3
tx[ 4]: 132 bank: 4 offset:4
tx[ 5]: 165 bank: 5 offset:5
tx[ 6]: 198 bank: 6 offset:6
tx[ 7]: 231 bank: 7 offset:7
tx[ 8]: 264 bank: 8 offset:8
tx[ 9]: 297 bank: 9 offset:9
tx[10]: 330 bank:10 offset:10
tx[11]: 363 bank:11 offset:11
tx[12]: 396 bank:12 offset:12
tx[13]: 429 bank:13 offset:13
tx[14]: 462 bank:14 offset:14
tx[15]: 495 bank:15 offset:15
tx[16]:   1 bank: 1 offset:0
tx[17]:  34 bank: 2 offset:1
tx[18]:  67 bank: 3 offset:2
tx[19]: 100 bank: 4 offset:3
tx[20]: 133 bank: 5 offset:4
tx[21]: 166 bank: 6 offset:5
tx[22]: 199 bank: 7 offset:6
tx[23]: 232 bank: 8 offset:7
tx[24]: 265 bank: 9 offset:8
tx[25]: 298 bank:10 offset:9
tx[26]: 331 bank:11 offset:10
tx[27]: 364 bank:12 offset:11
tx[28]: 397 bank:13 offset:12
tx[29]: 430 bank:14 offset:13
tx[30]: 463 bank:15 offset:14
tx[31]: 496 bank:16 offset:15

From the output we can see that we now read from more banks, but we still have to read 2 unique values from most of the accessed banks. In other words, we now have a 2-way bank conflict for each warp. Looking at the numbers on the Source Page again for this access, we end up with 2 * 655,360 = 1,310,720 L1 Wavefronts Shared overall. One of the two wavefronts is considered excessive, which leaves us with 655,360 L1 Wavefronts Shared Excessive. For comparison, the report shows the corresponding output for this case.
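Independently of the report, both multiplicities can be reproduced on the host by simulating the index calculation for one warp (a standalone sketch, assuming TX = 32, TY = 16, and 32 four-byte banks as above):

#include <cstdio>

int main()
{
    const int TX = 32, TY = 16;
    for (int offset = 0; offset <= 1; ++offset) {
        int per_bank[32] = {0};
        for (int tx = 0; tx < 32; ++tx) {          // lanes of one warp
            int row_i = tx / TY, col_i = tx % TY;  // remapping from the kernel
            ++per_bank[(col_i * (TX + offset) + row_i) % 32];
        }
        int ways = 0;
        for (int b = 0; b < 32; ++b)
            if (per_bank[b] > ways) ways = per_bank[b];
        printf("offset = %d -> %2d-way bank conflict\n", offset, ways);
    }
}

It prints a 16-way conflict for offset = 0 and a 2-way conflict for offset = 1, matching the tables above.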

I hope that helps to explain the numbers in the report and why you see the reduction in wavefronts when switching between the variants. To get to zero bank conflicts for this sample, you need to further change the index calculation so that the 32 threads of each warp access all 32 banks.
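One way to achieve that with this kernel is a 32x32 block: col_i = tx % 32 then covers all 32 values within a warp, and with offset = 1 each value lands in a distinct bank, because consecutive padded rows are shifted by (TX + 1) % 32 = 1 bank. A launch sketch (d_in, d_out, and the matrix dimensions are placeholders):

const int TX = 32, TY = 32;
dim3 block(TX, TY); // 1024 threads per block, the hardware maximum
dim3 grid((in_width + TX - 1) / TX, (in_height + TY - 1) / TY);
transpose_kernel<TX, TY><<<grid, block>>>(d_in, d_out, in_height, in_width);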


Hi mstrengert,
Thank you very much for your detailed explanation. Following it, I set TY to 32. In theory, with offset = 1 this should now be free of bank conflicts, and as shown below, the value of L1 Wavefronts Shared Excessive is indeed zero.

But in the Memory Workload Analysis section, there are still 108,191 bank conflicts. Can you explain why these bank conflicts remain?


Here is the ncu file:
opt_ty32.ncu-rep (151.0 KB)

Hi zhaopeng_eng,

there is indeed a difference in the reported bank conflicts on the Source Page versus the Details Page.

On the Source Page, the reported bank conflicts solely originate from the memory access pattern of the corresponding source line. For every executed shared memory access, we calculate the conflicts within the warp caused by the access pattern of the warp's active threads. For your updated code sample, this is now reduced to zero.

The bank conflicts reported on the Details Page include all of these conflicts plus additional conflicts caused by multiple clients trying to access the memory banks at the same time. For more details, please also have a look at How to Understand and Optimize Shared Memory Accesses using Nsight Compute | NVIDIA On-Demand; the difference and its root cause are briefly discussed around minute 21 of the recording. In short, since the L1 cache and shared memory are backed by the same physical memory banks, there can be additional conflicts across warps from different clients accessing this physical memory, and the numbers on the Details Page include them.
