Does this code cause bank conflicts?

#include <stdio.h>

const int N = 256;

__global__ void write_double(int* input) {
    __shared__ double data[N];

    // Only thread 0 prints; every thread must perform the store, otherwise
    // there is no warp-wide shared-memory access to profile.
    if (threadIdx.x == 0) {
        printf("%p %p\n", (void*)&data[0], (void*)&data[1]);
    }

    int tid = threadIdx.x;
    data[tid] = input[tid];

    __syncthreads();
    input[tid] = (int)data[tid];  // read back so the store is not optimized away
}

int test1() {
    int* input;
    printf("sizeof double is %zu\n", sizeof(double));

    int n = N;
    cudaMalloc(&input, n * sizeof(int));
    write_double<<<1, N>>>(input);
    cudaDeviceSynchronize();  // make sure the device-side printf is flushed
    cudaFree(input);
    return 0;
}

int main() { return test1(); }

As far as I know, the bank width is 4 bytes and sizeof(double) is 8, so thread 0 accesses bytes [0~7], which fall in banks 0 and 1, and thread 16 accesses bytes [128~135], which also fall in banks 0 and 1. Is that correct?
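
For reference, here is a minimal host-side sketch of that bank arithmetic, assuming the usual configuration of 32 banks that are 4 bytes wide (these constants are assumptions baked into the snippet, not queried from the device):

#include <stdio.h>

int main(void) {
    const size_t BANK_BYTES = 4, NUM_BANKS = 32, ELEM_BYTES = 8;  /* sizeof(double) */
    for (int tid = 0; tid < 32; tid += 16) {        /* threads 0 and 16 */
        size_t addr = (size_t)tid * ELEM_BYTES;     /* byte offset into the shared array */
        printf("thread %2d: bytes [%zu..%zu] -> banks %zu and %zu\n",
               tid, addr, addr + ELEM_BYTES - 1,
               (addr / BANK_BYTES) % NUM_BANKS,
               (addr / BANK_BYTES + 1) % NUM_BANKS);
    }
    return 0;
}

It prints banks 0 and 1 for both thread 0 and thread 16, matching the reasoning above.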

I think there should be bank conflicts, but I can't see any in my profiling. I wonder why?

Thank you !

Your understanding of the bank width and of which banks are accessed is correct.

The Load/Store Unit (LSU) breaks load/store instructions wider than 32 bits per thread into multiple wavefronts, so the profiler shows an increase in wavefronts; however, these extra wavefronts are not reported as bank conflicts.
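
If you want to see this in Nsight Compute, you can compare the bank-conflict counter against the wavefront counter directly. A sketch of the invocation, assuming recent metric names and with ./a.out standing in for your compiled binary (metric names vary across versions, so verify them with ncu --query-metrics on your install):

ncu -k write_double \
    --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum \
    ./a.out

For the kernel above, you should see extra wavefronts for the 64-bit stores but a bank-conflict count of zero.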

There are no bank conflicts here because, when each thread accesses 8 bytes, the 32 threads in a warp are split into two phases for the shared-memory access:
first, threads T0~T15 access data across all 32 banks,
then threads T16~T31 access the next 32 banks' worth of data.

Similarly, when each thread accesses 16 bytes (e.g. a float4), the 32 threads are split into four phases, and each phase serves a quarter of the warp.
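
To make the distinction concrete, here is a minimal kernel sketch (my own illustration, not from this thread) that contrasts the conflict-free 8-byte case with a 4-byte strided pattern that really does conflict:

__global__ void bank_demo(int* out) {   // out must hold blockDim.x ints
    __shared__ double d[256];
    __shared__ int    s[512];
    int tid = threadIdx.x;

    // 8 bytes per thread: the LSU splits each warp into two 16-thread phases.
    // Each phase touches all 32 banks exactly once, so the profiler counts an
    // extra wavefront but reports no bank conflict.
    d[tid] = (double)tid;

    // 4 bytes per thread with stride 2: within a warp, threads t and t+16
    // both land in bank (2*t) % 32, a two-way conflict that IS reported.
    s[2 * tid] = tid;

    __syncthreads();
    out[tid] = (int)d[tid] + s[2 * tid];  // read back so the stores survive optimization
}

Launching this as bank_demo<<<1, 256>>>(out) and profiling it with the store metrics above should attribute conflicts only to the strided int store.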

Thank you. I have heard of this behavior; is there any official documentation that describes it?
