How `int4` vectors are copied without shared memory bank conflicts?

Hello! I am new to CUDA, and I found the cool trick (in sources of tcnn) to copy arrays of any type via int4 representation and simultanious copy of 128 bits of data:

int i  = threadIdx.x * sizeof(int4) / sizeof(T); // T is the type of in and out arrays
*(int4*)&out[i] = *(int4*)&in[i];

But what I do not quite understand is why there is no memory bank conflicts (suppose in and/or out are/is array(s) in the shared memory)? I mean if instead of int4 we would use some structure, like

struct myint4 {
    int32_t x, y, z, w;

Then we would face the problem of reading memory 4-way when accessing .x argument (for example) in all threads, so 0th, 8th, 16th and 24th threads in a warp would hit the same memory bank and I suppose it is much worse then copy elements one-by-one (or make the same trick but with uint32_t, and copy 32 bits per thread). However, it looks like it is actually faster.

Actually in my example the arrays are padded with 16 bytes padding each matrix row (it is 32, 64, 128 or 256 bytes, depending of WIDTH template parameter in original code). But as I understand to avoid bank conflicts the padding should be k * 4 bytes each 128 bytes, where k is coprime with 32 (or just one).