Hello! I am new to CUDA, and I found a cool trick (in the tcnn sources) for copying arrays of any type via an int4 representation, moving 128 bits of data per thread in one go:
int i = threadIdx.x * sizeof(int4) / sizeof(T); // T is the type of in and out arrays
*(int4*)&out[i] = *(int4*)&in[i];
But what I do not quite understand is why there are no memory bank conflicts (suppose in and/or out are arrays in shared memory). I mean, if instead of int4 we used some structure like
struct myint4 {
int32_t x, y, z, w;
};
Then we would face a 4-way bank conflict when all threads read the .x member (for example): threads 0, 8, 16 and 24 in a warp would hit the same memory bank, and I suppose that is much worse than copying elements one by one (or doing the same trick with uint32_t, copying 32 bits per thread). However, it looks like the int4 version is actually faster.
Actually, in my example each matrix row is padded with 16 bytes (a row is 32, 64, 128 or 256 bytes, depending on the WIDTH template parameter in the original code). But as I understand it, to avoid bank conflicts the padding should be k * 4 bytes per 128 bytes of row, where k is coprime with 32 (or just 1).