float4 Shared memory doesn't yield bank conflict according to nvprof when it should

The current documentation now seems to go back only as far as compute capability 5.0.

The information you quote is contained in the Programming Guide for CUDA 8:

" 128-Bit Accesses: The majority of 128-bit accesses will cause 2-way bank conflicts, even if no two threads in a quarter-warp access different addresses belonging to the same bank. Therefore, to determine the ways of bank conflicts, one must add 1 to the maximum number of threads in a quarter-warp that access different addresses belonging to the same bank."

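To make the counting rule in that passage concrete, here is a minimal host-side sketch of how I read it. Everything in it is an assumption for illustration: the helper name `ways_of_bank_conflict_128bit` is hypothetical (it is not a CUDA API), shared memory is assumed to have 32 banks of 4 bytes each, and every offset is assumed to be 16-byte aligned. It models only the quoted rule, not what the hardware or nvprof actually reports.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical helper, not part of CUDA: applies the quoted rule to the
// shared-memory byte offsets of the 128-bit accesses made by one
// quarter-warp (8 threads).
// Assumptions: 32 banks, each 4 bytes wide; offsets are 16-byte aligned.
int ways_of_bank_conflict_128bit(const std::vector<std::uint32_t>& byte_offsets) {
    constexpr std::uint32_t kBanks = 32;
    constexpr std::uint32_t kBankWidthBytes = 4;

    // An aligned 128-bit access covers 4 consecutive banks, so the first
    // bank it touches identifies the whole group of banks it uses.
    std::map<std::uint32_t, std::set<std::uint32_t>> addresses_per_bank;
    for (std::uint32_t offset : byte_offsets) {
        std::uint32_t first_bank = (offset / kBankWidthBytes) % kBanks;
        addresses_per_bank[first_bank].insert(offset);
    }

    // Largest number of different addresses that fall in the same bank.
    std::size_t max_distinct = 0;
    for (const auto& entry : addresses_per_bank) {
        max_distinct = std::max(max_distinct, entry.second.size());
    }

    // Quoted rule: add 1 to that maximum.
    return static_cast<int>(max_distinct) + 1;
}
```

Under these assumptions, a quarter-warp reading consecutive `float4` elements (offsets 0, 16, ..., 112) gives 2, matching the "2-way" statement in the quote, while a stride of 128 bytes puts all eight accesses in the same bank and gives 9.
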