In section F.4.3.2 of the CUDA Programming Guide v4.2, it says:
Question 1: Is “float” a typo? Shouldn’t it be “double”?
Question 2: Does it imply that a 64-bit memory access request is processed per half-warp rather than for the entire warp? Otherwise, accesses to, say, shared[0] and shared[16] by threads 0 and 16 would incur a bank conflict, right?
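
To make the arithmetic behind question 2 concrete, here is a minimal host-side sketch of the bank mapping. It assumes the compute capability 2.x layout (32 banks, each 4 bytes wide, with successive 32-bit words assigned to successive banks). The names `kNumBanks`, `kBankWidthBytes`, `bankOfByte` and `firstBank` are made up for illustration and are not part of any CUDA API.

```cpp
#include <cstdint>

// Assumed bank model (compute capability 2.x): 32 banks, each 4 bytes wide,
// with successive 32-bit words mapped to successive banks.
constexpr std::uint32_t kNumBanks = 32;
constexpr std::uint32_t kBankWidthBytes = 4;

// Bank that holds the 32-bit word containing byte offset `byteOffset`.
constexpr std::uint32_t bankOfByte(std::uint32_t byteOffset) {
    return (byteOffset / kBankWidthBytes) % kNumBanks;
}

// First bank touched by element `index` of a shared array whose elements
// are `elemSize` bytes wide (4 for float, 8 for double). An 8-byte element
// also touches the next bank, since it spans two 32-bit words.
constexpr std::uint32_t firstBank(std::uint32_t index, std::uint32_t elemSize) {
    return bankOfByte(index * elemSize);
}
```

Under this model, `firstBank(0, 8)` and `firstBank(16, 8)` both evaluate to 0, so the doubles at shared[0] and shared[16] start in the same bank, whereas with 4-byte elements `firstBank(16, 4)` is 16. That is the overlap question 2 is asking about.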