Shared memory doesn’t have cache lines at all.
There’s a big difference between cache lines and banks. Cache lines are groups of 128 contiguous bytes aligned to 128-byte boundaries. GF100 copies values from device memory to L2, and from L2 to L1, in these 128-byte chunks.
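To make the 128-byte granularity concrete, here is a small host-side sketch (plain Python, not GPU code) that lists which cache lines a set of byte addresses falls into. The function name `cache_lines_touched` and the example address patterns are made up for illustration; only the 128-byte line size comes from the description above.

```python
LINE_SIZE = 128  # bytes per cache line, as described above


def cache_lines_touched(byte_addresses, line_size=LINE_SIZE):
    """Return the sorted base addresses of every cache line the accesses fall in."""
    return sorted({(addr // line_size) * line_size for addr in byte_addresses})


# Hypothetical pattern: 32 threads reading consecutive 4-byte words from an aligned base.
aligned = [0 + 4 * t for t in range(32)]
# The same pattern shifted by 64 bytes straddles a line boundary.
shifted = [64 + 4 * t for t in range(32)]

print(cache_lines_touched(aligned))  # [0]
print(cache_lines_touched(shifted))  # [0, 128]
```

The aligned pattern fits in one line, while the shifted one needs two lines to be moved even though the same number of bytes is read.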
Banks are unrelated to cache lines. GF100 shared/L1 has 32 banks, and the bank is selected by the low-order bits of the word address: bank = (byte address / 4) mod 32. At each instruction tick, each bank can output one 4-byte word to a thread that requests a word mapping to that bank. If two threads request different words that map to the same bank, the bank can only service one of them, and another instruction tick is needed to service the next, so the accesses are serialized. (There’s an exception when threads request the identical address: that single word is broadcast to all of them.)
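The bank mapping is easy to model on the host. The sketch below is plain Python, not GPU code, and it is a simplified model rather than a description of the actual hardware: it assumes 32 banks of 4-byte words as described above, and it assumes that any number of threads reading the same word are served together by the broadcast. The names `bank_of`, `conflict_degree`, and `strided` are made up for illustration.

```python
from collections import defaultdict

NUM_BANKS = 32  # GF100 shared/L1 bank count, as described above
WORD_SIZE = 4   # bytes per bank word


def bank_of(byte_address):
    """Bank index: the low-order bits of the word address."""
    return (byte_address // WORD_SIZE) % NUM_BANKS


def conflict_degree(byte_addresses):
    """Worst-case number of distinct words requested from any single bank.

    1 means conflict-free; N means the access is serialized into N passes.
    Identical words count once (assumed broadcast).
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // WORD_SIZE
        words_per_bank[word % NUM_BANKS].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


def strided(stride_words):
    """Hypothetical warp pattern: thread t reads word t * stride_words."""
    return [t * stride_words * WORD_SIZE for t in range(32)]


print(conflict_degree(strided(1)))   # 1: every thread hits its own bank
print(conflict_degree(strided(2)))   # 2: two distinct words per bank
print(conflict_degree(strided(32)))  # 32: all threads hit bank 0
print(conflict_degree([64] * 32))    # 1: same word for everyone, broadcast
```

Note that the cache line never enters the calculation: only the word address modulo 32 matters.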
Your quote from the programming manual discusses a different topic: multi-word accesses by a single thread. L1 behaves just like shared memory in this case, with the same unavoidable bank conflicts.