The limitations can be directly inferred from the statements already provided. If, for a single warp-issued instruction, two (or more) threads in a warp access different 32-bit locations in the same bank, that results in a bank conflict, regardless of the access width (8-bit, 16-bit, or 32-bit).
However, there is the broadcast rule. Two or more threads accessing the same location do not generate a bank conflict. This is also true regardless of the access width (8-bit, 16-bit, or 32-bit), and it even holds if, for example, different 8-bit locations within the same 32-bit word are being accessed by different threads.
Coupled with that, as already stated, the maximum bandwidth of shared memory (I believe, post-Kepler) is one 32-bit quantity/location per bank, per cycle (per SM). Accessing 16-bit or 8-bit quantities per thread will necessarily reduce the maximum achievable bandwidth to one-half (16-bit) or one-quarter (8-bit) of that peak.
There are also other chip-specific factors which may impact whether the full bandwidth of 128 bytes per access (32 threads in a warp times 32 bits per thread) can be achieved.
I’m not sure what you mean. 8-bit or 16-bit access is supported in a fashion similar to 32-bit, as already covered in my comments above. At some point, I cannot proceed any further with questions of the form “why is it this way?” I have given a behavioral description. Beyond that, I will eventually end up at the answer “because that is the way the GPU designers chose to design it”.