There is no concept of coalescing applied to a single thread. Coalescing refers to address grouping behavior considered warp-wide.
The maximum amount of consecutive data that can be loaded by a single thread in a single instruction is 16 bytes. I usually refer to larger-than-8-byte loads as vector loads, because they require using a vector type (e.g. int4, double2, etc.) to achieve (referring to CUDA C++, anyway).
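As a sketch of what such a 16-byte vector load looks like (the kernel name and shapes here are illustrative, not from anything above), each thread loads one int4, i.e. 16 bytes, in a single instruction. The pointer must be 16-byte aligned for this to be legal; pointers returned by cudaMalloc satisfy this.

```cuda
// Hypothetical example: copy data using 16-byte vector loads/stores.
// Each thread moves one int4 (16 bytes) per load instruction.
__global__ void vec_copy(const int4 *__restrict__ in,
                         int4 *__restrict__ out, size_t n4)
{
    size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (idx < n4)             // n4 = number of int4 elements (int count / 4)
        out[idx] = in[idx];   // a single 16-byte load and 16-byte store
}
```

On the host side you would typically reinterpret_cast an int pointer to int4* and divide the element count by 4; if the total length is not a multiple of 4, the remainder needs a scalar tail loop.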
It’s generally not wise to think of operations using only a single thread in CUDA. Even in the case of a 16-byte “vector load”, there are still efficiencies to be gained doing this warp-wide, and using the data pulled in by an entire warp (hopefully in a coalesced fashion).
Constructing isolated operations to be performed by a single thread is generally not an efficient use of the machine, which is why you commonly see the “cooperative load and shared memory hassle” or other approaches to make use of operations warp-wide, at a minimum.
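A minimal sketch of that cooperative pattern (names and the tile size are my own, purely illustrative): every thread in the block performs one coalesced load into shared memory, and afterwards any thread can consume data that its neighbors pulled in.

```cuda
// Illustrative sketch: warp-/block-wide cooperative load into shared memory.
// Consecutive threads read consecutive addresses, so the global loads coalesce.
#define TILE 256   // assumed block size, chosen for illustration

__global__ void staged_sum(const float *__restrict__ in,
                           float *__restrict__ out, size_t n)
{
    __shared__ float tile[TILE];
    size_t gidx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;

    // Coalesced: thread i of the block loads element i of this block's tile.
    tile[threadIdx.x] = (gidx < n) ? in[gidx] : 0.0f;
    __syncthreads();

    // Any thread may now use data loaded by the others; as a trivial
    // demonstration, thread 0 consumes the whole staged tile.
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < TILE; ++i)
            s += tile[i];
        out[blockIdx.x] = s;
    }
}
```

The point is the load phase: even though only one thread consumes the data here, the data was brought in by the whole block in a coalesced fashion, rather than by one thread issuing many isolated loads.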