Curious about memory loads into cores

When vectors are loaded into the CUDA cores, does the CUDA hardware load/fill an entire block in one clock cycle? If not, how is it done: sequentially, one datum at a time?
I know those aren't the exact CUDA terms, but I hope the question comes across. :)

"CUDA Cores" is the marketing term for the number of FP32 thread instructions (as opposed to warp instructions) that the SM can execute per cycle. The name and implementation of this execution unit vary per chip. On Volta - Turing it is called the FMA unit. The FMA unit accepts as input 1-3 scalar float operands from the register file (or constant bank) and writes out 1 float operand per thread (e.g. FADD, FFMA, FMUL). On Volta - Turing the FMA unit has a warp instruction issue rate of 1 warp instruction/cycle/SM sub-partition == 16 threads/cycle/SM sub-partition.
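To make the "thread instruction" idea concrete, here is a minimal sketch of my own (not from the answer above): each thread in the kernel performs one fused multiply-add, which the compiler typically lowers to a single FFMA thread instruction, i.e. exactly the kind of instruction the FMA ("CUDA core") units execute.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical example kernel: one fused multiply-add per thread.
// fmaf(a, b, c) computes a*b + c and usually compiles to one FFMA
// instruction, issued warp-wide to the SM sub-partition's FMA units.
__global__ void ffma_kernel(const float *a, const float *b,
                            const float *c, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = fmaf(a[i], b[i], c[i]);  // FADD/FMUL/FFMA-class work
    }
}
```

You can confirm the generated instruction mix for your own kernels by disassembling the compiled binary (e.g. with `cuobjdump -sass`).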

If you are asking about loading/storing data from memory, then those operations are performed per warp, not per thread block (a block is 1-32 warps on most architectures).

In one machine I have a 1050 Ti, and in another a 2070 Super. I have no clue which family those belong to (Volta, Turing, or whatever). The 1050 Ti is on Debian Linux; for the 2070 Super I was forced to switch that machine to Ubuntu 20.x just to get the GPU recognized. I also use an old GTX 960, but that machine is down at the moment.
And my question is primarily about loads from board memory into the individual threads' registers.