What performance difference do you observe using scalar loads vs vector loads? As long as the loads are properly coalesced, I would expect the difference in throughput to be minimal or even within measurement noise level (±%) on modern hardware.
If you look at the linked blog entry from almost a decade ago, even then the difference in throughput seems to have been at most about 15%. As I recall, that effect was mostly due to the limited size of the load/store queue.
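
If you want to quantify the difference on your own hardware, a simple copy microbenchmark along the following lines is usually enough. This is only a minimal sketch, not the code from the blog entry: the kernel names, buffer size, launch configuration, and repetition count are all my own assumptions and should be tuned for your GPU. It assumes the element count is a multiple of 4 and relies on `cudaMalloc` returning suitably aligned pointers for `float4` access.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Scalar version: each thread moves one 32-bit float per loop iteration.
// Adjacent threads touch adjacent addresses, so accesses are coalesced.
__global__ void copy_scalar(const float* __restrict__ in,
                            float* __restrict__ out, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = in[i];
}

// Vector version: each thread moves one 128-bit float4 per loop iteration.
// Assumption: n4 = n / 4 with n a multiple of 4, pointers 16-byte aligned.
__global__ void copy_vec4(const float4* __restrict__ in,
                          float4* __restrict__ out, size_t n4)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n4; i += stride)
        out[i] = in[i];
}

// Average time per launch in milliseconds, measured with CUDA events.
template <typename Launch>
float time_ms(Launch launch, int reps)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    launch();                       // warm-up launch, not timed
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main()
{
    const size_t n = (size_t)1 << 26;   // 64M floats = 256 MiB per buffer (arbitrary choice)
    const int threads = 256;            // arbitrary launch configuration
    const int blocks = 1024;
    const int reps = 20;

    float *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&out, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return EXIT_FAILURE;
    }
    cudaMemset(in, 0, n * sizeof(float));

    float ms_scalar = time_ms([&] {
        copy_scalar<<<blocks, threads>>>(in, out, n);
    }, reps);
    float ms_vec4 = time_ms([&] {
        copy_vec4<<<blocks, threads>>>(reinterpret_cast<const float4*>(in),
                                       reinterpret_cast<float4*>(out), n / 4);
    }, reps);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    // Each copy reads n floats and writes n floats.
    const double bytes = 2.0 * n * sizeof(float);
    printf("scalar: %.3f ms (%.1f GB/s)\n", ms_scalar, bytes / ms_scalar * 1e-6);
    printf("float4: %.3f ms (%.1f GB/s)\n", ms_vec4, bytes / ms_vec4 * 1e-6);

    cudaFree(in);
    cudaFree(out);
    return EXIT_SUCCESS;
}
```

Compile with something like `nvcc -O3 -o loadtest loadtest.cu` (the file name is just a placeholder), run it several times, and compare the two reported bandwidths against the run-to-run variation before drawing conclusions.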