Batch load to utilize bandwidth instead of vector load

Hi,

[1] shows how vectorized loads can be used to better utilize memory bandwidth. You get ld.global.v4 in the PTX, doing 4 loads in a single instruction, and you have to make sure the addresses are properly aligned (16 bytes for a 128-bit load).

I see that ld.global has many qualifiers. Do any of them help reach the same bandwidth utilization (of course I would then have 4 separate ld.global instructions)?

[1] https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
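For reference, a minimal sketch of the vectorized-load pattern the blog post describes (kernel name and launch parameters are illustrative): a copy kernel that loads through `float4`, which the compiler emits as ld.global.v4.f32, assuming the pointers are 16-byte aligned and n is a multiple of 4.

```cuda
#include <cuda_runtime.h>

// Copy kernel using 128-bit vector loads/stores.
// Reinterpreting float* as float4* makes the compiler emit
// ld.global.v4.f32 / st.global.v4.f32, provided the pointers are
// 16-byte aligned (cudaMalloc-returned pointers are) and n is a
// multiple of 4 (simplifying assumption; a general version would
// handle the remainder with a scalar tail loop).
__global__ void copy_vec4(const float *__restrict__ src,
                          float *__restrict__ dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    const float4 *src4 = reinterpret_cast<const float4 *>(src);
    float4 *dst4 = reinterpret_cast<float4 *>(dst);
    if (i < n / 4) {
        dst4[i] = src4[i];  // one 16-byte load + one 16-byte store
    }
}
```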

What performance difference do you observe between scalar loads and vector loads? As long as the loads are properly coalesced, I would expect the difference in throughput to be minimal, possibly within measurement noise (a few percent), on modern hardware.

If you look at the linked blog post from almost a decade ago, even then the difference in throughput was at most about 15%. As I recall, that effect was mostly due to the limited size of the load/store queue.
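For comparison, a scalar version of the same copy (an illustrative sketch, not from the blog post): each thread copies one float per iteration of a grid-stride loop, so consecutive threads touch consecutive addresses and every ld.global.f32 is fully coalesced.

```cuda
#include <cuda_runtime.h>

// Scalar copy: four separate ld.global.f32 per 16 bytes instead of
// one ld.global.v4.f32, but each warp still reads a contiguous,
// aligned 128-byte segment, so the accesses coalesce into the same
// memory transactions as the vectorized kernel.
__global__ void copy_scalar(const float *__restrict__ src,
                            float *__restrict__ dst, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        dst[i] = src[i];  // coalesced scalar load and store
    }
}
```

Timing both kernels on a large buffer (e.g. with Nsight Compute or cudaEvent timers) is the quickest way to see whether the gap matters on your particular GPU.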