Batch load to utilize bandwidth instead of vector load


[1] shows a vector-load solution that utilizes the bandwidth. In PTX you get one instruction for 4 loads, and you have to make sure that the addresses are aligned.
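For reference, here is a minimal sketch of what I mean by vector loads, not the exact code from [1]. The names `copy_scalar`, `copy_vec4`, `is_aligned16` and `vec4_copy_matches` are made up for illustration. It assumes `float` data; a `float4` access is typically compiled to a single 128-bit load such as `ld.global.v4.f32`, and it needs 16-byte-aligned addresses, so the host checks alignment and falls back to the scalar kernel otherwise.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Scalar copy: one 4-byte load per element (grid-stride loop).
__global__ void copy_scalar(const float* __restrict__ in, float* __restrict__ out, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = in[i];
}

// Vector copy: one 16-byte load per iteration, then a scalar loop for the tail.
// Both pointers must be 16-byte aligned.
__global__ void copy_vec4(const float* __restrict__ in, float* __restrict__ out, size_t n) {
    size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    size_t n4 = n / 4;  // number of complete float4 groups
    const float4* in4 = reinterpret_cast<const float4*>(in);
    float4* out4 = reinterpret_cast<float4*>(out);
    for (size_t i = tid; i < n4; i += stride)
        out4[i] = in4[i];
    for (size_t i = 4 * n4 + tid; i < n; i += stride)  // leftover 0..3 elements
        out[i] = in[i];
}

// Hypothetical helper: true if p can be accessed as float4.
bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: copies n floats on the device and compares with the input.
bool vec4_copy_matches(size_t n) {
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (size_t i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    if (is_aligned16(d_in) && is_aligned16(d_out))
        copy_vec4<<<64, 256>>>(d_in, d_out, n);
    else
        copy_scalar<<<64, 256>>>(d_in, d_out, n);  // fallback for unaligned pointers

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    bool ok = (cudaGetLastError() == cudaSuccess) && (h_in == h_out);
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Calling `vec4_copy_matches(1003)` from `main` exercises both the vector path and the scalar tail. Pointers straight from `cudaMalloc` are aligned to at least 256 bytes, so the fallback only matters when you pass in an offset pointer.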

I see that has many traits. Do any of them help to get the same bandwidth utilization (ofc I will have 4 of ?


What performance difference do you observe using scalar loads vs vector loads? As long as the loads are properly coalesced, I would expect the difference in throughput to be minimal or even within measurement noise level (±%) on modern hardware.
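If you want to put a number on it, a rough timing harness along these lines should do. This is only a sketch under my own assumptions: `copy_kernel` and `copy_gbps` are hypothetical names, the launch configuration is arbitrary, and the buffers come straight from `cudaMalloc`, so they are suitably aligned for `float4`. It times a plain copy with CUDA events and reports read-plus-write throughput in GB/s.

```cuda
#include <cuda_runtime.h>

// Grid-stride copy; T = float gives scalar loads, T = float4 gives 16-byte vector loads.
template <typename T>
__global__ void copy_kernel(const T* __restrict__ in, T* __restrict__ out, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = in[i];
}

// Hypothetical helper: returns effective copy throughput in GB/s, or -1.0 on error.
// `bytes` should be a multiple of sizeof(T).
template <typename T>
double copy_gbps(size_t bytes, int reps) {
    size_t n = bytes / sizeof(T);
    T *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, bytes) != cudaSuccess) return -1.0;
    if (cudaMalloc(&out, bytes) != cudaSuccess) { cudaFree(in); return -1.0; }
    cudaMemset(in, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copy_kernel<T><<<1024, 256>>>(in, out, n);  // warm-up launch, not timed

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        copy_kernel<T><<<1024, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = (cudaGetLastError() == cudaSuccess) && ms > 0.0f;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);

    // Each rep reads `bytes` and writes `bytes`; ms * 1e6 converts bytes/ms to GB/s.
    return ok ? 2.0 * (double)bytes * reps / ((double)ms * 1e6) : -1.0;
}
```

Compare `copy_gbps<float>(bytes, reps)` against `copy_gbps<float4>(bytes, reps)` with a buffer much larger than the L2 cache, and repeat the run a few times to get a feel for the noise before reading anything into the gap.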

If you look at the linked blog entry from almost a decade ago, even then the difference in throughput seems to have been at most about 15%. That effect was mostly due to the limited size of the load/store queue, as I recall.