I understand that in the absence of coalesced accesses, i.e. with a scattered access pattern, issuing 32-byte memory transactions instead of 128-byte ones should improve effective bandwidth (though I don't know by how much). We have a code that does precisely that, scattered accesses over an array, and we found that issuing 32-byte transactions actually *increases* the time to process the array. The only explanation I can find is that the accesses are not scattered widely enough, so some fraction of them still benefits from the 128-byte transactions. I wonder whether that fraction can compensate for the wasted bytes that the 128-byte transactions fetch.
Are there any aspects I'm overlooking that could explain this behaviour? For example, given that the stride is always greater than 128 bytes, would 32-byte transactions always reduce the time to access the array? By what percentage? Naively, with 4-byte elements and a stride over 128 bytes, a 128-byte transaction delivers at most 4 useful bytes (~3% efficiency) versus 4/32 = 12.5% for a 32-byte transaction, so I'd expect roughly 4x less traffic.
Also, I'd be glad to hear suggestions on how to measure this; I have some ideas I'd try, but I'm not an experienced CUDA programmer. :)
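One of the ideas I'd try is a minimal strided-read microbenchmark timed with CUDA events, compiled once normally and once with `-Xptxas -dlcm=cg` (which, on architectures where that flag applies, makes global loads bypass L1 so they are serviced as 32-byte L2 transactions). This is just a sketch with made-up names and sizes (`N`, `STRIDE`, `stridedRead`), not our actual code:

```cuda
// Strided-read microbenchmark sketch. Each thread reads one element at a
// stride of 256 bytes, so every access touches a different 128-byte line.
#include <cstdio>
#include <cuda_runtime.h>

#define N      (1 << 20)   // elements actually read
#define STRIDE 64          // 64 floats = 256 bytes > 128 bytes

__global__ void stridedRead(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(size_t)i * stride];  // scattered: one line per access
}

int main()
{
    float *in, *out;
    cudaMalloc(&in,  (size_t)N * STRIDE * sizeof(float));  // ~256 MB
    cudaMalloc(&out, (size_t)N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 block(256), grid((N + 255) / 256);
    stridedRead<<<grid, block>>>(in, out, N, STRIDE);  // warm-up run
    cudaEventRecord(start);
    stridedRead<<<grid, block>>>(in, out, N, STRIDE);  // timed run
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

I'd compile it both ways (`nvcc bench.cu` vs `nvcc -Xptxas -dlcm=cg bench.cu`) and compare the times, and also cross-check with the profiler (e.g. nvprof's `gld_transactions` and `gld_efficiency` metrics, if I understand them correctly) to confirm the transaction size actually changed. Does that sound like a reasonable approach?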