I would expect the two cases to perform the same, unless there is some sort of hidden cache sitting somewhere… (the 0xFFFF mask confines accesses to a smaller memory range, so it would get a better hit rate if a cache sits in between)
Consider the memory read sub-system. I would imagine it has 2 channels:
one for read requests and one for write requests.
The following could be possible:
1. A write request can directly serve pending read requests to the same address, without the read going to global memory.
2. Pending read requests can be fulfilled from the “fetch buffer” that has just finished fetching data for another read request.
If the second condition holds for NVIDIA hardware, then the 0xFFFF case may run faster than the 0xFFFFF case because of locality of reference between read requests.
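To make the cache hypothesis from the top of this post concrete, here is a toy model in C++. It is only a sketch: it simulates a direct-mapped cache on the host, and `hit_rate`, the line count, and the line size are all made-up names and numbers, not actual NVIDIA hardware parameters. The index stream is a fixed-seed pseudo-random sequence, which is also an assumption about the access pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy direct-mapped cache model (hypothetical, not real GPU behaviour).
// Returns the fraction of accesses that hit when indices are drawn
// pseudo-randomly and restricted with `index & mask`.
double hit_rate(std::uint32_t mask, std::size_t cache_lines,
                std::size_t elems_per_line, std::size_t accesses) {
    assert(cache_lines > 0 && elems_per_line > 0 && accesses > 0);
    const std::uint64_t kEmpty = ~0ull;              // marks an unused slot
    std::vector<std::uint64_t> tags(cache_lines, kEmpty);
    std::uint32_t state = 12345u;                    // fixed seed, repeatable runs
    std::size_t hits = 0;
    for (std::size_t i = 0; i < accesses; ++i) {
        state = state * 1664525u + 1013904223u;      // simple 32-bit LCG
        std::uint32_t index = (state >> 8) & mask;   // drop the weak low bits
        std::uint64_t line = index / elems_per_line; // which cache line is touched
        std::size_t slot = static_cast<std::size_t>(line % cache_lines);
        if (tags[slot] == line) {
            ++hits;                                  // line already resident
        } else {
            tags[slot] = line;                       // miss: replace the resident line
        }
    }
    return static_cast<double>(hits) / static_cast<double>(accesses);
}
```

Calling `hit_rate(0xFFFF, 4096, 8, 1 << 20)` and `hit_rate(0xFFFFF, 4096, 8, 1 << 20)` compares the two masks under the same assumed cache. In this model the smaller range maps fewer distinct lines onto each slot, so it should hit more often. Whether anything like this happens on the actual hardware is exactly the open question.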
What’s the element size? If I am doing the math correctly, the 0xFFFFF case accesses a range of 1M elements. Assuming 4-byte elements, that’s only 4 MB. I would have leaned towards thinking about TLB performance, but 4 MB is a very small data range and therefore unlikely to cause TLB issues. What is the size of the memory region indexed by all threads collectively?
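For reference, the arithmetic above can be written out as a small C++ sketch. The helper names `elements_in_range` and `footprint_bytes` are made up for illustration, and the 4-byte element size is still only an assumption until the actual type is known.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of distinct indices reachable through `index & mask`,
// assuming the mask is a contiguous run of low bits (2^k - 1).
inline std::uint64_t elements_in_range(std::uint64_t mask) {
    assert((mask & (mask + 1)) == 0 && "mask must have the form 2^k - 1");
    return mask + 1;
}

// Bytes covered if every index in the masked range is touched.
// elem_size is an assumption; the thread has not stated it yet.
inline std::uint64_t footprint_bytes(std::uint64_t mask, std::size_t elem_size) {
    return elements_in_range(mask) * elem_size;
}
```

With 4-byte elements, `footprint_bytes(0xFFFF, 4)` is 262144 bytes (256 KiB) and `footprint_bytes(0xFFFFF, 4)` is 4194304 bytes (4 MiB), so the two cases differ by a factor of 16 in footprint.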