Global memory access

Here’s one simple question about global memory access performance.

Given two kernels which differ only in one line:

0: t = Table[ (X >> 12) & 0xFFFFF ];

1: t = Table[ (X >> 12) & 0xFFFF ];

and otherwise identical (memory access pattern, invocation domain, etc.), will they differ in performance, at least in theory?

I would expect them to perform the same, unless there is some sort of hidden cache sitting somewhere… (the 0xFFFF mask limits the accessed memory range, so it would get a better hit rate from any cache sitting in between)
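For reference, the two lookups can be sketched in plain C on the host side. This is only an illustration of the index arithmetic, not the original kernel: the helper names are made up, and X is assumed to be a 32-bit unsigned value, which the question does not state.

```c
#include <stdint.h>

/* Masks from the two kernel variants in the question. */
#define MASK_VARIANT_0 0xFFFFFu /* keeps 20 index bits */
#define MASK_VARIANT_1 0xFFFFu  /* keeps 16 index bits */

/* Hypothetical helper: the table index produced by X under a given mask.
 * Assumption: X is a 32-bit unsigned value. In that case the shift already
 * leaves only 20 bits, so MASK_VARIANT_0 keeps every surviving bit. */
static inline uint32_t table_index(uint32_t x, uint32_t mask)
{
    return (x >> 12) & mask;
}

/* Hypothetical helper: the lookup itself, t = Table[(X >> 12) & mask]. */
static inline uint32_t table_lookup(const uint32_t *table, uint32_t x,
                                    uint32_t mask)
{
    return table[table_index(x, mask)];
}
```

The only difference between the variants is how many distinct indices can occur: up to 0xFFFFF + 1 in variant 0 and up to 0xFFFF + 1 in variant 1.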


Consider the memory subsystem. I would imagine it has two channels: one for read requests and one for write requests.

The following could be possible:

  1. A write request can directly serve pending read requests for the same data, without the read going to global memory.

  2. Pending read requests can be fulfilled from the “fetch buffer” that has just finished fetching data for another read request.

If the second condition is true about NVIDIA hardware – then, the 0xFFFF case may work faster than 0xFFFFF case because of some locality of reference between read requests.
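To put a rough number on that locality argument, here is a toy model in C. It is purely hypothetical: it assumes the hardware fetches fixed-size segments, that reads are uniformly random over the table, and that elements are 4 bytes. The segment size is a parameter you would have to pick yourself; none of this is a documented property of the hardware.

```c
#include <stddef.h>

/* Number of fixed-size segments needed to cover a table (rounded up). */
static size_t segment_count(size_t table_bytes, size_t segment_bytes)
{
    return (table_bytes + segment_bytes - 1) / segment_bytes;
}

/* Toy model: probability that one uniformly random read lands in one
 * particular segment, for example the segment that was just fetched. */
static double same_segment_probability(size_t table_bytes,
                                       size_t segment_bytes)
{
    return 1.0 / (double)segment_count(table_bytes, segment_bytes);
}
```

With an assumed 64-byte segment, the 0xFFFF table (256 KiB at 4 bytes per element) spans 4096 segments and the 0xFFFFF table (4 MiB) spans 65536. Under this model a single pending read is 16 times more likely to match a just-fetched segment in the smaller table, although the chance is small in both cases.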

The programming manual says global memory isn’t cached.
To my understanding, both should have similar performance.

Tests show that this holds on most cards, but on the 9800GTX the 0xFFFFF variant causes a very strange performance drop.

What’s the element size? If I am doing the math correctly, the 0xFFFFF case accesses a range of 1M elements. Assuming 4-byte elements, that’s only 4MB. I would have leaned towards thinking about TLB performance, but 4MB is a very small data range and therefore unlikely to cause TLB issues. What is the size of the memory region indexed by all threads collectively?


Yes, in one case the table is 1M DWORDs and in the other it is 64K DWORDs. All threads access only this table (and there’s no access pattern; all accesses are random).
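The footprint arithmetic for the two tables can be checked with a few lines of C. This is a minimal sketch with a made-up helper name; it assumes the mask has the form 2^n - 1 and that a DWORD is 4 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes spanned by a table of 4-byte DWORDs that is indexed through `mask`.
 * Assumption: `mask` has the form 2^n - 1, so it admits mask + 1 indices. */
static size_t table_footprint_bytes(uint32_t mask)
{
    return ((size_t)mask + 1u) * sizeof(uint32_t);
}
```

For 0xFFFFF this gives 4194304 bytes (4 MiB) and for 0xFFFF it gives 262144 bytes (256 KiB), so the larger table covers 16 times the address range.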

The funny part is that this shows up ONLY on 9-series cards. 8-series and 200-series cards are OK, and the lookup takes equal time in both cases.

If you’re interested, contact the tech guys from NVIDIA Russia; I sent a repro to them today. Or PM me your e-mail and I will send a more detailed description and the repro.

When you say 8 series, do you mean G80, or are you including the G92-based 8800 GTS in that?

I’ve tried on an 8800GTX (G80) and an 8800GT (not sure if it’s G92; it has 112 SPs and 512 MB of memory). With the 8800GT, performance differs a little (maybe ~5%), but with the 9800GTX the difference is huge.

I was not talking about a cache either. I was just talking about a smart mechanism sitting at the memory-queue level…