Inquiry About GPU On-Device-TLB Behavior

Hello,

I’m running some experiments on an H100 with large device allocations (cudaMalloc) plus copies to the device, and I’m trying to understand the overhead of page accesses (i.e. the page fault overhead). My assumption was that the first access to a page might be more expensive due to page faults, but that if the page is touched before, subsequent accesses should be faster.

To test this, I touch the pages in advance (using simple global loads, both cached and non-cached) before running my main compute kernel; for a matrix, for example, I stride through it page-wise. I expected this warm-up step to reduce kernel execution time because the pages would already be present in the TLB, but I see no improvement.
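
For reference, the warm-up looks roughly like this (a simplified sketch rather than my exact code; the 2 MiB stride reflects my page-size assumption, and __ldcg is one of the load flavors I tried):

```cpp
// Touch one element per assumed 2 MiB page so that the page's translation is
// (hopefully) resident in the TLB before the main kernel runs.
__global__ void touch_pages(const float* __restrict__ data, size_t num_elems,
                            float* sink)
{
    const size_t elems_per_page = (2ull << 20) / sizeof(float);  // assumed 2 MiB pages
    size_t idx = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) * elems_per_page;
    if (idx < num_elems) {
        float v = __ldcg(&data[idx]);      // L1-bypassing load; I also tried plain loads
        if (v == 1234567.0f) *sink = v;    // keep the load from being optimized away
    }
}

// Launched with one thread per page, e.g.:
//   size_t pages = (num_bytes + (2ull << 20) - 1) / (2ull << 20);
//   touch_pages<<<(pages + 127) / 128, 128>>>(d_matrix, num_elems, d_sink);
```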

Details:

  • Memory is device-only (no UVM).

  • Assumed 2 MB page granularity (I read that the GPU prefers 2 MB pages).

  • Tried both cached and non-cached loads for warming up.

Is there something about how the H100 handles page translation or caching that would explain why warming up the pages doesn’t affect kernel performance? Am I doing something wrong in how I try to cache the pages in the TLB?

Any insights would be much appreciated.

Thanks!

What do you mean by page fault?

With virtual memory, a page fault normally means that a page was swapped out and has to be loaded again, similar to CUDA’s managed memory.

But you are using cudaMalloc, so after allocation the complete memory block is resident in physical device memory.

Or are you talking about the TLB - translation lookaside buffer - for translating virtual addresses to physical addresses? Does it have a practical performance impact on Nvidia architectures? Aside from synthetic micro-benchmarks?

Or are you talking about the pages being loaded into the L2 cache in a warm-up phase after copying? Then what is your allocation size vs. the L2 size? The L2 does not work at page granularity, but with 32-byte sectors within 128-byte cache lines.
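
For reference, you can query the L2 size directly and compare it with your allocation size (a minimal sketch using the CUDA runtime API):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int l2_bytes = 0;
    // Query the device's L2 cache size in bytes.
    cudaDeviceGetAttribute(&l2_bytes, cudaDevAttrL2CacheSize, /*device=*/0);
    printf("L2 size: %.1f MiB\n", l2_bytes / (1024.0 * 1024.0));
    // If the allocation is much larger than this, a warm-up pass cannot keep
    // the data resident in L2 for the following kernel anyway.
    return 0;
}
```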

Hi,

I am talking about the translation lookaside buffer. I was thinking that if I touch the pages before they are needed then they would be cached in the TLB of the device. I am not talking about Unified Memory.

However, when I try that, I do not see any performance improvement.

A TLB only caches the translation of the address, not the pages or data itself.

If you want to accelerate the address translation, how large a performance gain do you expect?

Hi Curefab,

Yes, exactly, and that would mean the translation overhead is reduced, right?

However, my memory-bound compute kernel sees no performance improvement even if I cache the translations in advance…

Have you checked your allocation size against the TLB coverage?

Some tests (Turing):

https://arxiv.org/pdf/1903.07486

The advantage would probably only be better latency, not better throughput. And latency can be hidden with well-optimized kernels.

Hi,

Yes, I have made my own benchmarks of the TLB capacities and checked the bounds, etc.

Why would it not increase DRAM throughput if the address translations are pre-fetched into the TLB?

  • A GPU is a latency hiding machine. In a bulk benchmark (e.g. not pointer chasing), perhaps this is just another latency to be hidden. It could conceivably have a small latency impact on the first few accesses, but if that is the only effect, the difference might be in the measurement noise.
  • Perhaps the actual cudaMalloc operation itself populates the TLB. If nothing else evicts it from the cache, the TLB may already have the entries necessary, whether you “pre-fetch” or not.

Why would it increase throughput?

The TLB entries would have to be read in addition to the data, but they are few and small compared to the data. It also depends on your access pattern: they would be more relevant if you did only a single transaction per page altogether. Normally, however, each page is used for many transactions, so TLB fetches make up only a small fraction of your memory bandwidth.

Uneven access latencies (some accesses hitting in the TLB, some missing) are smoothed out over the parallel requests of all the CUDA threads.

Why would the effect be strong, or at least noticeable, in your case (access pattern), and how large an improvement would you expect from TLB prefetching (0.001%? 0.1%? 10%)? You still have not provided numbers such as current performance or allocation size.

My assumption was that the first access to a page might be more expensive due to page faults, but that if the page is touched before, subsequent accesses should be faster.

Is your current test program measuring only the first access to a page, or full matrix operations afterwards? If you only optimize the very first access, perhaps your kernel is 40 ns faster overall. Do you have enough runs to see that speed improvement above the noise?
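
To make the “enough runs” point concrete, here is a minimal, self-contained sketch of such a measurement (the kernel is only a placeholder for your GEMV; all names are made up):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the kernel under test (your GEMV, with or without warm-up).
__global__ void kernel_under_test(float* buf, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main() {
    const size_t n = 1 << 24;
    float* buf = nullptr;
    cudaMalloc(&buf, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int runs = 1000;
    float total_ms = 0.0f;
    for (int i = 0; i < runs; ++i) {
        cudaEventRecord(start);
        kernel_under_test<<<(unsigned)((n + 255) / 256), 256>>>(buf, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        total_ms += ms;   // average over many runs to separate a tiny effect from noise
    }
    printf("avg kernel time: %.3f us over %d runs\n",
           1000.0f * total_ms / runs, runs);

    cudaFree(buf);
    return 0;
}
```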

Hi @Robert_Crovella

I was thinking that perhaps cudaMalloc would already populate the TLB. However, I have also tried filling the TLB with junk pages by first running the compute kernels on junk data, to thrash all the useful entries, yet the warm-up still yields no improvement in execution time.

Is there something I am missing? Is there a specific kind of load I need to issue to make sure the page touch has an effect on the TLB?

Hi @Curefab

Thanks for your response. Regarding the access pattern, it is a row-major GEMV.

Furthermore, I meant that I would fetch the translations for all matrix pages into the TLB via a page touch before running the GEMV kernel.
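
For context, the kernel has roughly this shape (a simplified illustration rather than my exact implementation; one thread per row of the row-major matrix):

```cpp
// y = A * x with A stored row-major (rows x cols). Each thread walks one
// contiguous row of A, while different threads cover different rows in parallel.
__global__ void gemv_row_major(const float* __restrict__ A,
                               const float* __restrict__ x,
                               float* __restrict__ y,
                               int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;
    for (int col = 0; col < cols; ++col)
        acc += A[(size_t)row * cols + col] * x[col];
    y[row] = acc;
}
```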

Thanks for your help and insightful discussion

GEMV accesses the whole matrix once.

Let’s say reading one TLB entry costs 16 bytes of page-table traffic for each 2 MiB page of data (2 MiB / 16 B = 131,072).

Then your best-case speedup from saved bandwidth would be 131,073 : 131,072, i.e. about 0.0008%.

Does Nsight Compute show differences in the number of bytes read for the two variants? Possibly you would set the appropriate flags not to flush the cache before profiling and to profile a group of commands instead of a single kernel at a time.

That is very insightful, thank you. Do you think you could tell me the flags for profiling in this case?

Also, regarding the TLB entry estimate… what about page table walks? Isn’t the page table walk latency much higher? Wouldn’t the penalty of missing in the TLB and doing a page table walk to find the address translation be higher?

Once again, thank you very much!

About the profiling flags

Have a look at Application and Range Replay and at Cache Control in the Nsight Compute documentation.
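
In the code, the range can be marked with the profiler start/stop API, roughly as sketched below (touch_pages and gemv_row_major refer to the sketches earlier in the thread and the launch configurations are placeholders; on the ncu command line this would be combined with options along the lines of --replay-mode range and --cache-control none, but please check the current documentation for the exact names):

```cpp
#include <cuda_profiler_api.h>

// Mark the warm-up touch and the GEMV as one range, so that Nsight Compute's
// range replay measures them together and, with cache control disabled, does
// not flush the caches in between.
cudaProfilerStart();
touch_pages<<<touch_grid, touch_block>>>(d_A, num_elems, d_sink);      // warm-up sketch from above
gemv_row_major<<<gemv_grid, gemv_block>>>(d_A, d_x, d_y, rows, cols);  // GEMV sketch from above
cudaProfilerStop();
```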

The page table walk comes into play if you go from one pointer to the next, jumping across pages, like in a linked list. Then all the latencies add up.

GEMV, by contrast, has a single pointer to the array, then accesses only data (not pointers), and can do so in parallel across threads (and partly within the same thread).

But when we talk about PTWs and GEMV, the data can span multiple pages, correct? As an example, let’s say our matrix data spans 500 pages. Doesn’t that mean that if each of them is a TLB miss, we would have 500 page table walks?

Yes, but they would happen in parallel. A classical pointer chase (e.g. for benchmarking, or because the algorithm requires it) introduces dependencies, where one read has to finish before the next address is known. That serializes the accesses, so the latency determines the kernel duration.
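
To illustrate the contrast, here is a minimal sketch of such a dependent chain (illustrative kernel, not from this thread): every load’s address depends on the previous load’s result, so TLB-miss and page-table-walk latencies are paid in series, whereas the GEMV threads issue independent loads whose misses can overlap.

```cpp
// Pointer chase: the address of the next load is only known once the previous
// load has completed, so translation misses cannot be overlapped.
__global__ void pointer_chase(const unsigned long long* __restrict__ next,
                              unsigned long long start, size_t steps,
                              unsigned long long* out)
{
    unsigned long long idx = start;
    for (size_t i = 0; i < steps; ++i)
        idx = next[idx];        // dependent load: latencies add up serially
    *out = idx;                 // prevent the chain from being optimized away
}
```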