Inquiry About GPU On-Device-TLB Behavior

Hello,

I’m running some experiments on an H100 with large device allocations (cudaMalloc) plus copies to the device, and I’m trying to understand the overhead of page accesses (i.e. the page fault overhead). My assumption was that the first access to a page might be more expensive due to page faults, but that if the page is touched before, subsequent accesses should be faster.

To test this, I touch the pages in advance (using simple global loads, both cached and non-cached) before running my main compute kernel; for a matrix, for example, I stride through it page-wise. I expected this warm-up step to reduce kernel execution time because the pages would already be present in the TLB, but I see no improvement.
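
For reference, the warm-up looks roughly like this (a simplified sketch rather than my exact code; the 2 MiB stride reflects my page-size assumption, and __ldcg is one of the load flavors I tried):

```cpp
// Touch one element per assumed 2 MiB page so that the page's translation is
// (hopefully) resident in the TLB before the main kernel runs.
__global__ void touch_pages(const float* __restrict__ data, size_t num_elems,
                            float* sink)
{
    const size_t elems_per_page = (2ull << 20) / sizeof(float);  // assumed 2 MiB pages
    size_t idx = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) * elems_per_page;
    if (idx < num_elems) {
        float v = __ldcg(&data[idx]);      // L1-bypassing load; I also tried plain loads
        if (v == 1234567.0f) *sink = v;    // keep the load from being optimized away
    }
}

// Launched with one thread per page, e.g.:
//   size_t pages = (num_bytes + (2ull << 20) - 1) / (2ull << 20);
//   touch_pages<<<(pages + 127) / 128, 128>>>(d_matrix, num_elems, d_sink);
```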

Details:

  • Memory is device-only (no UVM).

  • Assumed 2 MB page granularity (I read that the GPU prefers 2 MB pages).

  • Tried both cached and non-cached loads for warming up.

Is there something about how the H100 handles page translation or caching that would explain why warming up the pages doesn’t affect kernel performance? Am I doing something wrong in how I try to cache the pages in the TLB?

Any insights would be much appreciated.

Thanks!

What do you mean by page fault?

With virtual memory, a page fault normally means that a page was swapped out and has to be loaded again, similar to CUDA’s managed memory.

But you are using cudaMalloc, so after allocation the complete memory block is resident in physical device memory.

Or are you talking about the TLB - translation lookaside buffer - for translating virtual addresses to physical addresses? Does it have a practical performance impact on Nvidia architectures? Aside from synthetic micro-benchmarks?

Or are you talking about the pages being loaded into the L2 cache in a warm-up phase after copying? Then what is your allocation size vs. the L2 size? The L2 does not work at page granularity, but with 32-byte sectors within 128-byte cache lines.
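
For reference, you can query the L2 size directly and compare it with your allocation size (a minimal sketch using the CUDA runtime API):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int l2_bytes = 0;
    // Query the device's L2 cache size in bytes.
    cudaDeviceGetAttribute(&l2_bytes, cudaDevAttrL2CacheSize, /*device=*/0);
    printf("L2 size: %.1f MiB\n", l2_bytes / (1024.0 * 1024.0));
    // If the allocation is much larger than this, a warm-up pass cannot keep
    // the data resident in L2 for the following kernel anyway.
    return 0;
}
```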

Hi,

I am talking about the translation lookaside buffer. I was thinking that if I touch the pages before they are needed then they would be cached in the TLB of the device. I am not talking about Unified Memory.

However, when I try that, I do not see any performance improvement.

A TLB only caches the translation of the address, not the pages or data itself.

If you want to accelerate the address translation, how large a performance gain do you expect?

Hi Curefab,

Yes, exactly, and that would mean the translation overhead is reduced, right?

However, my memory-bound compute kernel sees no performance improvement even if I cache the translations in advance…

Have you checked your allocation size against the TLB coverage?

Some tests (Turing):

https://arxiv.org/pdf/1903.07486

The advantage would probably only be better latency, not better throughput. And latency can be hidden with well-optimized kernels.

Hi,

Yes, I have made my own benchmarks of the TLB capacities and checked the bounds, etc.

Why would it not increase DRAM throughput if the address translations are pre-fetched into the TLB?

  • A GPU is a latency hiding machine. In a bulk benchmark (e.g. not pointer chasing), perhaps this is just another latency to be hidden. It could conceivably have a small latency impact on the first few accesses, but if that is the only effect, the difference might be in the measurement noise.
  • Perhaps the actual cudaMalloc operation itself populates the TLB. If nothing else evicts it from the cache, the TLB may already have the entries necessary, whether you “pre-fetch” or not.

Why would it increase throughput?

The TLB entries would have to be read in addition to the data, but they are few and small compared to the data. It also depends on your access pattern: they would be more relevant if you did only a single transaction per page altogether. Normally, however, each page is used for many transactions, so TLB fetches make up only a small fraction of your memory bandwidth.

Uneven access latencies (some accesses hitting in the TLB, some missing) are smoothed out over the parallel requests of all the CUDA threads.

Why would the effect be strong, or at least noticeable, in your case (access pattern), and how large an improvement would you expect from TLB prefetching (0.001%? 0.1%? 10%)? You still have not provided numbers such as current performance or allocation size.

My assumption was that the first access to a page might be more expensive due to page faults, but that if the page is touched before, subsequent accesses should be faster.

Is your current test program measuring only the first access to a page, or full matrix operations afterwards? If you only optimize the very first access, perhaps your kernel is 40 ns faster overall. Do you have enough runs to see that speed improvement above the noise?
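
To make the “enough runs” point concrete, here is a minimal, self-contained sketch of such a measurement (the kernel is only a placeholder for your GEMV; all names are made up):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the kernel under test (your GEMV, with or without warm-up).
__global__ void kernel_under_test(float* buf, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main() {
    const size_t n = 1 << 24;
    float* buf = nullptr;
    cudaMalloc(&buf, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int runs = 1000;
    float total_ms = 0.0f;
    for (int i = 0; i < runs; ++i) {
        cudaEventRecord(start);
        kernel_under_test<<<(unsigned)((n + 255) / 256), 256>>>(buf, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        total_ms += ms;   // average over many runs to separate a tiny effect from noise
    }
    printf("avg kernel time: %.3f us over %d runs\n",
           1000.0f * total_ms / runs, runs);

    cudaFree(buf);
    return 0;
}
```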

Hi @Robert_Crovella

I was thinking that perhaps cudaMalloc would already populate the TLB. However, I have also tried filling the TLB with junk pages by first running the compute kernels on junk data, to thrash all the useful entries, yet the warm-up still yields no improvement in execution time.

Is there something I am missing? Is there a specific kind of load I need to issue to make sure the page touch has an effect on the TLB?

Hi @Curefab

Thanks for your response. Regarding the access pattern, it is a row-major GEMV.

Furthermore, I meant that I would fetch the translations for all matrix pages into the TLB via a page touch before running the GEMV kernel.
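
For context, the kernel has roughly this shape (a simplified illustration rather than my exact implementation; one thread per row of the row-major matrix):

```cpp
// y = A * x with A stored row-major (rows x cols). Each thread walks one
// contiguous row of A, while different threads cover different rows in parallel.
__global__ void gemv_row_major(const float* __restrict__ A,
                               const float* __restrict__ x,
                               float* __restrict__ y,
                               int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;
    for (int col = 0; col < cols; ++col)
        acc += A[(size_t)row * cols + col] * x[col];
    y[row] = acc;
}
```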

Thanks for your help and insightful discussion

GEMV accesses the whole matrix once.

Let’s say reading one TLB entry costs 16 bytes of page-table traffic for each 2 MiB page of data (2 MiB / 16 B = 131,072).

Then your best-case speedup from saved bandwidth would be 131,073 : 131,072, i.e. about 0.0008%.

Does Nsight Compute show differences in the number of bytes read for the two variants? Possibly you would set the appropriate flags not to flush the cache before profiling and to profile a group of commands instead of a single kernel at a time.

That is very insightful, thank you. Do you think you could tell me the flags for profiling in this case?

Also, regarding the TLB entry estimate… what about page table walks? Isn’t the page table walk latency much higher? Wouldn’t the penalty of missing in the TLB and doing a page table walk to find the address translation be higher?

Once again, thank you very much!

About the profiling flags

Have a look at Application and Range Replay and at Cache Control in the Nsight Compute documentation.
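
In the code, the range can be marked with the profiler start/stop API, roughly as sketched below (touch_pages and gemv_row_major refer to the sketches earlier in the thread and the launch configurations are placeholders; on the ncu command line this would be combined with options along the lines of --replay-mode range and --cache-control none, but please check the current documentation for the exact names):

```cpp
#include <cuda_profiler_api.h>

// Mark the warm-up touch and the GEMV as one range, so that Nsight Compute's
// range replay measures them together and, with cache control disabled, does
// not flush the caches in between.
cudaProfilerStart();
touch_pages<<<touch_grid, touch_block>>>(d_A, num_elems, d_sink);      // warm-up sketch from above
gemv_row_major<<<gemv_grid, gemv_block>>>(d_A, d_x, d_y, rows, cols);  // GEMV sketch from above
cudaProfilerStop();
```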

The page table walk comes into play if you go from one pointer to the next, jumping across pages, like in a linked list. Then all the latencies add up.

GEMV, by contrast, has a single pointer to the array, then accesses only data (not pointers), and can do so in parallel across threads (and partly within the same thread).

But when we talk about PTWs and GEMV, the data can span multiple pages, correct? As an example, let’s say our matrix data spans 500 pages. Doesn’t that mean that if each of them is a TLB miss, we would have 500 page table walks?

Yes, but they would happen in parallel. A classical pointer chase (e.g. for benchmarking, or because the algorithm requires it) introduces dependencies, where one read has to finish before the next address is known. That serializes the accesses, so the latency determines the kernel duration.
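
To illustrate the contrast, here is a minimal sketch of such a dependent chain (illustrative kernel, not from this thread): every load’s address depends on the previous load’s result, so TLB-miss and page-table-walk latencies are paid in series, whereas the GEMV threads issue independent loads whose misses can overlap.

```cpp
// Pointer chase: the address of the next load is only known once the previous
// load has completed, so translation misses cannot be overlapped.
__global__ void pointer_chase(const unsigned long long* __restrict__ next,
                              unsigned long long start, size_t steps,
                              unsigned long long* out)
{
    unsigned long long idx = start;
    for (size_t i = 0; i < steps; ++i)
        idx = next[idx];        // dependent load: latencies add up serially
    *out = idx;                 // prevent the chain from being optimized away
}
```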