I’m running some experiments on an H100 with large device allocations (cudaMalloc) and memory copies to the device, and I’m trying to understand how page access overhead works (i.e., the page fault overhead). My assumption was that the first access to a page might be more expensive due to page faults, but that if the page is touched before, subsequent accesses should be faster.
To test this, I tried touching pages in advance (using simple global loads, both cached and non-cached) before running my main compute kernel. As an example, for a matrix I would stride through it page-wise (a sketch of the warm-up is below the details). I expected this warm-up step to reduce kernel execution time because the pages would already be present in the TLB, but I don’t see any improvement.
Details:
Memory is device-only (no UVM).
Assumed 2 MB page granularity (I’ve read that the GPU prefers 2 MB pages).
Tried both cached and non-cached loads for warming up.
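Roughly, the warm-up step looks like this (a minimal sketch, not my exact code; it assumes 2 MiB pages and a float array):

```cpp
// Warm-up sketch: touch one element in every (assumed) 2 MiB page of the
// allocation before launching the real compute kernel.
__global__ void touch_pages(const float* __restrict__ data, size_t n_elems)
{
    const size_t elems_per_page = (2ull << 20) / sizeof(float); // assumes 2 MiB pages
    size_t page = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t idx  = page * elems_per_page;
    if (idx < n_elems) {
        volatile float v = data[idx];  // volatile so the load is not optimized away
        (void)v;
    }
}

// One thread per page:
// size_t n_pages = (n_elems * sizeof(float) + (2ull << 20) - 1) / (2ull << 20);
// touch_pages<<<(n_pages + 255) / 256, 256>>>(d_matrix, n_elems);
```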
Is there something about how the H100 handles page translation or caching that would explain why warming up pages doesn’t affect kernel performance? Am I doing something wrong in how I’m trying to cache pages in the TLB?
With virtual memory, a page fault normally means that a page was swapped out and has to be loaded again, similar to CUDA’s managed memory.
But you are using cudaMalloc, so the complete memory block is backed by physical device memory right after allocation.
Or are you talking about the TLB - translation lookaside buffer - for translating virtual addresses to physical addresses? Does it even have a practical performance impact on Nvidia architectures, aside from synthetic micro-benchmarks?
Or are you talking about the pages being loaded into the L2 cache in a warm-up phase after copying? Then what is your allocation size vs. the L2 size? The L2 does not work at page granularity, but at 32-byte or 128-byte granularity.
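For reference, you can query the L2 size directly and compare it with your allocation size (small sketch):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query the device L2 cache size (an H100 should report roughly 50 MB).
    int dev = 0, l2_bytes = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&l2_bytes, cudaDevAttrL2CacheSize, dev);
    std::printf("L2 cache size: %d bytes\n", l2_bytes);
    return 0;
}
```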
I am talking about the translation lookaside buffer. I was thinking that if I touch the pages before they are needed then they would be cached in the TLB of the device. I am not talking about Unified Memory.
However, when I try that, I am not seeing any performance improvement.
A GPU is a latency hiding machine. In a bulk benchmark (e.g. not pointer chasing), perhaps this is just another latency to be hidden. It could conceivably have a small latency impact on the first few accesses, but if that is the only effect, the difference might be in the measurement noise.
Perhaps the cudaMalloc operation itself already populates the TLB. If nothing evicts those entries, the TLB may already hold the necessary translations, whether you “pre-fetch” or not.
The TLB entries have to be read in addition to the data, but they are few and small compared to the data. It also depends on your access pattern: they would matter more if you read only one transaction per page in total. But normally each page is used for many more transactions, so TLB fetches make up only a small fraction of your memory bandwidth.
Uneven access latencies (some accesses hit in the TLB, some miss) are smoothed out over the parallel requests of all the CUDA threads.
Why would the effect be strong, or at least noticeable, in your case (access pattern), and how large an improvement would you expect from TLB prefetching (0.001%? 0.1%? 10%?)? You still have not provided numbers such as current performance or allocation size.
My assumption was that the first access to a page might be more expensive due to page faults, but that if the page is touched before, subsequent accesses should be faster.
Is your current test program measuring only the first access to a page, or is it doing full matrix operations afterwards? If you only optimize the very first access, perhaps your kernel is 40 ns faster overall? Do you have enough runs to see that speed improvement above the noise?
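A sketch of the kind of repeated timing you would need to resolve an effect that small above launch and timer jitter (my_kernel, d_data, and n are placeholders):

```cpp
// Time the same launch many times and average; a tens-of-nanoseconds effect
// will not be visible in a single cudaEvent measurement.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

dim3 grid(1024), block(256);
float total_ms = 0.0f;
const int runs = 1000;
for (int r = 0; r < runs; ++r) {
    cudaEventRecord(start);
    my_kernel<<<grid, block>>>(d_data, n);   // placeholder kernel and arguments
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    total_ms += ms;
}
printf("average kernel time: %.6f ms over %d runs\n", total_ms / runs, runs);
```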
I was thinking that perhaps cudaMalloc would already populate the TLB with those pages, so I also tried thrashing it: I first ran the compute kernels on junk data to evict all the useful entries, yet it still yields no improvement in execution time.
Is there something I am missing? Is there a specific kind of load I need to issue to make sure the page touch actually has an effect on the TLB?
Let’s say one TLB entry is 16 bytes for each 2 MiB.
Then your speedup from the saved bandwidth would be 131072:131073, or about 0.0008%.
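Spelled out, with the assumed 16 bytes of translation data per 2 MiB page:

$$\frac{2\,\mathrm{MiB}}{16\,\mathrm{B}} = \frac{2097152}{16} = 131072, \qquad \frac{16\,\mathrm{B}}{2\,\mathrm{MiB}} = \frac{16}{2097152} \approx 0.0008\%$$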
Does Nsight Compute show differences in the number of bytes read for the two variants? You would probably have to set the appropriate flags so that the caches are not flushed before profiling, and so that a group of commands is profiled instead of a single kernel at a time.
That is very insightful, thank you. Do you think you could tell me the flags for profiling in this case?
Also, regarding the TLB entry estimate: what about page table walks? Isn’t the page table walk latency much higher? Wouldn’t the penalty of missing in the TLB and doing a page table walk to find the address translation be higher?
The page table walk latency comes into play if you go from one pointer to the next, jumping across pages, like in a linked list. Then all the latencies add up.
Whereas a GEMV has a single pointer to the array and then accesses only data (not pointers), and it can do so in parallel between threads (and partly within the same thread).
But when we talk about PTWs and GEMV, the data can span multiple pages, correct? As an example, let’s say our matrix data spans 500 pages. Doesn’t that mean that, if each of them is a page fault, we would have 500 PTWs?
Yes, but they would happen in parallel. The classical linked-list walk (e.g. used for benchmarking, or forced by the algorithm) introduces dependencies, where one read has to finish before the next address is known. That serializes the accesses, so the latency determines the kernel duration.
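To illustrate the difference (hypothetical kernels, just a sketch): in the chase kernel each load’s address depends on the previous load, so every miss and page table walk sits on the critical path; in the bulk kernel the addresses are independent and the latencies overlap across threads.

```cpp
// Serialized accesses: the next address is only known after the previous
// load completes, so latencies (including any page table walks) add up.
__global__ void chase(const size_t* __restrict__ next, size_t start,
                      size_t steps, size_t* out)
{
    size_t idx = start;
    for (size_t i = 0; i < steps; ++i)
        idx = next[idx];
    *out = idx;
}

// Independent accesses (GEMV-like): all addresses are known up front, so the
// hardware keeps many loads and translations in flight at the same time.
__global__ void bulk_sum(const float* __restrict__ data, size_t n, float* out)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (; i < n; i += (size_t)gridDim.x * blockDim.x)
        acc += data[i];
    atomicAdd(out, acc);
}
```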