Hello, I have a question regarding the result of GPU page faults.
I see that only one row of GPU page-fault results is shown in the documentation (User Guide — nsight-systems 2024.1 documentation).
But the result I got from an RTX 2080 Ti with driver 535.113.01 shows two rows, as shown in the image below.
Hi yiwkd2, I don’t believe this has anything to do with extra buffers, and especially not separate ones for reads and writes: in your screenshot, one case has a Read and a Write overlapping, while another has two Reads overlapping. I think the taller row is just a GUI technique. Nsight Systems uses an automatic-height row for GPU Page Faults, so it is the height of a normal row when no ranges overlap, but when ranges do overlap it grows to the minimum height needed so that no range covers any part of another.
As to why some of the ranges overlap in your report and others don’t, I think this comes down to how your CUDA kernel is accessing the unified-memory buffers. When any thread on the GPU accesses a page in a unified-memory buffer that is not currently located on the GPU, the instruction causes a page fault, and the unified-memory driver has to migrate that page from wherever it is (system memory or a different GPU’s video memory) to the GPU trying to access it. In general, with a CUDA kernel that has many threads touching nearby memory regions, the GPU is likely to encounter faults from many threads at around the same time, and it benefits from batching those faults into larger copies when the regions happen to be adjacent pages. In some cases, though, faults occur on widely separated ranges, so they can’t all be batched together and produce separate, simultaneous events. I see in the screenshot that the two overlapping Read faults are at addresses that differ by 0xA0000, or 640 KiB. If you’re only accessing buffers that are a few KiB in size, 640 KiB apart is far enough that it’s reasonable for the GPU to handle these separately. So I don’t think you’re seeing anything weird here — just multiple fault events starting at the same time, which the unified-memory driver services and finishes at different times.
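To make the batching behavior concrete, here is a hypothetical repro sketch (not your code — the kernel names, sizes, and the 1024-element stride are all made up for illustration). The contiguous kernel faults on adjacent pages, which the driver can coalesce into a few large migrations; the strided kernel faults on pages far apart, which is the kind of pattern that can produce multiple overlapping fault events in the timeline. Profiling this with `nsys profile` and comparing the two kernels’ GPU Page Faults rows should show the difference:

```cuda
#include <cuda_runtime.h>

// Contiguous access: neighboring threads fault on adjacent pages, so the
// unified-memory driver can batch the faults into a few large migrations.
__global__ void touch_contiguous(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

// Strided access: threads fault on pages far apart, so the faults cannot
// all be batched together and may appear as separate overlapping events.
__global__ void touch_strided(float *buf, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) buf[i] += 1.0f;
}

int main() {
    const int n = 1 << 24;                       // 16M floats = 64 MiB
    float *buf;
    cudaMallocManaged(&buf, n * sizeof(float));
    for (int i = 0; i < n; ++i) buf[i] = 0.0f;   // first touch: pages on CPU

    touch_contiguous<<<(n + 255) / 256, 256>>>(buf, n);
    cudaDeviceSynchronize();

    // Move the pages back to the CPU so the second kernel also faults.
    cudaMemPrefetchAsync(buf, n * sizeof(float), cudaCpuDeviceId);
    cudaDeviceSynchronize();

    touch_strided<<<256, 256>>>(buf, n, 1024);
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}
```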
Note that if you are trying to improve performance with managed memory, you could use a prefetch API call (cudaMemPrefetchAsync) to move the data to the GPU all at once, with a single large DMA transfer, before launching a kernel that will use that memory. You might also try manual memory management: allocate pinned host-memory buffers and use cudaMemcpyAsync to copy between them and device (GPU) buffers — that should give the fastest copies. Using CUDA streams or CUDA graphs lets you pipeline these asynchronous operations, minimizing downtime between them; it also allows the GPU to overlap unrelated work such as copies and kernels, and lets the CPU do other things while waiting for the GPU to finish.
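Both approaches can be sketched roughly as follows (a minimal illustration, assuming a made-up `scale` kernel and device 0; error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    int device = 0;
    cudaSetDevice(device);

    // --- Option 1: managed memory with an explicit prefetch ---
    float *managed;
    cudaMallocManaged(&managed, bytes);
    for (int i = 0; i < n; ++i) managed[i] = 1.0f;

    // Migrate the whole buffer to the GPU in one large DMA transfer,
    // instead of letting the kernel fault it over page by page.
    cudaMemPrefetchAsync(managed, bytes, device);
    scale<<<(n + 255) / 256, 256>>>(managed, n);
    cudaDeviceSynchronize();
    cudaFree(managed);

    // --- Option 2: pinned host memory + async copies on a stream ---
    float *host, *dev;
    cudaMallocHost(&host, bytes);            // pinned (page-locked) host buffer
    cudaMalloc(&dev, bytes);
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("host[0] = %f\n", host[0]);

    cudaStreamDestroy(stream);
    cudaFreeHost(host);
    cudaFree(dev);
    return 0;
}
```

With option 2, issuing the copies and the kernel on the same stream keeps them ordered while leaving the CPU free; splitting a large buffer into chunks on multiple streams would additionally let copies overlap with kernel execution.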