Hello, I have a question regarding the result of GPU page faults.
I see that only one row of GPU page-fault results is shown in the documentation (User Guide — nsight-systems 2024.1 documentation).
But the result I got from an RTX 2080 Ti with driver 535.113.01 shows two rows, as shown in the image below.
Hi yiwkd2, I don’t believe this has anything to do with extra buffers, and especially not separate ones for reads and writes: in your screenshot, one case has a Read and a Write overlapping, while another has two Reads overlapping. I think the taller row is just a GUI technique. Nsight Systems uses an automatic-height row for GPU Page Faults, so it is the height of a normal row when no ranges overlap, but when ranges do overlap it grows to the minimum height needed so that no range covers any part of another.
As to why some of the ranges overlap in your report and others don’t, I think this comes down to how your CUDA kernel is accessing the unified-memory buffers. When any thread on the GPU accesses a page in a unified-memory buffer that is not currently located on the GPU, the instruction causes a page fault, and the unified-memory driver has to migrate that page from wherever it is (system memory or a different GPU’s video memory) to the GPU trying to access it. In general, with a CUDA kernel that has many threads touching nearby memory regions, the GPU is likely to encounter faults from many threads at around the same time, and it benefits from batching those faults into larger copies when the regions happen to be adjacent pages. In some cases, though, faults occur on widely separated ranges, so they can’t all be batched together and produce separate, simultaneous events. I see in the screenshot that the two overlapping Read faults are at addresses that differ by 0xA0000, or 640 KiB. If you’re only accessing buffers that are a few KiB in size, 640 KiB apart is far enough that it’s reasonable for the GPU to handle these separately. So I don’t think you’re seeing anything weird here — just multiple fault events starting at the same time, which the unified-memory driver services and finishes at different times.
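To make the batching behavior concrete, here is a hypothetical repro sketch (not your code — the kernel names, sizes, and the 1024-element stride are all made up for illustration). The contiguous kernel faults on adjacent pages, which the driver can coalesce into a few large migrations; the strided kernel faults on pages far apart, which is the kind of pattern that can produce multiple overlapping fault events in the timeline. Profiling this with `nsys profile` and comparing the two kernels’ GPU Page Faults rows should show the difference:

```cuda
#include <cuda_runtime.h>

// Contiguous access: neighboring threads fault on adjacent pages, so the
// unified-memory driver can batch the faults into a few large migrations.
__global__ void touch_contiguous(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

// Strided access: threads fault on pages far apart, so the faults cannot
// all be batched together and may appear as separate overlapping events.
__global__ void touch_strided(float *buf, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) buf[i] += 1.0f;
}

int main() {
    const int n = 1 << 24;                       // 16M floats = 64 MiB
    float *buf;
    cudaMallocManaged(&buf, n * sizeof(float));
    for (int i = 0; i < n; ++i) buf[i] = 0.0f;   // first touch: pages on CPU

    touch_contiguous<<<(n + 255) / 256, 256>>>(buf, n);
    cudaDeviceSynchronize();

    // Move the pages back to the CPU so the second kernel also faults.
    cudaMemPrefetchAsync(buf, n * sizeof(float), cudaCpuDeviceId);
    cudaDeviceSynchronize();

    touch_strided<<<256, 256>>>(buf, n, 1024);
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}
```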
Note that if you are trying to improve performance with managed memory, you could use a prefetch API call (cudaMemPrefetchAsync) to move the data to the GPU all at once, with a single large DMA transfer, before launching a kernel that will use that memory. You might also try manual memory management: allocate pinned host-memory buffers and use cudaMemcpyAsync to copy between them and device (GPU) buffers — that should give the fastest copies. Using CUDA streams or CUDA graphs lets you pipeline these asynchronous operations, minimizing downtime between them; it also allows the GPU to overlap unrelated work such as copies and kernels, and lets the CPU do other things while waiting for the GPU to finish.
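Both approaches can be sketched roughly as follows (a minimal illustration, assuming a made-up `scale` kernel and device 0; error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    int device = 0;
    cudaSetDevice(device);

    // --- Option 1: managed memory with an explicit prefetch ---
    float *managed;
    cudaMallocManaged(&managed, bytes);
    for (int i = 0; i < n; ++i) managed[i] = 1.0f;

    // Migrate the whole buffer to the GPU in one large DMA transfer,
    // instead of letting the kernel fault it over page by page.
    cudaMemPrefetchAsync(managed, bytes, device);
    scale<<<(n + 255) / 256, 256>>>(managed, n);
    cudaDeviceSynchronize();
    cudaFree(managed);

    // --- Option 2: pinned host memory + async copies on a stream ---
    float *host, *dev;
    cudaMallocHost(&host, bytes);            // pinned (page-locked) host buffer
    cudaMalloc(&dev, bytes);
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("host[0] = %f\n", host[0]);

    cudaStreamDestroy(stream);
    cudaFreeHost(host);
    cudaFree(dev);
    return 0;
}
```

With option 2, issuing the copies and the kernel on the same stream keeps them ordered while leaving the CPU free; splitting a large buffer into chunks on multiple streams would additionally let copies overlap with kernel execution.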