(GPT aided me in the question below)
I’m encountering a segmentation fault when reading data into a large array Refl (200 GB) that I allocate with cudaMallocHost. This issue started after I introduced an additional device array (d_Fop_in_nwfft) in my code. I use multiple GPUs and cap each GPU’s memory usage with a parameter perc_gpu, currently set to 95%.
I understand that cudaMallocHost allocates page-locked (pinned) memory on the host, but I’ve read that it can also indirectly put pressure on device memory due to internal driver allocations for mapping and DMA buffers. So I suspect that this extra pressure might be pushing my GPU memory usage over the edge now that I’ve introduced the new d_Fop_in_nwfft buffer.
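For context, this is roughly how the perc_gpu cap is applied per device (a simplified sketch of the idea, not my actual code; the real version also accounts for the batch size):

```cpp
// Simplified sketch (not the actual code): derive a per-device memory budget
// as a fraction (perc_gpu) of the currently free device memory.
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    const double perc_gpu = 0.95;  // cap: use at most 95% of free device memory
    int n_dev = 0;
    cudaGetDeviceCount(&n_dev);
    for (int d = 0; d < n_dev; ++d) {
        size_t free_b = 0, total_b = 0;
        cudaSetDevice(d);
        cudaMemGetInfo(&free_b, &total_b);
        size_t budget = (size_t)(perc_gpu * (double)free_b);
        printf("GPU %d: %zu MiB free, budget %zu MiB\n",
               d, free_b >> 20, budget >> 20);
        // Device arrays (including d_Fop_in_nwfft) are sized so that their
        // total stays under 'budget'.
    }
    return 0;
}
```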
❓ My main question is:
Is there a reliable way to track or estimate how much memory cudaMallocHost uses from the GPU (or pinned-memory pool), so I can adjust perc_gpu accordingly and avoid exceeding the limit?
Secondary questions:
Are there profiling tools or API calls that expose this memory usage clearly?
Should I conservatively lower perc_gpu to, say, 85–90%, when using large pinned allocations?
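In case it helps, this is how I would try to measure it myself (a minimal sketch; I'm assuming that if pinned host allocations consumed device memory, the difference would show up in cudaMemGetInfo, which may well be a wrong assumption):

```cpp
// Minimal sketch: compare reported free device memory before and after a
// large cudaMallocHost allocation. The 8 GiB size is just for a quick test;
// the real Refl array is ~200 GB.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA(call)                                                     \
    do {                                                                     \
        cudaError_t err_ = (call);                                           \
        if (err_ != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                      \
                    cudaGetErrorString(err_), __FILE__, __LINE__);           \
            exit(EXIT_FAILURE);                                              \
        }                                                                    \
    } while (0)

int main(void) {
    size_t free_before = 0, free_after = 0, total = 0;
    CHECK_CUDA(cudaSetDevice(0));
    CHECK_CUDA(cudaMemGetInfo(&free_before, &total));

    void *pinned = NULL;
    size_t bytes = 8ull * 1024 * 1024 * 1024;  // 8 GiB pinned host buffer
    CHECK_CUDA(cudaMallocHost(&pinned, bytes));

    CHECK_CUDA(cudaMemGetInfo(&free_after, &total));
    printf("free before: %zu MiB, after: %zu MiB, delta: %lld MiB\n",
           free_before >> 20, free_after >> 20,
           ((long long)free_before - (long long)free_after) / (1024 * 1024));

    CHECK_CUDA(cudaFreeHost(pinned));
    return 0;
}
```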
Thanks in advance for any suggestions or clarification. My suspicion of cudaMallocHost is only a guess. In short, what happened was: after adding some extra device arrays (keeping the 95% cap and reducing the batch size), the code started to segfault when reading a big dataset into a cudaMallocHost-allocated array.
Citation needed. News to me. Have you performed some basic experiments that demonstrate this effect?
Segfault is triggered by code running on the host, so I do not see a ready connection with device arrays. Likely causes:
(1) A failing host-side memory allocation that isn’t caught (improper error handling)
(2) An insufficiently sized host-side memory allocation leading to out-of-bounds access
(3) Out-of-bounds access on a host-side memory allocation, either as part of an incorrectly sized bulk transfer, a faulty array index computation, or an invalid pointer (including a null pointer)
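For example, a minimal sketch of guarding against (1) and (3) could look like this (Refl is the array name from the question; the element counts and the other names are placeholders):

```cpp
// Sketch: check the pinned host allocation (1) and make sure the subsequent
// read cannot write past the end of the buffer (2)/(3). 'n_elems' and
// 'read_count' are placeholder names, not from the original code.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(void) {
    size_t n_elems = 1000000;  // placeholder: number of elements in Refl
    float *Refl = NULL;

    cudaError_t err = cudaMallocHost((void **)&Refl, n_elems * sizeof(float));
    if (err != cudaSuccess) {
        // (1) A failed pinned allocation returns an error; using Refl anyway
        // would dereference a null/invalid pointer and segfault on the host.
        fprintf(stderr, "cudaMallocHost failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    // (2)/(3) Guard the read: never deliver more elements than were allocated.
    size_t read_count = 1000000;  // placeholder: elements the reader delivers
    if (read_count > n_elems) {
        fprintf(stderr, "read of %zu elements exceeds buffer of %zu\n",
                read_count, n_elems);
        cudaFreeHost(Refl);
        return EXIT_FAILURE;
    }
    // ... e.g. fread(Refl, sizeof(float), read_count, fp); ...

    cudaFreeHost(Refl);
    return 0;
}
```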
@njuffa Thanks for pointing me in the right direction; indeed it had nothing to do with cudaMallocHost or the reading. The segfault was unrelated; GPT had pointed me in that direction.
“Citation needed. News to me. Have you performed some basic experiments that demonstrate this effect?” When I pushed GPT on that, it found NVIDIA documentation saying the exact opposite. It even claimed to have tested it (as below), providing a full answer that contradicted its first guess.
In hindsight, trusting GPT’s direction and help when writing the question was a bad call; simply describing the problem without the extra speculation would have been the right approach. There is no evidence that a huge array (~200 GB) allocated with cudaMallocHost affects the GPU’s memory. The segmentation fault was caused by another detail in the latest code modifications and had nothing to do with the newly added device array. @striker159 I ended up not using Valgrind because of the size of the datasets, but checking for host-side leaks was the correct way to solve the problem.