I have an application that uses host-pinned memory (only) to collect some kernel-internal information, akin to a performance profiling tool. The tool can be activated and deactivated; when deactivated, no host-pinned memory is allocated. In an edge case, a kernel that consumed almost all device memory (leaving only a few MBs free) was launched. If I run this workload with the tool deactivated, it works. If I activate the tool, it OOMs when trying to allocate a huge in-device array (which happens after the host-pinned memory is allocated).
The question is: does host-pinned memory have any hidden cost in terms of device memory? I am thinking of bookkeeping or some other similar cost.
There may be a “hidden cost” in terms of device page table entries. I don’t know the exact structure of the device page table but I know it exists.
This is interesting, thank you. I assume there must be some internal data structures to perform tasks like the ones you suggest; the question is how big they are and how they scale, i.e., fixed size or dependent on some dimension. Do you have an idea of where to look for this type of documentation?
Here is one datapoint. It’s really not conclusive due to CUDA lazy allocation. You don’t really know what is happening under the hood. But it is at least consistent with the idea that a host-pinned allocation may require some device memory:
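A minimal sketch of how such a datapoint can be collected, assuming the CUDA runtime API on a single device (the 1 GiB size is just an illustrative choice): force context creation first so lazy initialization doesn't pollute the measurement, then compare `cudaMemGetInfo` before and after a `cudaHostAlloc`.

```cuda
// Sketch: measure device memory consumed as a side effect of a
// host-pinned allocation. Requires a CUDA-capable GPU.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Force context creation up front so lazy initialization
    // does not show up in the before/after delta.
    cudaFree(0);

    size_t freeBefore = 0, freeAfter = 0, total = 0;
    cudaMemGetInfo(&freeBefore, &total);

    void *pinned = nullptr;
    size_t bytes = 1ull << 30;  // 1 GiB of host-pinned memory (arbitrary)
    if (cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault) != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed\n");
        return 1;
    }

    cudaMemGetInfo(&freeAfter, &total);
    printf("device free before: %zu MiB, after: %zu MiB, delta: %zu KiB\n",
           freeBefore >> 20, freeAfter >> 20,
           (freeBefore - freeAfter) >> 10);

    cudaFreeHost(pinned);
    return 0;
}
```

As noted above, a nonzero delta is only *consistent with* the pinned allocation costing device memory; lazy allocation inside the driver means the delta could also come from other deferred work.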
To finalize this discussion, I tested with power-of-two sizes and a single host allocation. Device memory consumption scales by a factor of two starting from 1 GB allocations, so on the model I tested (A100), 1 GB of host-pinned memory consumes around 2 MB of device memory, 2 GB consumes 4 MB, and so on.
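The power-of-two sweep described above could look roughly like this (a sketch, not the exact test that was run; the 256 MiB to 8 GiB range is an assumption, and results will vary by GPU and driver):

```cuda
// Sketch: sweep power-of-two host-pinned allocation sizes and report
// the device-memory delta for each. Requires a CUDA-capable GPU.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaFree(0);  // establish the context before measuring

    // 256 MiB up to 8 GiB, doubling each iteration
    for (size_t bytes = 1ull << 28; bytes <= 1ull << 33; bytes <<= 1) {
        size_t freeBefore = 0, freeAfter = 0, total = 0;
        cudaMemGetInfo(&freeBefore, &total);

        void *pinned = nullptr;
        if (cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault) != cudaSuccess)
            break;  // ran out of pinnable host memory

        cudaMemGetInfo(&freeAfter, &total);
        printf("pinned %5zu MiB -> device delta %6zu KiB\n",
               bytes >> 20, (freeBefore - freeAfter) >> 10);

        cudaFreeHost(pinned);  // release before the next size
    }
    return 0;
}
```

On the A100 result quoted above, the deltas would come out near 2 MB per 1 GB pinned, i.e., roughly 0.2% overhead, which fits the page-table-entry explanation suggested earlier in the thread.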