Memory leak on RHEL 8.5 after running cudaHostRegister/cudaHostUnregister


A custom device driver allocates memory using dma_alloc_coherent() and exposes it to userland using remap_pfn_range() in its .mmap routine. The user process mmap()s the device, then registers and unregisters the memory using cudaHostRegister/cudaHostUnregister. After the file descriptor of the custom device is closed, dmesg reports a lot of failures while trying to release the pages, because their refcount value is negative (-1023).


I prepared a small repository to show the problem.
It basically consists of a device driver and a userland test that triggers the memory leak. It's really easy to build and try.

My guess

I think the negative value is closely related to GUP_PIN_COUNTING_BIAS (1024). After tracing the kernel with ftrace I found:

  • cudaHostRegister() is not calling pin_user_pages()
  • cudaHostUnregister() is calling os_unpin_user_pages → unpin_user_pages

I don't know why cudaHostRegister() is not actually pinning the pages. I also don't know why cudaHostUnregister() is calling unpin_user_pages on pages that were never pinned.
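
To make the guess above concrete, here is a toy user-space model in C of the refcount arithmetic. It is not kernel code: toy_page, toy_pin and toy_unpin are hypothetical names made up for illustration, and the starting refcount of 1 (a single reference held by the driver's allocation) is an assumption. The only value taken from the kernel is GUP_PIN_COUNTING_BIAS (1024), which, as far as I understand, is added to the refcount of a normal page when it is pinned and subtracted again when it is unpinned.

```c
/* Bias applied to the refcount of a normal page on pin/unpin,
 * as defined in include/linux/mm.h. */
#define GUP_PIN_COUNTING_BIAS 1024

/* Toy stand-in for struct page: only the refcount is modelled. */
struct toy_page {
    int refcount;
};

/* Hypothetical model of what pinning does to the refcount. */
static void toy_pin(struct toy_page *p)
{
    p->refcount += GUP_PIN_COUNTING_BIAS;
}

/* Hypothetical model of what unpinning does to the refcount. */
static void toy_unpin(struct toy_page *p)
{
    p->refcount -= GUP_PIN_COUNTING_BIAS;
}

/* Expected case: register pins the page, unregister unpins it. */
static int refcount_after_balanced(int initial)
{
    struct toy_page page = { .refcount = initial };
    toy_pin(&page);
    toy_unpin(&page);
    return page.refcount;
}

/* Case suggested by the ftrace output: the pin never happens,
 * but the unpin still does. */
static int refcount_after_unpin_only(int initial)
{
    struct toy_page page = { .refcount = initial };
    toy_unpin(&page);
    return page.refcount;
}
```

With an assumed initial refcount of 1, refcount_after_balanced(1) gives back 1, while refcount_after_unpin_only(1) gives 1 - 1024 = -1023, which is the value reported in dmesg.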

What I need

Stop the memory leak

How to report a bug to NVIDIA is explained here: How to report a bug