How does the CUDA driver load kernel code from a shared library?

We are using PyTorch in our code and have found that it uses a lot of CPU RAM.
The library's libtorch.so is about 1.2 GB and contains both CPU code and GPU (kernel) code.
After some analysis, I found that after the first CUDA call there is a large anonymous memory mapping (about 1 GB) that is almost entirely resident in physical memory. According to a dump from cuda-gdb, it contains the CUDA kernel code symbols. See more details here: https://discuss.pytorch.org/t/why-pytorch-used-so-many-cpu-ram/73151.
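For reference, here is a minimal sketch of how I observed this, assuming a Linux machine with a CUDA-capable GPU and PyTorch installed. It triggers the first CUDA call and then prints the large anonymous mappings (size, RSS, Private_Dirty) from /proc/self/smaps:

```python
import re
import torch

# Trigger the first CUDA call: this initializes the driver context and,
# in my observation, is when the ~1 GB anonymous mapping shows up.
torch.zeros(1, device="cuda")

# smaps header lines look like: "7f2c76a00000-7f2c76c00000 rw-p 00000000 00:00 0 [path]"
header_re = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S+)\s+\S+\s+\S+\s+\S+\s*(.*)$")

mappings = []
current = None
with open("/proc/self/smaps") as f:
    for line in f:
        m = header_re.match(line)
        if m:
            # Start of a new mapping record.
            current = {"range": f"{m.group(1)}-{m.group(2)}",
                       "perms": m.group(3),
                       "path": m.group(4).strip()}
            mappings.append(current)
        elif current is not None:
            # Field lines such as "Private_Dirty:  1024 kB".
            key, _, rest = line.partition(":")
            parts = rest.split()
            if parts and parts[-1] == "kB":
                current[key] = int(parts[0])

# Report anonymous mappings larger than ~100 MB.
for m in mappings:
    if not m["path"] and m.get("Size", 0) > 100 * 1024:
        print(f'{m["range"]} {m["perms"]} '
              f'size={m["Size"]} kB rss={m.get("Rss", 0)} kB '
              f'private_dirty={m.get("Private_Dirty", 0)} kB')
```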

My guess is that the CUDA driver allocates that memory to store the GPU code, and that this area is mapped into the GPU's virtual address space so the GPU cores can directly access and execute it. Can anyone provide more detail?

The problem is that this mapping is private dirty, which means the kernel cannot share it with other processes. What is the reason for that? How can we optimize memory usage for a library that contains a lot of kernel code?
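To show why this matters in practice, here is a rough sketch (again assuming Linux; the PIDs are whatever worker processes happen to be running) that sums Private_Dirty over each process's anonymous mappings, so the ~1 GB cost being paid once per process rather than shared becomes visible:

```python
import sys

def private_dirty_kb(pid):
    """Sum Private_Dirty over all anonymous mappings of one process."""
    total = 0
    anon = False
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            first = line.split()[0]
            if "-" in first and not first.endswith(":"):
                # Header line: a mapping is anonymous when it has no backing path
                # (exactly 5 tokens: range, perms, offset, dev, inode).
                anon = len(line.split()) == 5
            elif first == "Private_Dirty:" and anon:
                total += int(line.split()[1])
    return total

# Usage: python check_private_dirty.py <pid> [<pid> ...]
for pid in sys.argv[1:]:
    print(pid, private_dirty_kb(pid), "kB of private dirty anonymous memory")
```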