Are modules reference counted across host threads?

I’m using the cuModuleLoad/cuModuleUnload driver API. My question is whether modules are reference counted by the driver, or whether I have to manage reference counting myself.
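To be concrete, this is essentially all each thread does per module (a minimal sketch; `kernels.cubin` and `my_kernel` are placeholder names, it assumes a current CUcontext, and error checking is omitted):

```c
#include <cuda.h>

/* Sketch: load a module, grab a kernel, unload.  Assumes a CUcontext is
 * already current on this thread; names are placeholders. */
static void use_module(void)
{
    CUmodule   mod;
    CUfunction fn;

    cuModuleLoad(&mod, "kernels.cubin");        /* reference counted by the driver? */
    cuModuleGetFunction(&fn, mod, "my_kernel");
    /* ... launch via cuLaunchKernel(fn, ...) ... */
    cuModuleUnload(mod);                        /* refcount decrement, or immediate free? */
}
```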

Also, if I have several host threads, each with its own CUcontext bound to the same GPU device, and each CUcontext loads the same CUmodule via cuModuleLoad, does the device store a separate copy of the .cubin kernels per context, or are the kernels cached (e.g. in constant memory)? It would seem wasteful to store the same kernels multiple times on one GPU, even if they belong to different CUcontexts. The pattern I have in mind is sketched below.
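Here is roughly what that looks like with pthreads (again a sketch: `kernels.cubin` is a placeholder file name and error checks are dropped for brevity):

```c
#include <cuda.h>
#include <pthread.h>

#define NUM_THREADS 4

/* Each host thread creates its own context on device 0 and loads the SAME
 * .cubin.  Does the GPU end up holding NUM_THREADS copies of the code? */
static void *worker(void *arg)
{
    CUdevice  dev = *(CUdevice *)arg;
    CUcontext ctx;
    CUmodule  mod;

    cuCtxCreate(&ctx, 0, dev);           /* one context per thread, same GPU  */
    cuModuleLoad(&mod, "kernels.cubin"); /* same file loaded in every context */

    /* ... get functions, launch kernels ... */

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    CUdevice  dev;
    int       i;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    for (i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, &dev);
    for (i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```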

This brings up a related question: since each host thread issues its own cuModuleLoad and cuModuleUnload, any reference count the implementation keeps must be updated in a thread-safe way. Can anyone confirm this (assuming reference counting is done at all)?

Is there a way to determine how much device memory a .cubin will occupy once loaded by cuModuleLoad, other than checking memory usage before and after loading? I assume the .cubin is stored compressed and is expanded when loaded.
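Right now the best I can do is that before/after measurement, roughly like this (a sketch using cuMemGetInfo; it assumes a current context, and the delta also includes any allocator overhead, so it’s only an approximation):

```c
#include <cuda.h>
#include <stdio.h>

/* Sketch: estimate a module's device-memory footprint by sampling free
 * memory around cuModuleLoad.  Assumes a CUcontext is current; the delta
 * is an approximation, not an exact module size. */
static void measure_module_footprint(const char *cubin_path)
{
    CUmodule mod;
    size_t   free_before, free_after, total;

    cuMemGetInfo(&free_before, &total);
    cuModuleLoad(&mod, cubin_path);
    cuMemGetInfo(&free_after, &total);

    printf("module consumed roughly %zu bytes of device memory\n",
           free_before - free_after);
    cuModuleUnload(mod);
}
```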

Lots of questions, but the documentation is vague in some areas, especially around what’s really going on “under the hood”. And I need to get under the hood!