What does cuModuleLoad do?

In a current project, we compile our CUDA kernel code to cubin and use the runtime API to launch the kernel. During testing, we found that when we call cuModuleLoad to load a cubin file, GPU usage shows up in nvidia-smi, and the call can take several milliseconds. So I want to know what this function does. The documentation says:

Takes a filename fname and loads the corresponding module module into the current context. The CUDA driver API does not attempt to lazily allocate the resources needed by a module; if the memory for functions and data (constant and global) needed by the module cannot be allocated, cuModuleLoad() fails. The file should be a cubin file as output by nvcc, or a PTX file either as output by nvcc or handwritten, or a fatbin file as output by nvcc from toolchain 4.0 or later.

It seems that constant and global memory allocation is completed during this call. For constant memory, I know the driver can determine its size at this point. But how can the driver know the global memory size at this point?
The documentation also talks about allocating “the resources needed by a module”; what does the word ‘resources’ refer to? Since our cubin files contain many instructions (due to loop unrolling), I want to know whether cuModuleLoad transfers the instructions to the GPU, so that the time taken by this call depends on the instruction count in the cubin file.

If the file to be loaded is a cubin, cuModuleLoad() will transfer the binary image contained therein to the GPU. If the file contains PTX, there will be a JIT compilation step before that (and this can really drive up execution time if the code is large).

Unless the specified file is already cached in system memory by the file system, the I/O time for loading the file will dominate over the time needed to download the image to the GPU. Generally speaking, the larger the image, the longer the transfer time. However, due to various fixed overheads in this transfer pipeline, the dependency between image size and the time taken by cuModuleLoad() is likely to be approximately rather than strictly linear.
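To see how the time splits between file I/O and the download to the GPU, one option is to read the file into host memory yourself and then hand the image to cuModuleLoadData(), timing the two steps separately. The following is only a rough sketch under some assumptions: the helper names (`read_file`, `time_module_load`, `report_load_times`) are made up for illustration, device 0 and its primary context are used, and the `__has_include` guard is there only so the I/O part still compiles on a machine without the CUDA toolkit. On newer CUDA versions with lazy module loading enabled, part of the work may be deferred past this call, so treat the numbers as indicative.

```cpp
#include <chrono>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Use the driver API only if its header is available (illustrative guard).
#if defined(__has_include)
#  if __has_include(<cuda.h>)
#    include <cuda.h>
#    define HAVE_CUDA_DRIVER 1
#  endif
#endif

using Clock = std::chrono::steady_clock;

// Milliseconds elapsed since t0.
static double ms_since(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

// Read a whole file into host memory; returns an empty vector on failure.
static std::vector<char> read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return {};
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

#ifdef HAVE_CUDA_DRIVER
// Time cuModuleLoadData() on an in-memory image, in ms; negative on error.
// Context setup happens before the timer starts so it is not counted.
static double time_module_load(const void* image) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    if (cuInit(0) != CUDA_SUCCESS) return -1.0;
    if (cuDeviceGet(&dev, 0) != CUDA_SUCCESS) return -1.0;  // assumes device 0
    if (cuDevicePrimaryCtxRetain(&ctx, dev) != CUDA_SUCCESS) return -1.0;
    double ms = -1.0;
    if (cuCtxSetCurrent(ctx) == CUDA_SUCCESS) {
        Clock::time_point t0 = Clock::now();
        CUresult rc = cuModuleLoadData(&mod, image);
        double elapsed = ms_since(t0);
        if (rc == CUDA_SUCCESS) {
            ms = elapsed;
            cuModuleUnload(mod);
        }
    }
    cuDevicePrimaryCtxRelease(dev);
    return ms;
}
#endif

// Print both timings for one file; returns 0 on success, 1 on any failure.
static int report_load_times(const std::string& path) {
    Clock::time_point t0 = Clock::now();
    std::vector<char> image = read_file(path);
    double io_ms = ms_since(t0);
    if (image.empty()) {
        std::fprintf(stderr, "could not read %s\n", path.c_str());
        return 1;
    }
    std::printf("file read: %.3f ms (%zu bytes)\n", io_ms, image.size());
    image.push_back('\0');  // PTX text must be NUL-terminated; extra byte is ignored for a cubin
#ifdef HAVE_CUDA_DRIVER
    double load_ms = time_module_load(image.data());
    if (load_ms < 0.0) {
        std::fprintf(stderr, "driver call failed\n");
        return 1;
    }
    std::printf("cuModuleLoadData: %.3f ms\n", load_ms);
#endif
    return 0;
}
```

Calling `report_load_times("kernel.cubin")` from `main` (the file name is a placeholder, and the program needs to be linked against the CUDA driver library) prints the two timings. Running it twice in a row should show the effect of the file system cache on the first number.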

To minimize the execution time of cuModuleLoad(), you would want fast mass storage (e.g. an NVMe SSD), a high-frequency CPU (e.g. >= 3.5 GHz base frequency), and fast system memory (e.g. as many channels of DDR4-2666 as possible).