In a current project, we compile our CUDA kernel code to a cubin and use the driver API to launch the kernel. During testing, we found that when we call cuModuleLoad to load a cubin file, GPU usage becomes visible in nvidia-smi, and the call can take several milliseconds. So I want to know what this method actually does. The documentation says:
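For context, this is roughly how we measure the load time, as a minimal sketch (error handling omitted; "kernel.cubin" is a placeholder for our actual file):

```c
#include <stdio.h>
#include <time.h>
#include <cuda.h>   /* CUDA driver API */

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* Time only the module load itself. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    CUresult rc = cuModuleLoad(&mod, "kernel.cubin");
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("cuModuleLoad returned %d after %.3f ms\n", (int)rc, ms);

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

In our tests the printed time is in the range of several milliseconds.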
Description Takes a filename fname and loads the corresponding module module into the current context. The CUDA driver API does not attempt to lazily allocate the resources needed by a module; if the memory for functions and data (constant and global) needed by the module cannot be allocated, cuModuleLoad() fails. The file should be a cubin file as output by nvcc, or a PTX file either as output by nvcc or handwritten, or a fatbin file as output by nvcc from toolchain 4.0 or later.
It seems that the constant and global memory allocations are completed during this call. For constant memory, I understand that the driver can determine its size at this point. But how can the driver know the global memory size at module-load time?
And the documentation says the driver does "not attempt to lazily allocate the resources needed by a module" — what exactly does "resources" refer to here? Since our cubin files contain many instructions (due to loop unrolling), I would like to know whether cuModuleLoad transfers the instructions to the GPU, so that the time this method takes is related to the instruction count in the cubin file.
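For reference, the cubin contents can be inspected with the CUDA binary utilities (assuming the toolkit is on PATH and the file is named kernel.cubin):

```shell
# List the ELF sections of the cubin: the .text.* sections hold the
# machine instructions, while .nv.constant* sections hold the
# statically sized constant data.
cuobjdump -elf kernel.cubin

# Disassemble the SASS instructions; loop unrolling shows up
# directly as a larger instruction count here.
cuobjdump -sass kernel.cubin
```

This at least shows how much of the file is code versus data, even if it does not answer what the driver copies to the GPU at load time.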