what does cuModuleLoad do?

dongxiao · January 21, 2020, 7:34am

In a current project, we compile our CUDA kernel code to cubin and use the runtime API to launch the kernel. During testing, we found that when we call cuModuleLoad to load a cubin file, GPU usage shows up in nvidia-smi and the call can take several milliseconds. So I want to know what this method does. The documentation says:

Description
Takes a filename fname and loads the corresponding module module into the current context. The CUDA driver API does not attempt to lazily allocate the resources needed by a module; if the memory for functions and data (constant and global) needed by the module cannot be allocated, cuModuleLoad() fails. The file should be a cubin file as output by nvcc, or a PTX file either as output by nvcc or handwritten, or a fatbin file as output by nvcc from toolchain 4.0 or later.

It seems that constant and global memory allocation is completed during the execution of this method. For constant memory, I know the driver can determine its size at this point. But how can the driver know the global memory size at this point?
Also, the documentation talks about “attempting to allocate the resources needed by a module”. What does the word ‘resources’ refer to? As there are many instructions in our cubin files (due to loop unrolling), I want to know whether cuModuleLoad transfers the instructions to the GPU, which would mean the time taken by this method depends on the instruction count in the cubin file.

njuffa · January 21, 2020, 11:46am

If the file to be loaded is a cubin, cuModuleLoad() will transfer the binary image contained therein to the GPU. If the file loaded contains PTX, there will be a JIT compilation step before that (and this can really drive up execution time if the code is large).
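
To tell in advance which of the two cases applies, one could look at the first bytes of the file: a cubin is an ELF object, while PTX is plain text. This is a minimal sketch, assuming a PTX file carries its `.version` directive within the first 4 KB. `classify_image` and the `image_kind` enum are hypothetical names, not part of the CUDA API, and anything else (a fatbin, for example) is simply reported as "other".

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper types; not part of the CUDA driver API. */
typedef enum {
    IMAGE_UNREADABLE,  /* file could not be opened */
    IMAGE_ELF_CUBIN,   /* starts with the ELF magic, as cubin files do */
    IMAGE_PTX_TEXT,    /* looks like PTX source, so a JIT step is expected */
    IMAGE_OTHER        /* anything else, e.g. a fatbin */
} image_kind;

static image_kind classify_image(const char *path)
{
    unsigned char head[4096];
    FILE *f = fopen(path, "rb");
    if (!f)
        return IMAGE_UNREADABLE;

    /* Read at most 4095 bytes and NUL-terminate so strstr() is safe. */
    size_t n = fread(head, 1, sizeof head - 1, f);
    fclose(f);
    head[n] = '\0';

    /* ELF magic: 0x7F 'E' 'L' 'F'. The literal is split so that 'E'
       is not parsed as part of the hex escape. */
    if (n >= 4 && memcmp(head, "\x7f" "ELF", 4) == 0)
        return IMAGE_ELF_CUBIN;

    /* Assumption: PTX text contains a .version directive near the top. */
    if (strstr((const char *)head, ".version") != NULL)
        return IMAGE_PTX_TEXT;

    return IMAGE_OTHER;
}
```

If this reports PTX for a file you expected to be a cubin, the JIT step is a likely source of the extra load time.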

Unless the file specified is already cached in system memory by the file system, the I/O time for loading the file will dominate over the time needed to download the image to the GPU. Generally speaking, the larger the image, the longer the transfer time. However, due to various fixed overheads in this transfer pipeline, it is likely that there is an approximate rather than a strict linear dependency between image size and the time taken by cuModuleLoad().
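
To see how the time splits between file I/O and the load into the context, one option is to read the file into host memory yourself and pass the buffer to cuModuleLoadData(), timing each step separately. The following is only a sketch with minimal error handling. `now_ms`, `read_file`, `time_module_load` and the `USE_CUDA_DRIVER` macro are names made up for this example, and it assumes device 0 and its primary context.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifdef USE_CUDA_DRIVER      /* hypothetical switch: build with -DUSE_CUDA_DRIVER -lcuda */
#include <cuda.h>
#endif

/* Wall-clock time in milliseconds (C11 timespec_get). */
static double now_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

/* Read a whole file into a malloc'd, NUL-terminated buffer.
   Returns NULL on failure; on success *size is the file length. */
static void *read_file(const char *path, size_t *size)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;
    fseek(f, 0, SEEK_END);
    long len = ftell(f);
    rewind(f);
    if (len < 0) {
        fclose(f);
        return NULL;
    }
    char *buf = malloc((size_t)len + 1);
    if (buf && fread(buf, 1, (size_t)len, f) != (size_t)len) {
        free(buf);
        buf = NULL;
    }
    fclose(f);
    if (buf) {
        buf[len] = '\0';    /* PTX images must be NUL-terminated */
        *size = (size_t)len;
    }
    return buf;
}

#ifdef USE_CUDA_DRIVER
/* Times the file read and cuModuleLoadData() separately. Returns 0 on success. */
static int time_module_load(const char *path)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    size_t size = 0;

    if (cuInit(0) != CUDA_SUCCESS || cuDeviceGet(&dev, 0) != CUDA_SUCCESS)
        return 1;
    if (cuDevicePrimaryCtxRetain(&ctx, dev) != CUDA_SUCCESS)
        return 1;
    if (cuCtxSetCurrent(ctx) != CUDA_SUCCESS) {
        cuDevicePrimaryCtxRelease(dev);
        return 1;
    }

    double t0 = now_ms();
    void *image = read_file(path, &size);        /* file I/O only */
    double t1 = now_ms();
    if (!image) {
        cuDevicePrimaryCtxRelease(dev);
        return 1;
    }

    CUresult rc = cuModuleLoadData(&mod, image);  /* load into the context */
    double t2 = now_ms();

    printf("read %zu bytes: %.3f ms, cuModuleLoadData: %.3f ms\n",
           size, t1 - t0, t2 - t1);

    if (rc == CUDA_SUCCESS)
        cuModuleUnload(mod);
    free(image);
    cuDevicePrimaryCtxRelease(dev);
    return rc != CUDA_SUCCESS;
}
#endif
```

Calling `time_module_load()` twice on the same file from `main` gives a feel for the uncached versus cached case. Note that recent drivers support lazy module loading, which may defer part of the work to first use; as far as I know, setting the environment variable CUDA_MODULE_LOADING=EAGER before the run avoids that when measuring.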

To minimize the execution time of cuModuleLoad(), you would want fast mass storage (e.g. an NVMe SSD), a high-frequency CPU (e.g. >= 3.5 GHz base frequency), and fast system memory (e.g. as many channels of DDR4-2666 as possible).
