Lazy Loading

This question was sent to me directly, so I'm repeating it here as a post:

What is lazy loading, and what benefits should I expect from it?

Lazy loading is a catch-all term we use to describe a few different techniques for lowering memory consumption by CUDA applications. Ultimately, they all boil down to not loading functions either into host memory or into the GPU’s memory until the first time the application needs to call them.

If you think about some of our larger libraries, like cuDNN or cuBLAS, there are tens of thousands of kernels you could run, yet a typical application calls maybe 5% of them. By not loading the other 95%, you can see savings in both the time it takes to load the application (less data transfer to the GPU) and memory utilization (functions that aren't called aren't loaded). In some applications the savings can be very substantial.
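To make the idea concrete, here is a toy model in Python. It is only an analogy and not how the CUDA driver is implemented: the `LazyKernelTable` class, the `make_loader` helper and the "kernels" themselves are hypothetical stand-ins. It shows that when loading is deferred to the first call, only the functions that are actually used ever get loaded.

```python
class LazyKernelTable:
    """Toy model: a 'kernel' is loaded only on the first call that needs it."""

    def __init__(self, loaders):
        self._loaders = loaders   # name -> zero-argument function that "loads" the kernel
        self._loaded = {}         # name -> loaded callable, filled in on demand

    def call(self, name, *args):
        if name not in self._loaded:
            # The first invocation pays the one-time load cost.
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name](*args)

    @property
    def loaded_count(self):
        return len(self._loaded)


def make_loader(scale):
    # Hypothetical loader: returns a plain function standing in for a kernel.
    def load():
        return lambda x: x * scale
    return load


if __name__ == "__main__":
    # 100 kernels are available, but the "application" only ever calls two of them.
    table = LazyKernelTable({f"k{i}": make_loader(i) for i in range(100)})
    for name in ("k2", "k3", "k2"):
        table.call(name, 10)
    print(table.loaded_count, "of 100 kernels loaded")
```

The second call to `k2` reuses the already loaded entry, which mirrors the point that the load cost is paid once per function rather than on every call.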

It’s worth noting that because we don’t load functions until you call them, the first invocation of each function has higher latency. For most applications that’s unnoticeable and the net effect is a performance increase, but if your application is particularly latency sensitive you can switch lazy loading off with the CUDA_MODULE_LOADING environment variable (valid settings are EAGER, which turns it off, and LAZY, which turns it on).
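As a rough sketch of how you might select the mode for a single run, the Python snippet below launches a child process with CUDA_MODULE_LOADING set in its environment. The helper names `env_with_module_loading` and `run_with_module_loading` are made up for this example, and the child command is a stand-in that just prints the variable; you would replace it with your own CUDA application. It assumes the variable has to be in the process environment before CUDA is initialized, which is why it is set at launch rather than partway through the program.

```python
import os
import subprocess
import sys

VALID_MODES = ("EAGER", "LAZY")  # the two settings named in the post


def env_with_module_loading(mode, base_env=None):
    """Return a copy of the environment with CUDA_MODULE_LOADING set to mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_MODULE_LOADING"] = mode
    return env


def run_with_module_loading(cmd, mode):
    """Run cmd in a child process whose environment selects the loading mode."""
    return subprocess.run(
        cmd,
        env=env_with_module_loading(mode),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in child that reports the variable it sees.
    # Replace this command with your own CUDA application.
    child = [sys.executable, "-c",
             "import os; print(os.environ['CUDA_MODULE_LOADING'])"]
    print(run_with_module_loading(child, "EAGER").stdout.strip())
```

From a shell, the equivalent is to set the variable on the command line that starts the application.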

Lazy loading is enabled by default in CUDA 12.2 for Linux and 12.3 for Windows.
