CUDA Context-Independent Module Loading

Originally published at: https://developer.nvidia.com/blog/cuda-context-independent-module-loading/

Most CUDA developers are familiar with the cuModuleLoad API and its counterparts for loading a module containing device code into a CUDA context. In most cases, you want to load identical device code on all devices. This requires loading device code into each CUDA context explicitly. Moreover, libraries and frameworks that do not control context creation and…
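The per-context loading the excerpt describes can be sketched with the driver API. This is a minimal sketch, not the blog's actual code: the fatbin image `image`, the kernel name `"myKernel"`, and the omitted error handling are all assumptions.

```cuda
#include <cuda.h>

// Traditional per-context module loading: the same device code must be
// loaded into every context explicitly, yielding one CUmodule and one
// CUfunction handle per context.
void load_per_context(const void *image, int numDevices) {
    for (int dev = 0; dev < numDevices; ++dev) {
        CUcontext ctx;
        CUmodule  module;   // per-context handle
        CUfunction func;    // per-context handle
        cuCtxCreate(&ctx, 0, dev);
        cuModuleLoadData(&module, image);               // load into this context
        cuModuleGetFunction(&func, module, "myKernel"); // resolve in this context
        // ... launch func, then cuModuleUnload(module) when done ...
    }
}
```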

Hello @jwitsoe ,

I’m in the process of integrating the CUDA 12.x library functionality into the CUDA C++ API (or API wrappers if you will) which I maintain.


I’m struggling to understand to what extent CUkernels and CUfunctions can be used interchangeably.

  • Are they really one and the same thing, behind the scenes?
  • Can all APIs that take a CUfunction also accept a CUkernel?
  • Can CUkernels have context-specific attributes and preferences?
  • Are any resources allocated when one uses cuKernelGetFunction()?
  • Is there any reason to load modules directly into contexts, or is it basically always better to load libraries and call cuKernelGetFunction() on their kernels?

Hello @epk. Sorry for the delay in getting to your questions. Really appreciate your work and contribution towards CUDA. Let me try to answer your questions.

  • Are they really one and the same thing, behind the scenes?
    No, CUkernel and CUfunction handles are not the same thing. CUfunction is a per-
    context handle corresponding to a kernel function, whereas CUkernel is context-agnostic. You can
    have N CUfunction handles per CUkernel handle, where N is the number of contexts you create. As you
    pointed out, cuKernelGetFunction() can be used to get the CUfunction handle from a CUkernel handle;
    the handle retrieved corresponds to the current context.
  • Can all functions which take CUfunction’s also take CUkernel’s?
    Some APIs, especially kernel launch APIs such as cuLaunchKernel(), take CUkernels as well; you just
    need to cast them to a CUfunction. The APIs where it is okay to pass a CUkernel in place of a
    CUfunction are documented.
  • Can CUkernel’s have context-specific attributes and preferences?
    Not really. But if you apply an attribute to a CUkernel, let’s say using cuKernelSetAttribute(), the
    attribute is applied to all corresponding CUfunction handles (one per context) on the specified
    device.
  • Are any resources allocated when one uses cuKernelGetFunction ?
    Resources may be allocated if the CUfunction is not already loaded. Please refer to the
    CUDA C++ Programming Guide for more details.
  • Is there any reason to load modules directly into contexts, or is it basically always better to load
    libraries and call cuKernelGetFunction() on their kernels?
    Loading libraries is better, as you then don’t need to do any module management in your code, as
    mentioned in the blog post. You don’t really need to use cuKernelGetFunction(), since a CUkernel can
    be passed to commonly used APIs such as the launch APIs. Do you have a case where you feel the need
    to use cuKernelGetFunction()?
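Putting the answers above together, here is a hedged sketch of the context-independent flow. The fatbin image `fatbin`, the kernel name `"myKernel"`, the device pointer, and the launch dimensions are illustrative assumptions, and error checking is omitted.

```cuda
#include <cuda.h>

// Context-independent loading: load device code once into a CUlibrary,
// get a context-agnostic CUkernel, and let the driver resolve the
// per-context CUfunction lazily.
void launch_context_independent(const void *fatbin, CUdeviceptr devPtr) {
    CUlibrary library;
    CUkernel  kernel;

    // Load once; no context is required at this point.
    cuLibraryLoadData(&library, fatbin, NULL, NULL, 0, NULL, NULL, 0);
    cuLibraryGetKernel(&kernel, library, "myKernel");

    // An attribute set on the CUkernel applies to all corresponding
    // CUfunction handles (one per context) on the specified device.
    cuKernelSetAttribute(CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
                         48 * 1024, kernel, /*device=*/0);

    // Launch APIs accept a CUkernel cast to a CUfunction; the CUfunction
    // for the current context is resolved internally, so there is no need
    // to call cuKernelGetFunction() yourself.
    void *args[] = { &devPtr };
    cuLaunchKernel((CUfunction)kernel,
                   256, 1, 1,   /* grid  */
                   128, 1, 1,   /* block */
                   0, 0,        /* shared mem, stream */
                   args, NULL);

    cuLibraryUnload(library);
}
```

Note that no explicit module management per context appears anywhere in this flow, which is the main point of the blog post above.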