How to hook CUDA runtime API in CUDA 11.4

As we know, we can use LD_PRELOAD to intercept the CUDA driver API, and from the example code provided by NVIDIA I understand that the CUDA Runtime symbols cannot be hooked but the underlying driver ones can. Can I therefore conclude that the CUDA runtime API calls the driver API?
I also observed the following: when I compile a CUDA program with NVCC v10.0, I can hook the underlying driver symbols of every CUDA runtime API call, but when I compile the same program with NVCC v11.4, I cannot hook the driver symbols underlying the runtime API. I can only hook the driver symbols when I call the CUDA driver API directly in my program.
I am not sure of the reason: did CUDA 11.4 change how CUDA programs are compiled, or is something else going on? My question is: is it possible to hook the CUDA driver symbols when I call the CUDA runtime API in a program compiled with NVCC 11.4?

A library, even a dynamically loaded one, can be “linked” to in more than one way. Using the usual method of a formal link, the link mechanism will be exposed at dynamic library load time (at application start-up, under the control of the dynamic loader, part of the Linux OS), and these types of links can be hooked. Such a mechanism will also show up as a dependency that can be inspected with the ldd tool.

However, a library can also be loaded “manually” and connections made “manually” to routine entry points. Such usage of a runtime-loaded library cannot be intercepted via the LD_PRELOAD trick (a minimal sketch of this pattern appears after the CUDA 8.0 ldd output below). We can observe that libcudart.so (and presumably, by extension, the runtime API, although this is not an exhaustive proof) does not link (using the formal/ldd-visible definition) to libcuda (note that it does link to libdl, a possible clue):

$ ldd /usr/local/cuda-11.4/lib64/libcudart.so
        linux-vdso.so.1 =>  (0x00007ffe041c4000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fce3d741000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fce3ddd2000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fce3d53c000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fce3d320000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fce3d118000)

and therefore there is no reason to presume that we could intercept calls to libcuda, originating from libcudart, via the LD_PRELOAD trick. I’m fairly certain this observation can be made on CUDA versions prior to 11.4; here is the same output from 8.0:

$ ldd /usr/local/cuda-8.0/lib64/libcudart.so
        linux-vdso.so.1 =>  (0x00007ffeb63d0000)
        libc.so.6 => /lib64/libc.so.6 (0x00007ff6e0624000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007ff6e041f000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ff6e0203000)
        librt.so.1 => /lib64/librt.so.1 (0x00007ff6dfffb000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ff6e0c79000)
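
For illustration, here is a minimal sketch (not taken from libcudart; the file name and error handling are just for the example) of what such “manual” loading looks like. Note that the dlsym lookup consults only the handle passed to it, so a preloaded library that merely exports cuInit would never be considered:

/* manual_load.c - minimal sketch of run-time ("manual") loading via libdl.
   This creates no ldd-visible dependency on libcuda. */
#include <stdio.h>
#include <dlfcn.h>
#include <cuda.h>

int main(void)
{
    /* Load the driver library at run time... */
    void *libcuda = dlopen("libcuda.so.1", RTLD_LAZY);
    if (!libcuda) { fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

    /* ...and resolve an entry point by name, against that handle only. */
    CUresult (*pcuInit)(unsigned int) =
        (CUresult (*)(unsigned int))dlsym(libcuda, "cuInit");
    if (!pcuInit) { fprintf(stderr, "dlsym: %s\n", dlerror()); return 1; }

    printf("cuInit returned %d\n", pcuInit(0));
    dlclose(libcuda);
    return 0;
}

$ gcc -o manual_load manual_load.c -I/usr/local/cuda/include -ldl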

The default usage of the CUDA runtime API (more properly, the default nvcc behavior) links against the static library libcudart_static.a, not the dynamic one above, so to judge whether the LD_PRELOAD trick could work for such a build we would need to inspect an actual application. If it exposes a link to libcuda via ldd, then it’s reasonable to assume the routines might be intercepted with the LD_PRELOAD trick; if it shows no dependency on libcuda, there is no reason to assume we could hook calls into libcuda. Based on my observations, applications linked statically to libcudart also do not show a dependency on libcuda:

$ ldd t1896
        linux-vdso.so.1 =>  (0x00007ffe7fbd5000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f21748d5000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f21746b8000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f21744b4000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f21741ad000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f2173eaa000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f2173c94000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f21738d2000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2174b0a000)

therefore I conclude there is no reason to believe that I could hook a call into libcuda using the LD_PRELOAD trick, and I also observe that this restriction/limitation is not new or different in 11.4 compared to many previous versions of CUDA.

If you have control over the application build process, you can cause the linking to the CUDA runtime API to be done dynamically, in which case you can hook the runtime API calls using the LD_PRELOAD trick.
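
For example, here is a minimal sketch of that approach (the file names are placeholders; it interposes just cudaMalloc and assumes the application was linked against the dynamic runtime with nvcc --cudart=shared):

/* hook_cudart.c - minimal LD_PRELOAD interposer for one runtime API call.
   Only effective if the application resolves cudaMalloc through libcudart.so. */
#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>
#include <cuda_runtime_api.h>

cudaError_t cudaMalloc(void **devPtr, size_t size)
{
    /* Find the real cudaMalloc in the next object in search order (libcudart.so). */
    static cudaError_t (*real_cudaMalloc)(void **, size_t) = NULL;
    if (!real_cudaMalloc)
        real_cudaMalloc =
            (cudaError_t (*)(void **, size_t))dlsym(RTLD_NEXT, "cudaMalloc");
    fprintf(stderr, "[hook] cudaMalloc(%zu bytes)\n", size);
    return real_cudaMalloc(devPtr, size);
}

$ gcc -shared -fPIC -o libhook.so hook_cudart.c -I/usr/local/cuda/include -ldl
$ nvcc --cudart=shared -o test test.cu
$ LD_PRELOAD=./libhook.so ./test

The driver-level calls made internally by libcudart are of course still subject to the limitations described above.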

Yes, I agree with your points: if the program is statically linked to cudart, then it is not possible to hook the runtime API. Sorry for my unclear title; my real problem is how to hook the corresponding driver symbols of the runtime API. I have built a hook which can intercept the underlying driver symbols of the runtime API, but it only works on CUDA programs compiled with NVCC versions before 11.4. The hook I wrote is modeled on the example /cuda/samples/7_CUDALibraries/cuHook/libcuhook.cpp (a rough sketch follows the test program below). To find out what the problem is, I use LD_DEBUG=symbols to display the symbol table processing, with the following simple test program.

#include <stdio.h>
#include <cuda.h>
__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  //CUdeviceptr dptr;
  //cuMemAlloc(&dptr, 1024);
  cudaSetDevice(0);
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}
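
The hook itself follows the dlsym-interposition idea from that sample. Roughly, it looks like this (just a sketch, not my exact code; it only handles cuInit, and the "GLIBC_2.2.5" version tag assumes x86_64 glibc):

/* cuhook_sketch.c - rough sketch of the dlsym-interposition idea: libcudart
   looks up driver entry points by name through dlsym, so interposing dlsym
   lets us hand back a wrapper for selected cu* symbols. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <dlfcn.h>
#include <cuda.h>

/* Obtain the real dlsym without calling dlsym itself (interposed below). */
static void *real_dlsym(void *handle, const char *symbol)
{
    typedef void *(*dlsym_fn)(void *, const char *);
    static dlsym_fn next_dlsym = NULL;
    if (!next_dlsym)
        next_dlsym = (dlsym_fn)dlvsym(RTLD_NEXT, "dlsym", "GLIBC_2.2.5");
    return next_dlsym(handle, symbol);
}

/* Wrapper for cuInit: log the call, then forward to the real driver function. */
static CUresult (*real_cuInit)(unsigned int) = NULL;

static CUresult cuInit_hook(unsigned int flags)
{
    fprintf(stderr, "[hook] cuInit(%u)\n", flags);
    return real_cuInit(flags);
}

/* Interposed dlsym: when "cuInit" is requested, remember the real entry point
   and hand back our wrapper; pass every other lookup through untouched. */
void *dlsym(void *handle, const char *symbol)
{
    if (strcmp(symbol, "cuInit") == 0) {
        real_cuInit = (CUresult (*)(unsigned int))real_dlsym(handle, symbol);
        return (void *)&cuInit_hook;
    }
    return real_dlsym(handle, symbol);
}

$ gcc -shared -fPIC -o libcuhook_sketch.so cuhook_sketch.c -I/usr/local/cuda/include -ldl
$ LD_PRELOAD=./libcuhook_sketch.so ./test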

  • Compiled with NVCC-10.0

I first compile the code using NVCC-10.0

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

$ nvcc -o test test.cu
$ LD_DEBUG=symbols ./test 2 &> debug.log
$ cat debug.log | grep "cu"
   .
   .
   2783839:     symbol=cuInit;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGet;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGetCount;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGetName;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceTotalMem_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGetAttribute;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGetP2PAttribute;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDriverGetVersion;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGetByPCIBusId;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGetPCIBusId;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGetUuid;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDevicePrimaryCtxRetain;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDevicePrimaryCtxRelease;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDevicePrimaryCtxSetFlags;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDevicePrimaryCtxGetState;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDevicePrimaryCtxReset;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxCreate_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetFlags;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxSetCurrent;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetCurrent;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxDetach;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetApiVersion;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetDevice;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetLimit;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxSetLimit;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetCacheConfig;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxSetCacheConfig;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetSharedMemConfig;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetStreamPriorityRange;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxSetSharedMemConfig;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxSynchronize;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleLoad;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleLoadData;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleLoadFatBinary;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleUnload;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleGetFunction;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleGetGlobal_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleGetTexRef;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleGetSurfRef;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuLinkCreate;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuLinkAddData;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuLinkAddFile;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuLinkComplete;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuLinkDestroy;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemGetInfo_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemAllocManaged;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemAlloc_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemAllocPitch_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemFree_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemGetAddressRange_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemFreeHost;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemHostAlloc;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemHostGetDevicePointer_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]


So it actually does perform symbol lookups of the driver symbols.
I believe that is why I can intercept the underlying driver symbols of the runtime API.

  • Compiled with NVCC-11.4

$ /opt/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0

$ /opt/cuda/bin/nvcc -o test test.cu
$ LD_DEBUG=symbols ./test 2 &> debug.log
$ cat debug.log | grep "cu"

This time I did not observe any symbol lookup of the CUDA driver symbols, and I cannot hook the corresponding driver symbols of the runtime API.
So I want to know what is different between 10.0 and 11.4 during compilation or program execution, and why I cannot intercept the underlying driver symbols of the runtime API in a program compiled with NVCC 11.4.

thx

One more question: I have read the part about linking the runtime API dynamically so that the runtime API calls can be hooked with the LD_PRELOAD trick. Is it possible to use this trick with TensorFlow? According to my understanding, TensorFlow is also dynamically linked to the CUDA runtime. However, when I tried this trick in the NVIDIA NGC TensorFlow container, it did not seem to work.
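
For reference, I assume the dynamic dependency could be checked inside the container with something like the following (the library name libtensorflow_framework.so.2 is a guess based on TF 2.x layouts, and LD_PRELOAD would have to be set in the environment of the python process itself):

$ python3 -c "import tensorflow as tf; print(tf.sysconfig.get_lib())"
$ ldd <path printed above>/libtensorflow_framework.so.2 | grep -i cuda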

thx