How to hook CUDA runtime API in CUDA 11.4

As we know, we can use LD_PRELOAD to intercept the CUDA driver API. From the example code provided by NVIDIA, I understand that CUDA Runtime symbols cannot be hooked but the underlying driver symbols can. Can I conclude that the CUDA runtime API calls the driver API?
I also observed a phenomenon: when I compile a CUDA program with NVCC 10.0, I can hook the underlying driver symbols of every CUDA runtime API call, but when I compile the same program with NVCC 11.4, I cannot hook the driver symbols behind the runtime API. I can only hook the driver symbols when I call the driver API directly in my program.
I am not sure of the reason. Did CUDA 11.4 change how CUDA programs are compiled, or is there some other cause? My question is: is it possible to hook the CUDA driver symbols when I call the CUDA runtime API in a program compiled with NVCC 11.4?

A library, even a dynamically loaded one, can be “linked” to in more than one way. With the usual method of a formal link, the linkage is resolved at dynamic library load time (at application start-up, under the control of the dynamic loader, part of the Linux OS), and these types of links can be hooked. Such a mechanism will also show up as a dependency that can be inspected with the ldd tool.

However, a library can also be loaded “manually” (via dlopen) and connections made “manually” to routine entry points (via dlsym). Such usage of a runtime-loaded library cannot be intercepted via the LD_PRELOAD trick. We can observe that libcudart.so (and presumably, by extension, the runtime API, although this is not an exhaustive proof) does not link, in the formal/ldd-visible sense, to libcuda (note that it does link to libdl, a possible clue):

$ ldd /usr/local/cuda-11.4/lib64/libcudart.so
        linux-vdso.so.1 =>  (0x00007ffe041c4000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fce3d741000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fce3ddd2000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fce3d53c000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fce3d320000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fce3d118000)

and therefore there is no reason to presume that we could intercept calls to libcuda originating from libcudart via the LD_PRELOAD trick. I’m fairly certain this observation can be made on CUDA versions prior to 11.4 as well; here is the same output from 8.0:

$ ldd /usr/local/cuda-8.0/lib64/libcudart.so
        linux-vdso.so.1 =>  (0x00007ffeb63d0000)
        libc.so.6 => /lib64/libc.so.6 (0x00007ff6e0624000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007ff6e041f000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ff6e0203000)
        librt.so.1 => /lib64/librt.so.1 (0x00007ff6dfffb000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ff6e0c79000)
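
For context, the “manual” route through libdl looks roughly like this (a minimal sketch of the mechanism, not libcudart’s actual code). Because dlsym is given an explicit handle to libcuda.so.1, a preloaded library that merely exports its own cuInit is never consulted for this lookup:

// manual_load.cpp -- sketch of runtime loading of the driver via libdl.
// Build: g++ manual_load.cpp -o manual_load -ldl
#include <dlfcn.h>
#include <cstdio>

typedef int (*cuInit_t)(unsigned int);   // CUresult is an int-sized enum

int main() {
    void *drv = dlopen("libcuda.so.1", RTLD_NOW | RTLD_GLOBAL);
    if (!drv) { std::fprintf(stderr, "dlopen failed: %s\n", dlerror()); return 1; }

    // The address is resolved from the explicit handle, not via the loader's
    // normal (preload-aware) symbol search.
    cuInit_t my_cuInit = (cuInit_t)dlsym(drv, "cuInit");
    if (!my_cuInit) { std::fprintf(stderr, "dlsym failed: %s\n", dlerror()); return 1; }

    std::printf("cuInit returned %d\n", my_cuInit(0));
    dlclose(drv);
    return 0;
}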

The default usage of the CUDA runtime API (more properly, the default nvcc behavior) uses the static library, not the dynamic one shown above. So, to decide whether the LD_PRELOAD trick could work for an application linked against libcudart_static.a, we need to inspect the actual application. If it exposes a link to libcuda via ldd, then it’s reasonable to assume the routines might be intercepted with the LD_PRELOAD trick; if it shows no dependency on libcuda, there is no reason to assume we could hook calls into libcuda. Based on my observations, applications linked statically to libcudart also do not show a dependency on libcuda:

$ ldd t1896
        linux-vdso.so.1 =>  (0x00007ffe7fbd5000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f21748d5000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f21746b8000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f21744b4000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f21741ad000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f2173eaa000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f2173c94000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f21738d2000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2174b0a000)

therefore I conclude there is no reason to believe that I could hook a call into libcuda using the LD_PRELOAD trick, and I also observe that this restriction/limitation is not new or different in 11.4 compared to many previous versions of CUDA.

If you have control over the application build process, you can cause the linking to the CUDA runtime API to be done dynamically, in which case you can intercept the runtime API calls themselves using the LD_PRELOAD trick.
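
For example, if the app is built against the shared runtime (nvcc -cudart shared, so that ldd shows libcudart.so), a preloaded library that exports the runtime symbol is found first by the dynamic loader. A minimal sketch of such an interposer (it simply logs and forwards to the real libcudart via RTLD_NEXT):

// hook_cudart.cpp -- minimal LD_PRELOAD interposer for cudaMalloc.
// Works only when the app links libcudart.so dynamically (nvcc -cudart shared).
// Build: g++ -shared -fPIC hook_cudart.cpp -o libhook.so -ldl
// Run:   LD_PRELOAD=$PWD/libhook.so ./test
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <cstdio>
#include <cstddef>
#include <dlfcn.h>

extern "C" int cudaMalloc(void **devPtr, size_t size) {   // cudaError_t is int-sized
    // Resolve the next (real) definition of cudaMalloc, i.e. the one in libcudart.so.
    using fn_t = int (*)(void **, size_t);
    static fn_t real = (fn_t)dlsym(RTLD_NEXT, "cudaMalloc");
    if (!real) return 1;                                  // 1 == cudaErrorInvalidValue

    std::fprintf(stderr, "[hook] cudaMalloc(%zu bytes)\n", size);
    return real(devPtr, size);
}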

Yes, I agree with your points: if the program is statically linked to cudart, then it’s not possible to hook the runtime API. And sorry for my unclear problem title; my real problem is how to hook the corresponding driver symbols of the runtime API. I have built a hook that can intercept the underlying driver symbols of the runtime API, but it only works on CUDA programs compiled with NVCC versions before 11.4. The hook I wrote is modeled on the example /cuda/samples/7_CUDALibraries/cuHook/libcuhook.cpp. To find out what the problem is, I used LD_DEBUG=symbols to display the loader’s symbol table processing, with the following simple test program.

#include <stdio.h>
#include <cuda.h>
__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  //CUdeviceptr dptr;
  //cuMemAlloc(&dptr, 1024);
  cudaSetDevice(0);
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}

  • Compiled with NVCC-10.0

I first compile the code using NVCC 10.0:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

$ nvcc -o test test.cu
$ LD_DEBUG=symbols ./test 2 &> debug.log
$ cat debug.log | grep "cu"
   .
   .
   2783839:     symbol=cuInit;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGet;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGetCount;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGetName;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceTotalMem_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGetAttribute;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGetP2PAttribute;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDriverGetVersion;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGetByPCIBusId;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGetPCIBusId;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDeviceGetUuid;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDevicePrimaryCtxRetain;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDevicePrimaryCtxRelease;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDevicePrimaryCtxSetFlags;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDevicePrimaryCtxGetState;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuDevicePrimaryCtxReset;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxCreate_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetFlags;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxSetCurrent;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetCurrent;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxDetach;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetApiVersion;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetDevice;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetLimit;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxSetLimit;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetCacheConfig;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxSetCacheConfig;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetSharedMemConfig;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxGetStreamPriorityRange;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxSetSharedMemConfig;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuCtxSynchronize;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleLoad;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleLoadData;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleLoadFatBinary;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleUnload;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleGetFunction;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleGetGlobal_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleGetTexRef;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuModuleGetSurfRef;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuLinkCreate;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuLinkAddData;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuLinkAddFile;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuLinkComplete;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuLinkDestroy;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemGetInfo_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemAllocManaged;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemAlloc_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemAllocPitch_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemFree_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemGetAddressRange_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemFreeHost;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemHostAlloc;  lookup in file=/usr/lib/libcuda.so.1 [0]
   2783839:     symbol=cuMemHostGetDevicePointer_v2;  lookup in file=/usr/lib/libcuda.so.1 [0]


So the loader actually does perform symbol lookups for the driver symbols, which I believe is why I can intercept the underlying driver symbols of the runtime API.

  • Compiled with CUDA-11.4
$ /opt/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0

$ /opt/cuda/bin/nvcc -o test test.cu
$ LD_DEBUG=symbols ./test 2 &> debug.log
$ cat debug.log | grep "cu"

This time I don’t observe any symbol lookups for the CUDA driver symbols, and I can’t hook the corresponding driver symbols of the runtime API.
So I want to know what is different between 10.0 and 11.4 during compilation or execution of the program, and why I can’t intercept the underlying driver symbols of the runtime API when the program is compiled with NVCC 11.4.

thx

One more question: I have read above that the runtime API calls can be hooked with the LD_PRELOAD trick when the runtime is linked dynamically. Is it possible to use this trick on TensorFlow? As I understand it, TensorFlow is also dynamically linked to the CUDA runtime. However, when I tried this trick in the NVIDIA NGC TensorFlow container, it didn’t seem to work.

thx

The same.

I tested NVCC 11.1.1 and hooking works, but with NVCC 11.6.0 I failed to hook.
Any progress here?

Yes, I have the same question.

Starting with CUDA 11.3, NVIDIA provides a driver API, cuGetProcAddress, for retrieving CUDA driver symbols. As a result, the loader-visible symbol lookups for the CUDA driver APIs (other than cuGetProcAddress itself) no longer happen at runtime. If you want to hook the other CUDA driver APIs, you need to hook cuGetProcAddress first, and have it return the modified functions you want.

Reference: CUDA 11.3 new features

CUDA 11.3 also introduces a new driver and runtime API to query memory addresses for driver API functions. Previously, there was no direct way to obtain function pointers to the CUDA driver symbols. To do so, you had to call into dlopen, dlsym, or GetProcAddress. This feature implements a new driver API, cuGetProcAddress, and the corresponding new runtime API cudaGetDriverEntryPoint.

The idea is very simple. This is the cuGetProcAddress API: CUresult CUDAAPI cuGetProcAddress(const char *symbol, void **pfn, int cudaVersion, cuuint64_t flags)
The symbol argument names the requested CUDA driver API, e.g. cuMemAlloc or cuMemAllocManaged, and pfn receives the pointer to that API. Therefore, you can hook cuGetProcAddress and assign a modified CUDA driver function to *pfn.
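
Putting that together, here is a minimal sketch of an interposed cuGetProcAddress for the CUDA 11.x signature (CUDA 12 renames it to cuGetProcAddress_v2 with an extra parameter). The names hooked_cuGetProcAddress and my_cuMemAlloc, and the libcuda.so.1 path, are my own illustrative choices, and the wrapper is assumed to have already been handed to the runtime in place of the real entry point (for example by a dlsym interposer, as in the cuHook sample):

// hook_getproc.cpp -- sketch of an interposed cuGetProcAddress (CUDA 11.x signature).
// Build: g++ -shared -fPIC -I/usr/local/cuda/include hook_getproc.cpp -o libgetproc.so -ldl
#include <cstdio>
#include <cstring>
#include <dlfcn.h>
#include <cuda.h>

static CUresult (*real_cuGetProcAddress)(const char *, void **, int, cuuint64_t) = nullptr;
static CUresult (*real_cuMemAlloc)(CUdeviceptr *, size_t) = nullptr;

// Wrapper handed back to the runtime instead of the real cuMemAlloc.
static CUresult CUDAAPI my_cuMemAlloc(CUdeviceptr *dptr, size_t bytesize) {
    std::fprintf(stderr, "[hook] cuMemAlloc(%zu)\n", bytesize);
    return real_cuMemAlloc(dptr, bytesize);
}

extern "C" CUresult CUDAAPI hooked_cuGetProcAddress(const char *symbol, void **pfn,
                                                    int cudaVersion, cuuint64_t flags) {
    if (real_cuGetProcAddress == nullptr) {
        // Resolve the genuine driver entry point once. Assumes the driver library is
        // libcuda.so.1; if your hook also interposes dlsym, call the *real* dlsym here
        // so you do not recurse into your own hook.
        void *drv = dlopen("libcuda.so.1", RTLD_NOW | RTLD_GLOBAL);
        real_cuGetProcAddress = (CUresult (*)(const char *, void **, int, cuuint64_t))
                                    dlsym(drv, "cuGetProcAddress");
    }

    CUresult res = real_cuGetProcAddress(symbol, pfn, cudaVersion, flags);
    if (res != CUDA_SUCCESS) return res;

    if (std::strcmp(symbol, "cuGetProcAddress") == 0) {
        // The runtime asks cuGetProcAddress for itself; hand back the wrapper
        // so every later lookup keeps going through this hook.
        *pfn = (void *)&hooked_cuGetProcAddress;
    } else if (std::strcmp(symbol, "cuMemAlloc") == 0) {
        real_cuMemAlloc = (CUresult (*)(CUdeviceptr *, size_t))(*pfn);  // save the real one
        *pfn = (void *)&my_cuMemAlloc;                                  // return ours instead
    }
    return res;
}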

Hi, I changed my hook to handle cuGetProcAddress. The first symbol requested is ‘cuGetProcAddress’ itself, which looks like a loop: cuGetProcAddress tries to get the address of itself. How do I get the real address of cuGetProcAddress?

Hi, you can refer to our lab’s GitHub; the source code of the hook is included in that repository.

Hi, thank you for the answer. BTW, I did a lot of testing on KubeShare from your team. I noticed that the hook debug log only prints information before the epoch messages; after that, there is no more hook log. Why does it behave like that? cuLaunchKernel should occur during each step, shouldn’t it?

best

Hi, thanks for your report, we need some time to find out the reason.

The interpose method works fine with most API calls, except cuMemcpyHtoD and the other functions in the ‘memcpy’ group.

Here is the code snippet from my test (inside my interposed cuGetProcAddress):

CUresult CUDAAPI cuGetProcAddress(const char *symbol, void **pfn, int cudaVersion, cuuint64_t flags) {
  // ...
  } else if (strcmp(symbol, STRINGIFY(cuMemcpyHtoD)) == 0) {
    if (real_func[SYM_CU_MEM_H2D] == NULL) {
      real_func[SYM_CU_MEM_H2D] = *pfn;      // save the real driver pointer
    }
    std::cout << " getProc H2D" << std::endl;
    *pfn = (void *)(&cuMemcpyHtoD_l);        // return my local wrapper instead
  }
  // ...
}
In the test, ‘getProc H2D’ is printed out, so the lookup for ‘cuMemcpyHtoD’ is intercepted correctly. However, the function pointer I return (cuMemcpyHtoD_l) is not called as expected; the driver seems to ignore the returned function pointer.

Am I missing anything special about ‘cuMemcpyHtoD’, or is there a solution to this issue?

thanks,

Have you solved it?

Hi, I cannot access your GitHub link, would you mind updating it? It would really help, thank you.

It’s still there; just remove the branch part from the URL:

We can also replace entries in the GOT at runtime and intercept CUDA runtime functions that way. Here is a simple example, but it has only been tested on Ubuntu.
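
As a starting point, here is a rough sketch of that GOT-patching idea using the third-party plthook library; the plthook_* calls are assumptions based on its documented API, so verify them against the library, and note that this only helps when the caller reaches cudaMalloc through a PLT entry (i.e. libcudart.so is linked dynamically):

// got_hook.cpp -- sketch of intercepting a CUDA runtime call by patching the
// caller's PLT/GOT entry at runtime via plthook (github.com/kubo/plthook).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <cstdio>
#include <cstddef>
#include <dlfcn.h>
#include <plthook.h>

typedef int (*cudaMalloc_t)(void **, size_t);   // cudaError_t is an int-sized enum
static cudaMalloc_t real_cudaMalloc = nullptr;

static int my_cudaMalloc(void **devPtr, size_t size) {
    std::fprintf(stderr, "[got-hook] cudaMalloc(%zu bytes)\n", size);
    return real_cudaMalloc(devPtr, size);
}

// Run early in the process (e.g. from a preloaded .so) and patch the GOT slot
// the main executable jumps through when it calls cudaMalloc.
__attribute__((constructor)) static void install_got_hook() {
    // Resolve the real function by symbol rather than trusting the old GOT value,
    // which may still point at the lazy-binding stub.
    real_cudaMalloc = (cudaMalloc_t)dlsym(RTLD_DEFAULT, "cudaMalloc");
    if (!real_cudaMalloc) return;                 // runtime not linked dynamically

    plthook_t *ph = nullptr;
    if (plthook_open(&ph, nullptr) != 0) {        // nullptr selects the main program
        std::fprintf(stderr, "plthook_open: %s\n", plthook_error());
        return;
    }
    void *old = nullptr;
    if (plthook_replace(ph, "cudaMalloc", (void *)my_cudaMalloc, &old) != 0)
        std::fprintf(stderr, "plthook_replace: %s\n", plthook_error());
    plthook_close(ph);
}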