Building containerized GPUs from a single physical GPU - cudart static linking troubles

Hey guys,
my organization handles a large volume of closed-source GPU applications that run on CUDA. Right now we have no way of virtualizing a GPU device into “memory slots” so that individual applications can run concurrently with each other on the same device.
Essentially we’re looking to wrap each cudaGetDeviceProperties call and all CUDA memory operations so we can present smaller, virtualized GPUs to individual applications.

So far we’ve been able to hook the CUDA driver API (cuda.h) successfully, since that library links dynamically with executables (we use LD_PRELOAD and dlsym). However, we’re unsure how to properly hook the CUDA runtime API (cuda_runtime.h & cuda_runtime_api.h), since the runtime library (cudart_static.a) is statically linked into executables by default.
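For reference, a minimal sketch of how an interposer like ours is built and preloaded (file and application names are placeholders):

gcc -shared -fPIC -o libhook.so hook.c -ldl
LD_PRELOAD=$PWD/libhook.so ./some_cuda_app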

Any tips or strategies on how to interface with the CUDA runtime API would be great. Right now our best bet is using ptrace & nm in a separate “sentry” process that inspects the executable’s symbol table (assuming it has one) and replaces specific function pointers with our wrapper pointers, but this method is quite clumsy.
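With cudart statically linked, the runtime entry points live in the executable itself; assuming the binary isn’t stripped, something like this lists them:

nm --defined-only ./some_cuda_app | grep ' T cuda'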

You can force the CUDA runtime to link dynamically (i.e. use the cudart .so). There is an nvcc compile option for that, described in the nvcc documentation.
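Roughly like this, assuming you can rebuild the application (which may not be the case for closed-source binaries):

nvcc --cudart=shared -o app app.cu
ldd ./app | grep cudart    # should now list libcudart.so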

It’s not quite clear to me what you want to achieve with this approach. GPUs can already be dynamically shared between multiple users/applications/processes so that individual applications can run “concurrently”.

The only rub here is that kernels from separate processes will not run concurrently using this approach; then again, actual concurrent kernel execution is a hard thing to witness anyway, even from a single app/process. Using this methodology, however, independent processes can all have resident allocations on the GPU, and as kernels need to be executed, the GPU will context-switch from one application/process to the next.

If the above built-in sharing (device must be in Default compute mode) is not sufficient, you could also try to leverage CUDA MPS to allow independent apps to share the GPU in a less restricted fashion. A writeup is here:

http://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app
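To check or change the compute mode mentioned above (changing it requires root):

nvidia-smi -q -d COMPUTE        # show the current compute mode
sudo nvidia-smi -c DEFAULT      # put the device in Default compute mode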

Perhaps what you are after is restricting an application/process to a particular amount of memory. AFAIK CUDA does not provide a good mechanism for that right now, so hooking the memory allocation routines might be beneficial there.
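For example, since you can already interpose the driver API, a per-process quota could be enforced there. A minimal sketch, assuming LD_PRELOAD interposition of cuMemAlloc (GPU_SLOT_BYTES is a hypothetical environment variable):

#define _GNU_SOURCE
#include <cuda.h>
#include <dlfcn.h>
#include <stdlib.h>

static size_t used_bytes;   /* bytes handed out to this process so far */

/* cuda.h #defines cuMemAlloc to cuMemAlloc_v2, so this definition exports
   the versioned symbol; look up the same versioned name with dlsym. */
CUresult cuMemAlloc (CUdeviceptr* dptr, size_t bytesize) {

        const char* s = getenv ("GPU_SLOT_BYTES");                    /* hypothetical */
        size_t quota = s ? strtoull (s, NULL, 10) : (size_t)1 << 30;  /* default 1 GiB */

        if (used_bytes + bytesize > quota)
                return CUDA_ERROR_OUT_OF_MEMORY;                      /* slot exhausted */

        CUresult (*real)(CUdeviceptr*, size_t);
        *(void **)(&real) = dlsym (RTLD_NEXT, "cuMemAlloc_v2");

        CUresult err = (*real)(dptr, bytesize);
        if (err == CUDA_SUCCESS)
                used_bytes += bytesize;
        return err;
}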

Hey txbob, thanks for the reply!
Yeah, I should have clarified: we’re looking to restrict an application’s access to a block of global memory, something like a “memory slot”. If, for example, a greedy application tried to allocate all of the available global memory, it would first make a cudaGetDeviceProperties call, which would return a global memory value equal to its slot size. Then, when it tried to allocate memory, we’d offset the pointers returned from cudaMalloc to stay within the slot’s memory address space.
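To illustrate the first trick, a minimal sketch of the cudaGetDeviceProperties wrapper. It assumes cudart is dynamically linked; GPU_SLOT_BYTES is again a hypothetical knob, and newer toolkits may version the symbol (e.g. cudaGetDeviceProperties_v2):

#define _GNU_SOURCE
#include <cuda_runtime_api.h>
#include <dlfcn.h>
#include <stdlib.h>

cudaError_t cudaGetDeviceProperties (struct cudaDeviceProp* prop, int device) {

        cudaError_t (*real)(struct cudaDeviceProp*, int);
        *(void **)(&real) = dlsym (RTLD_NEXT, "cudaGetDeviceProperties");

        cudaError_t err = (*real)(prop, device);

        const char* s = getenv ("GPU_SLOT_BYTES");                   /* hypothetical */
        size_t slot = s ? strtoull (s, NULL, 10) : (size_t)1 << 30;

        /* Report only the slot's size as the device's global memory. */
        if (err == cudaSuccess && prop->totalGlobalMem > slot)
                prop->totalGlobalMem = slot;
        return err;
}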

However, all of those “tricks” are moot as long as we can’t reliably hook the CUDA runtime API. What would be ideal is if we could globally hook any CUDA runtime API call and intercept it before it jumps to the function definitions in cudart_static.a.

Any advice on how to do that is greatly appreciated! We plan on releasing the finished product as open source once we’re sure it works properly.

Hi Zeryx,

I found your thread here and I wanted to ask one question. You mention you are able to hook into the ‘cu*’ functions successfully. I’m trying to do the same thing, but it doesn’t work: I run a CUDA sample and LD_PRELOAD my “.so” to intercept cuLaunchKernel, but the hook is never called. Could you please provide some information on how you are doing this?

Here is my stub code:

#define _GNU_SOURCE
#include <cuda.h>
#include <dlfcn.h>
#include <stdio.h>
//#include <cuda_runtime.h>
//#include <driver_types.h>

void cuLaunchKernelHelper (CUstream hStream);


CUresult cuLaunchKernel (CUfunction f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, CUstream hStream, void** kernelParams, void** extra) {

        CUresult (*real)(CUfunction f,
                        unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ,
                        unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ,
                        unsigned int sharedMemBytes, CUstream hStream, void** kernelParams, void** extra);

        /* Find the next definition of cuLaunchKernel in the lookup order,
           i.e. the real one in libcuda.so. */
        *(void **)(&real) = dlsym (RTLD_NEXT, "cuLaunchKernel");

        cuLaunchKernelHelper (hStream);

        /* Forward the call and propagate the driver's return code. */
        return (*real)(f, gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ, sharedMemBytes, hStream, kernelParams, extra);
}

void cuLaunchKernelHelper (CUstream hStream) {
        (void) hStream;  /* unused; the printf just proves the hook fired */
        printf ("cuLaunchKernelHelper\n");
}
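One gotcha worth checking: LD_PRELOAD interposition only catches calls that are resolved through the dynamic linker, so if the sample reaches the driver through its own dlopen/dlsym rather than linking libcuda directly, the hook will never fire. ldd shows what the binary links dynamically (sample name is a placeholder):

ldd ./vectorAdd | grep -E 'libcuda|libcudart'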

Thanks,
Saman