My organization handles large volumes of closed-source GPU applications that run on CUDA. Right now we have no way of virtualizing a GPU device into "memory slots" so that individual applications can run concurrently with each other on the same device.
Essentially, we're looking to wrap each cudaGetDeviceProperties call and all CUDA memory operations so we can present smaller, virtualized GPUs to individual applications.
So far we've been able to hook the CUDA driver API (cuda.h) successfully, since libcuda.so is dynamically linked into executables (we use LD_PRELOAD and dlsym). However, we're unsure how to properly hook the CUDA runtime API (cuda_runtime.h and cuda_runtime_api.h), since nvcc links the runtime statically into executables by default (libcudart_static.a).
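For reference, our driver-API shim boils down to something like the sketch below: intercept an allocation entry point, resolve the real symbol with dlsym(RTLD_NEXT, ...), and enforce a per-process quota before forwarding. The typedefs and MAX_BYTES here are illustrative stand-ins (real code includes cuda.h, and the quota would come from our virtualization config):

```c
/* Minimal sketch of an LD_PRELOAD shim for the CUDA driver API.
 * The typedefs below are stand-ins for the real cuda.h definitions,
 * and MAX_BYTES is an illustrative per-process quota.
 * Note: recent cuda.h headers remap many entry points to versioned
 * names (e.g. cuMemAlloc -> cuMemAlloc_v2), so the exported symbol
 * you need to interpose may carry a _v2 suffix. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

typedef int CUresult;                 /* stand-in for the real CUresult enum */
typedef unsigned long long CUdeviceptr;
#define CUDA_SUCCESS 0
#define CUDA_ERROR_OUT_OF_MEMORY 2

static size_t used_bytes = 0;
static const size_t MAX_BYTES = 1ULL << 30;  /* pretend this app only sees 1 GiB */

CUresult cuMemAlloc(CUdeviceptr *dptr, size_t bytesize) {
    /* Resolve the real symbol in libcuda.so, next in the lookup order. */
    static CUresult (*real_alloc)(CUdeviceptr *, size_t) = NULL;
    if (!real_alloc)
        real_alloc = (CUresult (*)(CUdeviceptr *, size_t))
                         dlsym(RTLD_NEXT, "cuMemAlloc");

    /* Enforce the virtual "memory slot" before touching the device. */
    if (used_bytes + bytesize > MAX_BYTES)
        return CUDA_ERROR_OUT_OF_MEMORY;

    CUresult r = real_alloc(dptr, bytesize);
    if (r == CUDA_SUCCESS)
        used_bytes += bytesize;
    return r;
}
```

Compiled into a shared object and injected with LD_PRELOAD, this works cleanly for anything that goes through libcuda.so; the problem below is that the runtime API never hits the dynamic linker at all.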
Any tips or strategies for interfacing with the CUDA runtime API would be great. Right now our best bet is using ptrace and nm in a separate "sentry" process that inspects the executable's symbol table (assuming it hasn't been stripped) and replaces specific function pointers with pointers to our wrappers, but this method is quite clumsy.