My use case is that I’m trying to separate a CUDA client application from direct control of the device, using an interposition library that queues up kernel launches for a separate “scheduler” process to execute in some order later. An important part of this is communicating to that separate process where to find the kernels themselves in the scheduler process’s memory space.
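To make the setup concrete, the client-side shim looks roughly like this (a minimal sketch, LD_PRELOADed ahead of libcuda; the struct offset for the name is the undocumented hack from the thread linked below and varies by CUDA version):

```c
#define _GNU_SOURCE
#include <cuda.h>
#include <dlfcn.h>
#include <stdio.h>
#include <stdint.h>

typedef CUresult (*launch_fn)(CUfunction, unsigned, unsigned, unsigned,
                              unsigned, unsigned, unsigned,
                              unsigned, CUstream, void **, void **);

CUresult cuLaunchKernel(CUfunction f,
                        unsigned gx, unsigned gy, unsigned gz,
                        unsigned bx, unsigned by, unsigned bz,
                        unsigned shmem, CUstream stream,
                        void **params, void **extra)
{
    /* Undocumented hack: some driver versions keep a pointer to the
     * mangled kernel name near the start of the opaque CUfunction
     * struct. The offset (8 here) is an assumption, not a stable ABI. */
    const char *name = *(const char **)((uintptr_t)f + 8);
    fprintf(stderr, "intercepted launch of %s\n", name ? name : "<unknown>");

    /* The real design would enqueue (name, dims, params) for the
     * scheduler process; for now just forward to the real driver call. */
    static launch_fn real_launch;
    if (!real_launch)
        real_launch = (launch_fn)dlsym(RTLD_NEXT, "cuLaunchKernel");
    return real_launch(f, gx, gy, gz, bx, by, bz, shmem, stream,
                       params, extra);
}
```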
On the client side, I am interposing the CUDA Driver API to catch cuLaunchKernel, and, using this hack (https://devtalk.nvidia.com/default/topic/821920/how-to-get-a-kernel-functions-name-through-its-pointer-/), I retrieve the kernel’s symbol name from the CUfunction argument. When I catch a kernel launch made from within the cuBLAS API, from “cublasSgemm_v2” for example, I can’t actually find the address of that kernel using the dynamic loader functions. The mangled GEMM kernel name, verified with nvprof, is “_Z13gemmk1_kernelIfLi256ELi5ELb0ELb0ELb0ELb0EEv18cublasGemmk1ParamsIT_EPKS1_S4_PS”, but it is nowhere to be found at runtime via dlsym calls into libcublas.so. Any advice on how to make these BLAS kernels callable from another process?
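For reference, the lookup that comes back empty looks roughly like this (kernel name truncated exactly as nvprof printed it):

```c
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *h = dlopen("libcublas.so", RTLD_NOW | RTLD_GLOBAL);
    if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    /* Comes back NULL -- presumably because the kernel is device code
     * embedded in the library's fatbinary, not an exported host symbol
     * in the dynamic symbol table. */
    void *sym = dlsym(h,
        "_Z13gemmk1_kernelIfLi256ELi5ELb0ELb0ELb0ELb0EEv"
        "18cublasGemmk1ParamsIT_EPKS1_S4_PS");
    printf("dlsym -> %p\n", sym);
    return 0;
}
```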