Injecting your own implementation of API calls without modifying existing binaries can be done (on most Linux/Unix systems) through LD_PRELOAD.
If you have the source code of the program, you can simply use a global preprocessor macro to reroute specific function calls to a differently named version of that call.
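A minimal sketch of the macro approach, written so it compiles without the CUDA toolkit. The names `lib_alloc`, `my_lib_alloc` and `demo_macro_reroute` are hypothetical stand-ins invented for illustration; in the CUDA case the rerouted name would be something like `cudaMalloc`, and how such a macro interacts with the CUDA headers has not been verified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the library call we want to reroute.
   Returns 0 on success, 1 on failure. */
static int lib_alloc(void **ptr, size_t size) {
    *ptr = malloc(size);
    return *ptr ? 0 : 1;
}

/* Our replacement under a different name. It counts calls and then
   forwards to the original, which is still reachable here because the
   macro below has not been defined yet. */
static size_t my_alloc_calls = 0;

static int my_lib_alloc(void **ptr, size_t size) {
    assert(ptr != NULL);
    my_alloc_calls++;
    return lib_alloc(ptr, size);
}

/* From this point on, every textual use of lib_alloc is compiled as a
   call to my_lib_alloc. */
#define lib_alloc my_lib_alloc

/* Application code that is unaware of the rerouting. */
static size_t demo_macro_reroute(void) {
    void *p = NULL;
    if (lib_alloc(&p, 64) == 0) {
        free(p);
    }
    return my_alloc_calls; /* number of calls that reached the replacement */
}
```

In a real build the macro would usually be supplied globally, for example through a `-D` compiler option or a header included everywhere, rather than placed in the middle of a source file as it is here.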
Is there a way to write my own implementation of the CUDA library?
If you want to do this, you have to provide all public functions of the original CUDA device API. With the LD_PRELOAD trick, by contrast, you can override just selected functions.
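As a rough sketch of the LD_PRELOAD route, the following overrides only `cudaMalloc` and `cudaFree` with host-memory fakes. Several things here are assumptions: the return type is written as `int` instead of `cudaError_t` so the file builds without the CUDA headers, the numeric error codes are taken to match the runtime's enum, and the file and library names in the comments are made up. The preload can also only take effect if the program resolves the CUDA runtime dynamically; nvcc links the static runtime by default, so the program would likely need to be built with `--cudart shared`.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed to match the CUDA runtime's cudaError_t values. */
#define FAKE_CUDA_SUCCESS            0 /* cudaSuccess */
#define FAKE_CUDA_ERR_INVALID_VALUE  1 /* cudaErrorInvalidValue */
#define FAKE_CUDA_ERR_MEMORY_ALLOC   2 /* cudaErrorMemoryAllocation */

/* Replacement for cudaMalloc: hands out host memory instead of device
   memory. Assumption: int is ABI-compatible with cudaError_t here. */
int cudaMalloc(void **devPtr, size_t size) {
    if (devPtr == NULL) {
        return FAKE_CUDA_ERR_INVALID_VALUE;
    }
    *devPtr = malloc(size);
    return *devPtr ? FAKE_CUDA_SUCCESS : FAKE_CUDA_ERR_MEMORY_ALLOC;
}

/* Replacement for cudaFree, matching the fake allocator above. */
int cudaFree(void *devPtr) {
    free(devPtr);
    return FAKE_CUDA_SUCCESS;
}

/* Hypothetical build and run steps (file names are placeholders):
 *   gcc -shared -fPIC fake_cudart.c -o libfake_cudart.so
 *   LD_PRELOAD=./libfake_cudart.so ./app
 * All other runtime functions still come from the real library. */
```

Because the dynamic loader searches preloaded libraries first, only the two symbols defined here are replaced, which is what makes this lighter than reimplementing the whole API.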
/usr/local/cuda/bin/…//lib64/libcudart_static.a(libcudart_static.a.o): In function `cudaMalloc': (.text+0x41d60): multiple definition of `cudaMalloc'
/tmp/tmpxft_00005994_00000000-10_lib.o:tmpxft_00005994_00000000-5_lib.cudafe1.cpp:(.text+0x15): first defined here
collect2: error: ld returned 1 exit status
This means nvcc already links in its own implementation (libcudart_static.a), so the build fails at the link step. Can we resolve this?