Something similar should be possible with NVBLAS:
It works by interception (e.g. via `LD_PRELOAD`) rather than re-linking. The advantage is that NVBLAS aims to accelerate only those calls where there will be a (performance) benefit.
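As a sketch of how the interception is wired up (the paths and program name here are placeholders for your own installation and application):

```shell
# No relink needed: the dynamic loader resolves dgemm_ etc. in libnvblas.so
# first, and NVBLAS forwards any call it chooses not to accelerate to the
# CPU BLAS named in nvblas.conf.
env NVBLAS_CONFIG_FILE=/path/to/nvblas.conf \
    LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so \
    ./my_blas_app
```

Because this happens at load time, the application binary itself is unchanged; the same executable runs on the CPU BLAS if you simply drop the `LD_PRELOAD`.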
CUBLAS works best when you have a sequence of linear algebra operations, where you can retain intermediate results on the GPU. You do not want to be transferring data back and forth on each library call. Since CUBLAS generally requires the programmer to manage data movement explicitly, the programmer can decide when and which results need to be moved. A "dumb" BLAS drop-in replacement built on CUBLAS doesn't know any of this, and so has to transfer data to and from the GPU on every library call. That limits the number and sizes of problems for which using the GPU is still beneficial.
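To make the "retain intermediates on the GPU" point concrete, here is a minimal cuBLAS sketch that chains two GEMMs, D = (A*B)*C, with a single upload and a single download. The matrix size `n` and the buffer names are illustrative, and error checking is omitted for brevity:

```c
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 1024;
    size_t bytes = (size_t)n * n * sizeof(double);
    double *A = malloc(bytes), *B = malloc(bytes),
           *C = malloc(bytes), *D = malloc(bytes);
    /* ... fill A, B, C ... */

    cublasHandle_t h;
    cublasCreate(&h);

    double *dA, *dB, *dC, *dT, *dD;
    cudaMalloc((void **)&dA, bytes); cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes); cudaMalloc((void **)&dT, bytes);
    cudaMalloc((void **)&dD, bytes);

    /* One upload for the whole sequence of operations */
    cublasSetMatrix(n, n, sizeof(double), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(double), B, n, dB, n);
    cublasSetMatrix(n, n, sizeof(double), C, n, dC, n);

    const double one = 1.0, zero = 0.0;
    /* T = A * B : the intermediate stays in device memory */
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, dA, n, dB, n, &zero, dT, n);
    /* D = T * C : reuses dT with no round trip to the host */
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, dT, n, dC, n, &zero, dD, n);

    /* One download at the end */
    cublasGetMatrix(n, n, sizeof(double), dD, n, D, n);

    cudaFree(dA); cudaFree(dB); cudaFree(dC); cudaFree(dT); cudaFree(dD);
    cublasDestroy(h);
    free(A); free(B); free(C); free(D);
    return 0;
}
```

A "dumb" intercepting replacement cannot do this: each intercepted `dgemm` call would upload its operands and download its result, paying the PCIe transfer cost twice where the explicit version above pays it once for the whole chain.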
NVBLAS attempts to mitigate this by only accelerating the routines and problem sizes that will actually show a performance benefit even in this "dumb" model. Other intercepted calls are simply passed through to your existing (CPU) BLAS implementation.
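The routine selection and the fallback CPU BLAS are controlled through the NVBLAS configuration file. A sketch of an `nvblas.conf`, with an assumed OpenBLAS path as the CPU fallback:

```shell
# CPU BLAS that non-accelerated calls fall back to (path is an assumption)
NVBLAS_CPU_BLAS_LIB  /usr/lib/libopenblas.so

# GPUs NVBLAS may use
NVBLAS_GPU_LIST      ALL

# Tile size used when splitting large GEMMs across devices
NVBLAS_TILE_DIM      2048

# Never send SGEMM to the GPU, regardless of size
NVBLAS_GPU_DISABLED_SGEMM

# Route DGEMM to the CPU when the estimated GPU share of the
# work falls below this ratio
NVBLAS_CPU_RATIO_DGEMM 0.07

NVBLAS_LOGFILE       nvblas.log
```

Point the `NVBLAS_CONFIG_FILE` environment variable at this file before running your application.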
Getting more benefit from GPU BLAS acceleration than this requires something other than a dumb relink or intercept model.