Calling cuBLAS from a CUDA kernel with static linking

In my project I need to call cuBLAS from inside a CUDA kernel and link cuBLAS statically. I wrote a test program to verify the idea, but it fails to compile:

#include <cuda_runtime.h>
#include <cublas_v2.h>

extern "C" {
__global__ void testcublas(float *d_B, float *d_A, float* d_refC) {
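    // Create a cuBLAS handle from device code.
    cublasHandle_t cb_handle = NULL;
    cublasStatus_t status = cublasCreate(&cb_handle);
    if (status != CUBLAS_STATUS_SUCCESS) return;  // bail out if no handle
    int M = 512, N = 512, K = 512;
    float alpha = 1.0f, beta = 1.0f;
    // Column-major GEMM: d_refC = alpha * d_B * d_A + beta * d_refC
    cublasSgemm(
                cb_handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, M, K,
                &alpha, d_B, N, d_A, K,
                &beta,  d_refC, N
               );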
}

}

Compile command:

nvcc -arch=sm_70 -lcublas_static -lculibos --relocatable-device-code=true cublas_call.cu -o cublas.cubin

But I get these errors:
nvlink error : Undefined reference to 'cublasCreate_v2' in '/tmp/tmpxft_0001d286_00000000-10_cublas_call.o'
nvlink error : Undefined reference to 'cublasSgemm_v2' in '/tmp/tmpxft_0001d286_00000000-10_cublas_call.o'

Any suggestions would be very helpful!
Thanks in advance.

You need to be using CUDA 9.2 or earlier. This capability was removed and is no longer available in CUDA 10.0 and later. On Windows I recommend CUDA 9.1 or earlier.

You need to link against the cuBLAS device library.
You need to link against the CUDA device runtime library.

-lcublas_device -lcudadevrt

If you have CUDA 9.2 or earlier, study the Makefile for the simpleDevLibCUBLAS sample code.

Thanks Robert!

By the way, could you explain why that capability was removed in CUDA 10.0? Does CUDA 10.0 offer a better solution for this case?
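
On the second question: with CUDA 10.0 and later, where the device library is no longer available, cuBLAS has to be called from host code rather than from inside a kernel. The following is only a minimal sketch of that pattern, not something taken from this thread. The helper name hostSgemm and its std::vector interface are hypothetical, and it assumes small square column-major matrices and a working GPU.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper (illustration only): computes C = A * B for n x n
// column-major matrices by calling cuBLAS from the host.
// Returns true if every CUDA and cuBLAS call succeeded.
bool hostSgemm(int n, const std::vector<float>& A,
               const std::vector<float>& B, std::vector<float>& C) {
    const size_t count = static_cast<size_t>(n) * n;
    const size_t bytes = count * sizeof(float);
    if (A.size() < count || B.size() < count) return false;
    C.resize(count);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cublasHandle_t handle = nullptr;

    // Allocate device buffers and create the handle on the host.
    bool ok = cudaMalloc((void**)&dA, bytes) == cudaSuccess &&
              cudaMalloc((void**)&dB, bytes) == cudaSuccess &&
              cudaMalloc((void**)&dC, bytes) == cudaSuccess &&
              cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS;

    if (ok) {
        ok = cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
             cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    }
    if (ok) {
        // beta = 0 so the uninitialized contents of dC are ignored.
        const float alpha = 1.0f, beta = 0.0f;
        ok = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                         n, n, n,
                         &alpha, dA, n, dB, n,
                         &beta, dC, n) == CUBLAS_STATUS_SUCCESS;
    }
    if (ok) {
        // Blocking copy on the default stream; it runs after the GEMM.
        ok = cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    if (handle) cublasDestroy(handle);
    cudaFree(dA);  // cudaFree(nullptr) is a no-op
    cudaFree(dB);
    cudaFree(dC);
    return ok;
}
```

A caller would fill A and B on the host, call hostSgemm(n, A, B, C), and read the product from C. Because this is host code, the static-link flags from the original command (-lcublas_static -lculibos) apply to it; newer toolkits may need additional static libraries, so check the cuBLAS documentation for your version.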