Huge time cost of cublasCreate(): 10 seconds. Is this too high?

Hi All,

I am implementing a conjugate gradient (CG) solver using the cuSPARSE_v2/cuBLAS_v2 libraries to handle a large sparse matrix in my research. The weird thing I observed is the huge time taken by the cublasCreate() function. I am aware that library initialization is usually expensive, and from searching the forums I found that cublasCreate typically takes on the order of 100 ms, whereas in my code it takes 10 seconds! The whole CG iteration part only takes 0.6 to 1 second. I also implemented CG solvers using the CUSP library, which performed quite well, with a total run time of about 0.5 seconds.

So can anybody clarify whether 10 seconds is a normal cost for cublasCreate? If so, why does the CUSP library perform so much better, with a nearly negligible initialization cost?

I am using a GTX 980 Ti and CUDA 7.5. Here is my code snippet for timing cublasCreate:

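    // Requires <stdio.h>, <sys/time.h> and <cublas_v2.h>

    // Timing begin
    struct timeval begin, end;
    gettimeofday(&begin, 0);

    cublasStatus = cublasCreate(&cublasHandle);

    // Timing end
    gettimeofday(&end, 0);

    if (cublasStatus != CUBLAS_STATUS_SUCCESS)
        printf("cublasCreate failed with status %d\n", (int)cublasStatus);

    double elapsed = (end.tv_sec - begin.tv_sec) * 1000.0 + (end.tv_usec - begin.tv_usec) / 1000.0;
    printf("\nTime elapsed: %f ms.\n", elapsed);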
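One way to narrow down where such a delay comes from is to time CUDA context creation separately from cublasCreate(), because if no earlier CUDA runtime call has been made, cublasCreate() may also be where the context gets initialized. The sketch below is only an illustration under that assumption: elapsed_ms and time_call_ms are hypothetical helpers, not part of the original code or of any CUDA library. They time a single call with clock_gettime(CLOCK_MONOTONIC), which, unlike gettimeofday(), is not affected by adjustments to the system clock.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical helper: milliseconds between two timespec samples (end - start).
 * A negative nanosecond difference is handled by the seconds term. */
double elapsed_ms(const struct timespec *start, const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec) * 1000.0
         + (double)(end->tv_nsec - start->tv_nsec) / 1.0e6;
}

/* Hypothetical helper: run fn(arg) once and return its duration in ms,
 * measured with the monotonic clock rather than wall-clock time. */
double time_call_ms(void (*fn)(void *), void *arg)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ms(&start, &end);
}
```

For example, one could wrap cudaFree(0), a common idiom for forcing CUDA context creation, in one small callback and cublasCreate(&cublasHandle) in another, then pass each to time_call_ms in that order. If most of the time shows up in the first call, the delay lies in CUDA initialization rather than in cuBLAS itself.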

Thanks a lot!


I finally found the cause: our main server node was not functioning properly and could not communicate normally with the GPU nodes, which somehow slowed down the dynamic linking of the cuBLAS library. A reboot fixed everything.

So there is no problem with cublasCreate() itself. I am posting this here in case anyone encounters a similar situation (though it is unlikely).