Strange Variations in Execution Time of cublas<t>geam() [cublasDgemm]

Hi all,
I am using the cublasDgemm function from Cublas Library and I have tested my code on Nvidia 1080 ti and Tesla V100. I have the following matrix dimensions (to compute A’A):
[243 x 3614] x [3614 x 243]

The execution time of this call is just under 1 ms in most cases, however, it sometimes goes around 13 to 15 ms. It is worthy of mention that cudaStreamSynchronize is called right before and after the cublasDgemm call. The results of this matrix multiplication are extensively tested and there seem to be no other issue. Furthermore, there is similar issue with cv::cuda::gemm from OpenCV Library, which uses cublasDgemm in its implementation. My code is indeed based on OpenCV implementation. It should be mentioned that there is no other kernel launch to the associated Cuda device during this call and all previously launched ones on other streams are synchronized using cudaStreamSynchronize.

I would be appreciated it if anyone could help me with possible reasons for this high variation in execution time of cublasDgemm.
The main piece of code which calls cublasDgemm is based on OpenCV implementation as below (step 4 takes unusual time sometimes]:

// constructor
    cublasSafeCall( cublasCreate_v2(&_cublas_v2H) );
// destructor
        cublasSafeCall( cublasDestroy_v2(_cublas_v2H) );
        _cublas_v2H = nullptr;
// set stream and work space for cublas
void MyArithm::setStream(cudaStream_t c_stream)
        stream =  cv::cuda::StreamAccessor::wrapStream(c_stream);
        cublasSafeCall( cublasSetStream_v2(_cublas_v2H, c_stream) );
        cublasSafeCall( cublasSetWorkspace(_cublas_v2H, (void*), cublas_wspace.size()) );
void MyArithm::gemm(const GpuMat &src1, const GpuMat &src2, double alpha, const GpuMat &src3, double beta, GpuMat &dst, int flags) const
    QElapsedTimer local_timer;
   // CUBLAS works with column-major matrices
    assert( src1.type() == CV_32FC1 || src1.type() == CV_32FC2 || src1.type() == CV_64FC1 || src1.type() == CV_64FC2 );
    assert( src2.type() == src1.type() && (src3.empty() || src3.type() == src1.type()) );

    if (src1.depth() == CV_64F)
        if (!deviceSupports(NATIVE_DOUBLE))
            CV_Error(cv::Error::StsUnsupportedFormat, "The device doesn't support double");

    bool tr1 = (flags & GEMM_1_T) != 0;
    bool tr2 = (flags & GEMM_2_T) != 0;
    bool tr3 = (flags & GEMM_3_T) != 0;

    if (src1.type() == CV_64FC2)
        if (tr1 || tr2 || tr3)
            CV_Error(cv::Error::StsNotImplemented, "transpose operation doesn't implemented for CV_64FC2 type");

    Size src1Size = tr1 ? Size(src1.rows, src1.cols) : src1.size();
    Size src2Size = tr2 ? Size(src2.rows, src2.cols) : src2.size();
    Size src3Size = tr3 ? Size(src3.rows, src3.cols) : src3.size();
    Size dstSize(src2Size.width, src1Size.height);

    assert( src1Size.width == src2Size.height );
    assert( src3.empty() || src3Size == dstSize );

    if( dst.size() != dstSize  || dst.type()!=src1.type() )
        cv::cuda::createContinuous(dstSize, src1.type(), dst);

    if (beta != 0)
        if (src3.empty())
            dst.setTo(Scalar::all(0), stream);
            if (tr3)
                cuda::transpose(src3, dst, stream);
                src3.copyTo(dst, stream);

    stream.waitForCompletion(); // equivalent to cudaStreamSynchronize
    std::cout << " ** Time of gemm [step 1]  =  " << local_timer.nsecsElapsed()/1000000.0 << " ms" << std::endl;

    cublasSafeCall( cublasSetPointerMode_v2(_cublas_v2H, CUBLAS_POINTER_MODE_HOST) );

    std::cout << " ** Time of gemm [step 2]  =  " << local_timer.nsecsElapsed()/1000000.0 << " ms" << std::endl;

    const float alphaf = static_cast<float>(alpha);
    const float betaf = static_cast<float>(beta);

    const cuComplex alphacf = make_cuComplex(alphaf, 0);
    const cuComplex betacf = make_cuComplex(betaf, 0);

    const cuDoubleComplex alphac = make_cuDoubleComplex(alpha, 0);
    const cuDoubleComplex betac = make_cuDoubleComplex(beta, 0);

    cublasOperation_t transa = tr2 ? CUBLAS_OP_T : CUBLAS_OP_N;
    cublasOperation_t transb = tr1 ? CUBLAS_OP_T : CUBLAS_OP_N;

    std::cout << " ** Time of gemm [step 3]  =  " << local_timer.nsecsElapsed()/1000000.0 << " ms" << std::endl;

    switch (src1.type())
    case CV_32FC1:
        cublasSafeCall( cublasSgemm_v2(_cublas_v2H, transa, transb, tr2 ? src2.rows : src2.cols, tr1 ? src1.cols : src1.rows, tr2 ? src2.cols : src2.rows,
            src2.ptr<float>(), static_cast<int>(src2.step / sizeof(float)),
            src1.ptr<float>(), static_cast<int>(src1.step / sizeof(float)),
            dst.ptr<float>(), static_cast<int>(dst.step / sizeof(float))) );

    case CV_64FC1:
        cublasSafeCall( cublasDgemm_v2(_cublas_v2H, transa, transb, tr2 ? src2.rows : src2.cols, tr1 ? src1.cols : src1.rows, tr2 ? src2.cols : src2.rows,
            src2.ptr<double>(), static_cast<int>(src2.step / sizeof(double)),
            src1.ptr<double>(), static_cast<int>(src1.step / sizeof(double)),
            dst.ptr<double>(), static_cast<int>(dst.step / sizeof(double))) );

    case CV_32FC2:
        cublasSafeCall( cublasCgemm_v2(_cublas_v2H, transa, transb, tr2 ? src2.rows : src2.cols, tr1 ? src1.cols : src1.rows, tr2 ? src2.cols : src2.rows,
            src2.ptr<cuComplex>(), static_cast<int>(src2.step / sizeof(cuComplex)),
            src1.ptr<cuComplex>(), static_cast<int>(src1.step / sizeof(cuComplex)),
            dst.ptr<cuComplex>(), static_cast<int>(dst.step / sizeof(cuComplex))) );

    case CV_64FC2:
        cublasSafeCall( cublasZgemm_v2(_cublas_v2H, transa, transb, tr2 ? src2.rows : src2.cols, tr1 ? src1.cols : src1.rows, tr2 ? src2.cols : src2.rows,
            src2.ptr<cuDoubleComplex>(), static_cast<int>(src2.step / sizeof(cuDoubleComplex)),
            src1.ptr<cuDoubleComplex>(), static_cast<int>(src1.step / sizeof(cuDoubleComplex)),
            dst.ptr<cuDoubleComplex>(), static_cast<int>(dst.step / sizeof(cuDoubleComplex))) );

    std::cout << " ** Time of gemm [step 4]  =  " << local_timer.nsecsElapsed()/1000000.0 << " ms" << std::endl;
    std::cout << " gemm size = " << src1.size() << " - " << src2.size() << std::endl;
    // stream.waitForCompletion();
  1. Your benchmark should really average timing over several runs, maybe 100’s or 1000’s.
  2. Have you tried profiling with Nsight Systems? This provide more information on the exact issue.
  3. To get assistance, it’s very helpful to provide a reproducer with exact code and environment (e.g., if you’re using OpenCV a docker file of your exact environment is needed)
1 Like

Dear Mr. Nicely,
Thank you for your quick response. Indeed, I have a bunch of Cuda kernel calls with an acceptable upper bound for their execution time. Due to the fps limitations along with performance, I need to guarantee a realistic upper bound for the execution time of cublasDgemm as for other ones. Furthermore, this variation is a strange behavior depend on my previous experience.
All in all, I have provided the Nvidia profiler result and streams and I hope it might help. Just for debugging purposes, I have called cublasDgemm 11 times on the same data sequentially. As it can be seen the last one has taken about 13 ms which is far from other calls.

  1. What happens if you run 5 or 22 iterations? Is the last one hang each time?
  2. I see 5 non-default streams in the screenshot. Is that all?

Sorry, I forgot to mention. No, indeed it is just by accident and it might happen in every order but it is not more frequent (maybe 1 of 40 calls, or even fewer). Yes, those are all active streams.

I hope I could provide an executable code with exact input in the future, but I think it might be due to something behind the implemented algorithm for cublasgemm. To be exact, might it be originated from things such as atomic operations (or locking strategy) used in cublasgemm? or there is something wrong with my project?

Isn’t the data types and size the same, each iteration?

If yes, execution times should be nearly identical, once GPU is warmed up.

Yes, Indeed, in the source code generated this Nvidia profiler result, I’ve passed the same data for 11 times to cublasDgemm and I have repeated this procedure for different inputs (this crop is for one of them with the same input data). It is worthy of mention that there is similar profile pattern on Tesla V100 and two non-Default streams instead of five streams in the last figure.

System Setting

  • Os: Ubuntu 18.04
  • Input size: [243 x 3614] x [3614 x 243] double matrices (A’A)
  • Nvidia Card: 1080 ti
  • Cuda Version: cuda 11.1

The source is compiled with below nvcc arguments:

CUDA NVCC Compiler Settings

NVCC_OPTIONS = -use_fast_math
NVCC_GEN = -gencode=arch=compute_52,code=sm_52

Two steps used for compiling with dlink:
–machine $$SYSTEM_TYPE $$NVCC_GEN -dc

$CUDA_DIR/bin/nvcc $LIBS \ --machine $SYSTEM_TYPE $NVCC_GEN -dlink \ -o ${QMAKE_FILE_OUT}

Can you run another experiment? Run 1000 iterations and count the ones that are “abnormal”?
You should easily be able to do this in CPU code, let’s say anything over 2ms.

Is this GPU also supporting monitors as well?

FYI, double precision + -use_fast_math has no effect.
-use_fast_math only work with single precision.

1 Like

Yes, I will post the result of 1000 iterations asap. Previous results was generated on my pc with 1 Nvidia card (1080 ti) witch is also used as display card (X-server).
On the other hand, I see the same result when running my code on a google instance with 2 Tesla V100 and a docker. I think I didn’t passed display options (-v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY -e XAUTHORITY -e NVIDIA_DRIVER_CAPABILITIES=all
is not passed to docker run).

So there is no guarantee of a hard real-time timing for GPUs.

Sometimes when you see sporadic delays, on non-stochastic kernels like GEMM, it’s due to host not having a good handshake with device (managed by OS). But we would see additional white space not extended kernel.

Another idea would be to remove the OpenCV layer and only use CUDA, (see examples) or, and see it you have the same issue.

1 Like