Strange Variations in Execution Time of cublas<t>gemm() [cublasDgemm]

Hi all,
I am using the cublasDgemm function from the cuBLAS library and have tested my code on an NVIDIA GTX 1080 Ti and a Tesla V100. I have the following matrix dimensions (to compute A’A):
[243 x 3614] x [3614 x 243]

The execution time of this call is just under 1 ms in most cases; however, it sometimes jumps to around 13 to 15 ms. Note that cudaStreamSynchronize is called right before and after the cublasDgemm call. The results of this matrix multiplication have been tested extensively and there appears to be no other issue. Furthermore, there is a similar issue with cv::cuda::gemm from the OpenCV library, which uses cublasDgemm in its implementation; my code is in fact based on the OpenCV implementation. It should be mentioned that no other kernels are launched on the associated CUDA device during this call, and all previously launched work on other streams is synchronized using cudaStreamSynchronize.

I would appreciate it if anyone could help me with possible reasons for this high variation in the execution time of cublasDgemm.
The main piece of code that calls cublasDgemm is based on the OpenCV implementation, as shown below (step 4 sometimes takes an unusually long time):

// constructor
MyArithm::MyArithm(){
    cublasSafeCall( cublasCreate_v2(&_cublas_v2H) );
    cublas_wspace.create(32*1024*1024);
}
// destructor
MyArithm::~MyArithm()
{
    if(_cublas_v2H){
        cublasSafeCall( cublasDestroy_v2(_cublas_v2H) );
        _cublas_v2H = nullptr;
    }
}
// set stream and work space for cublas
void MyArithm::setStream(cudaStream_t c_stream)
{
    if(c_stream){
        stream =  cv::cuda::StreamAccessor::wrapStream(c_stream);
        cublasSafeCall( cublasSetStream_v2(_cublas_v2H, c_stream) );
        cublasSafeCall( cublasSetWorkspace(_cublas_v2H, (void*)cublas_wspace.data(), cublas_wspace.size()) );
    }
}
void MyArithm::gemm(const GpuMat &src1, const GpuMat &src2, double alpha, const GpuMat &src3, double beta, GpuMat &dst, int flags) const
{
    QElapsedTimer local_timer;
    local_timer.start();
   // CUBLAS works with column-major matrices
    assert( src1.type() == CV_32FC1 || src1.type() == CV_32FC2 || src1.type() == CV_64FC1 || src1.type() == CV_64FC2 );
    assert( src2.type() == src1.type() && (src3.empty() || src3.type() == src1.type()) );

    if (src1.depth() == CV_64F)
    {
        if (!deviceSupports(NATIVE_DOUBLE))
            CV_Error(cv::Error::StsUnsupportedFormat, "The device doesn't support double");
    }

    bool tr1 = (flags & GEMM_1_T) != 0;
    bool tr2 = (flags & GEMM_2_T) != 0;
    bool tr3 = (flags & GEMM_3_T) != 0;

    if (src1.type() == CV_64FC2)
    {
        if (tr1 || tr2 || tr3)
            CV_Error(cv::Error::StsNotImplemented, "transpose operation is not implemented for CV_64FC2 type");
    }

    Size src1Size = tr1 ? Size(src1.rows, src1.cols) : src1.size();
    Size src2Size = tr2 ? Size(src2.rows, src2.cols) : src2.size();
    Size src3Size = tr3 ? Size(src3.rows, src3.cols) : src3.size();
    Size dstSize(src2Size.width, src1Size.height);

    assert( src1Size.width == src2Size.height );
    assert( src3.empty() || src3Size == dstSize );

    if( dst.size() != dstSize  || dst.type()!=src1.type() )
        cv::cuda::createContinuous(dstSize, src1.type(), dst);

    if (beta != 0)
    {
        if (src3.empty())
        {
            dst.setTo(Scalar::all(0), stream);
        }
        else
        {
            if (tr3)
            {
                cuda::transpose(src3, dst, stream);
            }
            else
            {
                src3.copyTo(dst, stream);
            }
        }
    }

    stream.waitForCompletion(); // equivalent to cudaStreamSynchronize
    std::cout << " ** Time of gemm [step 1]  =  " << local_timer.nsecsElapsed()/1000000.0 << " ms" << std::endl;
    local_timer.start();

    cublasSafeCall( cublasSetPointerMode_v2(_cublas_v2H, CUBLAS_POINTER_MODE_HOST) );

    stream.waitForCompletion();
    std::cout << " ** Time of gemm [step 2]  =  " << local_timer.nsecsElapsed()/1000000.0 << " ms" << std::endl;
    local_timer.start();


    const float alphaf = static_cast<float>(alpha);
    const float betaf = static_cast<float>(beta);

    const cuComplex alphacf = make_cuComplex(alphaf, 0);
    const cuComplex betacf = make_cuComplex(betaf, 0);

    const cuDoubleComplex alphac = make_cuDoubleComplex(alpha, 0);
    const cuDoubleComplex betac = make_cuDoubleComplex(beta, 0);

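    // Note: the operands are passed in swapped order below (src2 as A, src1 as B)
    // because a row-major matrix viewed as column-major is its transpose; this
    // way the result dst stays row-major without an explicit transpose.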
    cublasOperation_t transa = tr2 ? CUBLAS_OP_T : CUBLAS_OP_N;
    cublasOperation_t transb = tr1 ? CUBLAS_OP_T : CUBLAS_OP_N;

    stream.waitForCompletion();
    std::cout << " ** Time of gemm [step 3]  =  " << local_timer.nsecsElapsed()/1000000.0 << " ms" << std::endl;
    local_timer.start();

    switch (src1.type())
    {
    case CV_32FC1:
        cublasSafeCall( cublasSgemm_v2(_cublas_v2H, transa, transb, tr2 ? src2.rows : src2.cols, tr1 ? src1.cols : src1.rows, tr2 ? src2.cols : src2.rows,
            &alphaf,
            src2.ptr<float>(), static_cast<int>(src2.step / sizeof(float)),
            src1.ptr<float>(), static_cast<int>(src1.step / sizeof(float)),
            &betaf,
            dst.ptr<float>(), static_cast<int>(dst.step / sizeof(float))) );
        break;

    case CV_64FC1:
        cublasSafeCall( cublasDgemm_v2(_cublas_v2H, transa, transb, tr2 ? src2.rows : src2.cols, tr1 ? src1.cols : src1.rows, tr2 ? src2.cols : src2.rows,
            &alpha,
            src2.ptr<double>(), static_cast<int>(src2.step / sizeof(double)),
            src1.ptr<double>(), static_cast<int>(src1.step / sizeof(double)),
            &beta,
            dst.ptr<double>(), static_cast<int>(dst.step / sizeof(double))) );
        break;

    case CV_32FC2:
        cublasSafeCall( cublasCgemm_v2(_cublas_v2H, transa, transb, tr2 ? src2.rows : src2.cols, tr1 ? src1.cols : src1.rows, tr2 ? src2.cols : src2.rows,
            &alphacf,
            src2.ptr<cuComplex>(), static_cast<int>(src2.step / sizeof(cuComplex)),
            src1.ptr<cuComplex>(), static_cast<int>(src1.step / sizeof(cuComplex)),
            &betacf,
            dst.ptr<cuComplex>(), static_cast<int>(dst.step / sizeof(cuComplex))) );
        break;

    case CV_64FC2:
        cublasSafeCall( cublasZgemm_v2(_cublas_v2H, transa, transb, tr2 ? src2.rows : src2.cols, tr1 ? src1.cols : src1.rows, tr2 ? src2.cols : src2.rows,
            &alphac,
            src2.ptr<cuDoubleComplex>(), static_cast<int>(src2.step / sizeof(cuDoubleComplex)),
            src1.ptr<cuDoubleComplex>(), static_cast<int>(src1.step / sizeof(cuDoubleComplex)),
            &betac,
            dst.ptr<cuDoubleComplex>(), static_cast<int>(dst.step / sizeof(cuDoubleComplex))) );
        break;
    }

    stream.waitForCompletion();
    std::cout << " ** Time of gemm [step 4]  =  " << local_timer.nsecsElapsed()/1000000.0 << " ms" << std::endl;
    std::cout << " gemm size = " << src1.size() << " - " << src2.size() << std::endl;
    local_timer.start();
    // stream.waitForCompletion();
}
  1. Your benchmark should really average the timing over several runs, maybe hundreds or thousands (see the sketch after this list).
  2. Have you tried profiling with Nsight Systems? That would provide more information on the exact issue.
  3. To get assistance, it’s very helpful to provide a reproducer with the exact code and environment (e.g., if you’re using OpenCV, a Dockerfile of your exact environment is needed).
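
For point 1, a minimal averaging sketch with CUDA events might look like the code below. It assumes a cublasHandle_t already bound to the stream via cublasSetStream, and a device buffer dA holding A (3614 x 243, column-major, lda = 3614) plus a 243 x 243 output buffer dC; none of that setup is shown, and the function name is illustrative.

// Sketch: average the DGEMM time over many runs with CUDA events.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

void timeAtA(cublasHandle_t handle, cudaStream_t stream,
             const double* dA, double* dC)
{
    const int n = 243, k = 3614;           // A is k x n, C = A^T * A is n x n
    const double alpha = 1.0, beta = 0.0;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // One warm-up call so first-use overhead doesn't skew the average.
    cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, n, n, k,
                &alpha, dA, k, dA, k, &beta, dC, n);
    cudaStreamSynchronize(stream);

    const int iters = 1000;
    float sumMs = 0.0f, maxMs = 0.0f;
    for (int i = 0; i < iters; ++i) {
        cudaEventRecord(start, stream);
        cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, n, n, k,
                    &alpha, dA, k, dA, k, &beta, dC, n);
        cudaEventRecord(stop, stream);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        sumMs += ms;
        if (ms > maxMs) maxMs = ms;
    }
    std::printf("avg %.3f ms, max %.3f ms over %d runs\n",
                sumMs / iters, maxMs, iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

For point 2, something like "nsys profile -o gemm_report ./your_app" will show whether the extra ~13 ms is spent inside the kernel itself or in a gap before it launches.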

Dear Mr. Nicely,
Thank you for your quick response. Indeed, I have a number of CUDA kernel calls, each with an acceptable upper bound on its execution time. Because of frame-rate limitations as well as performance, I need to guarantee a realistic upper bound for the execution time of cublasDgemm, just as for the other kernels. Furthermore, this variation is strange behavior based on my previous experience.
I have attached the NVIDIA profiler result showing the streams, which I hope will help. Just for debugging purposes, I called cublasDgemm 11 times sequentially on the same data; as can be seen, the last call took about 13 ms, which is far from the other calls.
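
Roughly, the repetition experiment looks like the sketch below, driving the MyArithm wrapper from the code above; the variable names and the printout are illustrative rather than the exact code that produced the profile.

// Sketch: issue the same A'A product 11 times back-to-back and time each call.
// A is a 3614 x 243 CV_64FC1 GpuMat that is already filled with data (not shown).
#include <opencv2/core/cuda.hpp>
#include <cuda_runtime.h>
#include <chrono>
#include <iostream>

void repeatGemm(MyArithm& arithm, const cv::cuda::GpuMat& A, cudaStream_t stream)
{
    cv::cuda::GpuMat dst;                  // receives the 243 x 243 result
    arithm.setStream(stream);

    for (int i = 0; i < 11; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        // GEMM_1_T transposes the first operand, so this computes A' * A.
        arithm.gemm(A, A, 1.0, cv::cuda::GpuMat(), 0.0, dst, cv::GEMM_1_T);
        cudaStreamSynchronize(stream);     // gemm() already syncs; kept for clarity
        auto t1 = std::chrono::steady_clock::now();
        std::cout << "call " << i << ": "
                  << std::chrono::duration<double, std::milli>(t1 - t0).count()
                  << " ms" << std::endl;
    }
}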

  1. What happens if you run 5 or 22 iterations? Does the last one hang each time?
  2. I see 5 non-default streams in the screenshot. Is that all?

Sorry, I forgot to mention: no, it is in fact just by chance, and the slow call can happen in any position, but it is not more frequent than maybe 1 in 40 calls, or even fewer. Yes, those are all the active streams.

I hope to provide executable code with the exact input in the future, but I suspect it might be due to something inside the algorithm implemented by cublasDgemm. To be exact, could it originate from things such as atomic operations (or a locking strategy) used in cublasDgemm, or is there something wrong with my project?

Aren’t the data types and sizes the same each iteration?

If so, the execution times should be nearly identical once the GPU is warmed up.

Yes, indeed. In the source code that generated this NVIDIA profiler result, I passed the same data to cublasDgemm 11 times, and I repeated this procedure for different inputs (this crop is for one of them, with the same input data each time). It is worth mentioning that there is a similar profile pattern on the Tesla V100, with two non-default streams instead of the five streams in the last figure.

System Settings

  • OS: Ubuntu 18.04
  • Input size: [243 x 3614] x [3614 x 243] double matrices (A’A)
  • NVIDIA card: GTX 1080 Ti
  • CUDA version: 11.1

The source is compiled with the nvcc arguments below:

CUDA NVCC Compiler Settings

SYSTEM_NAME = linux
SYSTEM_TYPE = 64
NVCC_OPTIONS = -use_fast_math
NVCC_GEN = -gencode=arch=compute_52,code=sm_52
-gencode=arch=compute_60,code=sm_60
-gencode=arch=compute_61,code=sm_61
-gencode=arch=compute_70,code=sm_70
-gencode=arch=compute_75,code=sm_75
-gencode=arch=compute_80,code=sm_80

Two steps are used for compiling, with device linking (-dlink):

1.
$$CUDA_DIR/bin/nvcc $$NVCC_OPTIONS $$NVCC_INC $$LIBS \
    --machine $$SYSTEM_TYPE $$NVCC_GEN -dc \
    $$SHARED_LIB_CUDA_FLAG \
    -o ${QMAKE_FILE_OUT} -c ${QMAKE_FILE_NAME}

2.
$$CUDA_DIR/bin/nvcc $$LIBS \
    --machine $$SYSTEM_TYPE $$NVCC_GEN -dlink \
    -o ${QMAKE_FILE_OUT}

Can you run another experiment? Run 1000 iterations and count the ones that are “abnormal”, let’s say anything over 2 ms.
You should easily be able to do this with CPU-side timing.
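
For the record, I mean something like the sketch below: a plain host-side timer around each synchronized call, counting everything above a 2 ms threshold. dgemmOnce() is a hypothetical placeholder for one cublasDgemm launch followed by cudaStreamSynchronize (i.e. step 4 in your code).

#include <chrono>
#include <cstdio>

void dgemmOnce();   // placeholder: one cublasDgemm launch + cudaStreamSynchronize

int countSlowRuns()
{
    const int iters = 1000;
    const double thresholdMs = 2.0;
    int slow = 0;

    for (int i = 0; i < iters; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        dgemmOnce();
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms > thresholdMs)
            ++slow;
    }
    std::printf("%d of %d iterations exceeded %.1f ms\n", slow, iters, thresholdMs);
    return slow;
}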

Is this GPU also driving monitors?

FYI, -use_fast_math has no effect on double precision; it only applies to single-precision operations.


Yes, I will post the result of 1000 iterations ASAP. The previous results were generated on my PC with one NVIDIA card (GTX 1080 Ti), which is also used as the display card (X server).
On the other hand, I see the same result when running my code on a Google Cloud instance with two Tesla V100s inside Docker, and I don't think I passed any display options there (i.e. -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY -e XAUTHORITY -e NVIDIA_DRIVER_CAPABILITIES=all is not passed to docker run).

So there is no guarantee of hard real-time timing on GPUs.

Sometimes when you see sporadic delays on deterministic kernels like GEMM, it's because the host doesn't get a good handshake with the device (which is managed by the OS). In that case, though, you would see additional white space in the timeline rather than an extended kernel.

Another idea would be to remove the OpenCV layer and use only CUDA and cuBLAS (see, for example, computeWorks_examples/cublas.cpp at master · mnicely/computeWorks_examples · GitHub) and see if you have the same issue.
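
A stripped-down reproducer along those lines could look roughly like the following sketch (not the code from the linked example; error checking is omitted and the matrix is filled with dummy values):

// Minimal pure-cuBLAS reproducer for the A'A case, no OpenCV involved.
// A is 3614 x 243 column-major on the device; C = A^T * A is 243 x 243.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main()
{
    const int k = 3614, n = 243;                   // A is k x n, C is n x n
    std::vector<double> hA(static_cast<size_t>(k) * n, 1.0);

    double *dA = nullptr, *dC = nullptr;
    cudaMalloc(&dA, sizeof(double) * k * n);
    cudaMalloc(&dC, sizeof(double) * n * n);
    cudaMemcpy(dA, hA.data(), sizeof(double) * k * n, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const double alpha = 1.0, beta = 0.0;
    for (int i = 0; i < 1000; ++i) {
        // C = A^T * A : result is 243 x 243, inner dimension k = 3614
        cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, n, n, k,
                    &alpha, dA, k, dA, k, &beta, dC, n);
        cudaDeviceSynchronize();                   // time each iteration here if desired
    }

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dC);
    return 0;
}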
