Why is CUDA slower than OpenCL?

Bohdanms · September 12, 2018, 11:07am

Measuring the performance of a self-written test, 150 iterations

1 GPU (Open CL) devices are detected.

OpenCL device Name :GeForce GTX 750 Ti
OpenCL device Available :1
OpenCL device ImageSupport :1
OpenCL device OpenCL C Version:OpenCL C 1.2
OpenCL device OpenCL Version :OpenCL 1.2 CUDA
OpenCL device Driver Version :390.48
OpenCL device Version :OpenCL 1.2 CUDA

Default OpenCL device Name :GeForce GTX 750 Ti
Default OpenCL device Available :1
Default OpenCL device ImageSupport :1
Default OpenCL device OpenCL_C_Version:OpenCL C 1.2
Default OpenCL device OpenCL Version :OpenCL 1.2 CUDA
Default OpenCL device Driver Version :390.48
Default OpenCL device Version :OpenCL 1.2 CUDA

CudaEnabledDeviceCount 1
*** CUDA Device Query (Runtime API) version (CUDART static linking) ***

Device count: 1

Device 0: “GeForce GTX 750 Ti”
CUDA Driver Version / Runtime Version 9.10 / 9.10
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2001 MBytes (2098069504 bytes)
GPU Clock Speed: 1.11 GHz
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.10, CUDA Runtime Version = 9.10, NumDevs = 1

Test cv::cuda::cvtColor cicle count 150 cv::cuda::GpuMat 00:00:00.033
Test cv::cuda::threshold cicle count 150 cv::cuda::GpuMat 00:00:00.029
Test cv::cuda::bitwise_or cicle count 150 cv::cuda::GpuMat 00:00:00.021
Test cv::cuda::minMaxLoc cicle count 150 cv::cuda::GpuMat 00:00:00.128
Test cv::cuda::multiply cicle count 150 cv::cuda::GpuMat 00:00:00.165

Test cv::cvtColor cicle count 150 cv::UMat 00:00:00.044
Test cv::threshold cicle count 150 cv::UMat 00:00:00.004
Test cv::bitwise_or cicle count 150 cv::UMat 00:00:00.005
Test cv::minMaxLoc cicle count 150 cv::UMat 00:00:00.022
Test cv::multiply cicle count 150 cv::UMat 00:00:00.262

Questions:
Why do the OpenCV library functions implemented with OpenCL support (cv::UMat) run faster than the CUDA implementations (cv::cuda::GpuMat)?
Is there something I have not enabled when working with CUDA?
Can I put the graphics card into a performance mode?
What could be the reason for such poor results compared to OpenCL?

cbuchner1 · September 12, 2018, 11:20am

Aren’t those timestamps reported in h:m:s:ms?

In that case, OpenCL is slower than CUDA, because it shows longer run times.

That would be expected, because CUDA receives more updates and more optimizations than OpenCL (which seems to be something of an unwanted child at NVIDIA).

Bohdanms · September 12, 2018, 11:31am

Yes, the timestamps have the format h:m:s.ms.
But when I use CUDA (for example cv::cuda::threshold), the time is 00:00:00.029 for 150 iterations, and when I use OpenCL (cv::UMat and cv::threshold (…)), the time is 00:00:00.004 for 150 iterations. Why does CUDA run slower than OpenCL? What is the reason for this?

cbuchner1 · September 12, 2018, 1:50pm

Who wrote this benchmarking code, and how exactly is the kernel run time measured?

Bohdanms · September 12, 2018, 2:22pm

I wrote this benchmark. Here is the source code: GitHub - 2RoN4eG/Benchmark: Test OpenCV Perfomance, test cv::Mat, cv::UMat, cv::GpuMat

cbuchner1 · September 12, 2018, 3:21pm

An example timing routine from your code:

        int64 ticks = cv::getTickCount ();

        for (size_t number = 0; number < cicleNumber; ++ number)
        {
            function (cudaFrame, cudaProcessedFrame);
        }

        double time = ((double) cv::getTickCount () - ticks) / cv::getTickFrequency ();

I would rewrite it as shown below. More accurate timings could be obtained using CUDA events, as outlined here: https://devblogs.nvidia.com/how-implement-performance-metrics-cuda-cc/

Also, I do not currently know the OpenCL equivalent of cudaDeviceSynchronize(), but I am sure there is one.

        // warm up ("cook") the kernel: the first run is usually slower, and we do not want to time that
        function (cudaFrame, cudaProcessedFrame);

        // wait until cooking is finished
        cudaDeviceSynchronize();

        int64 ticks = cv::getTickCount ();

        for (size_t number = 0; number < cicleNumber; ++ number)
        {
            function (cudaFrame, cudaProcessedFrame);
        }

        // make sure we're really done on the device
        cudaDeviceSynchronize();

        double time = ((double) cv::getTickCount () - ticks) / cv::getTickFrequency ();
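
The same warm-up and synchronize pattern could be written once and reused for both backends. The following is a minimal sketch using only the C++ standard library, not the benchmark's actual code: timeSeconds and its parameters are hypothetical names. For CUDA, the sync callable would be cudaDeviceSynchronize(). For the cv::UMat path, cv::ocl::finish() (or clFinish on the command queue in plain OpenCL) should play the same role, but treat that as an assumption to verify against your OpenCV version.

```cpp
#include <chrono>
#include <cstddef>

// Times `iterations` calls of `work`, excluding one warm-up call.
// `sync` is expected to block until all work queued so far has finished
// on the device. Both callables are supplied by the caller.
template <typename Work, typename Sync>
double timeSeconds(Work work, Sync sync, std::size_t iterations)
{
    using Clock = std::chrono::steady_clock;

    work();  // warm-up run, not timed
    sync();  // wait until the warm-up has really finished

    const Clock::time_point start = Clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
    {
        work();
    }
    sync();  // for asynchronous backends, skipping this may time only the enqueue cost
    const Clock::time_point stop = Clock::now();

    return std::chrono::duration<double>(stop - start).count();
}
```

A call for the CUDA case might look like timeSeconds([&] { function (cudaFrame, cudaProcessedFrame); }, [] { cudaDeviceSynchronize(); }, cicleNumber). Passing the synchronization call as a parameter keeps the measured region identical for both backends, so the two sets of numbers are comparable.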
