Why is CUDA slower than OpenCL?

Measuring performance with a self-written test, 150 iterations per function

1 GPU (Open CL) devices are detected.

OpenCL device Name :GeForce GTX 750 Ti
OpenCL device Available :1
OpenCL device ImageSupport :1
OpenCL device OpenCL C Version:OpenCL C 1.2
OpenCL device OpenCL Version :OpenCL 1.2 CUDA
OpenCL device Driver Version :390.48
OpenCL device Version :OpenCL 1.2 CUDA

Default OpenCL device Name :GeForce GTX 750 Ti
Default OpenCL device Available :1
Default OpenCL device ImageSupport :1
Default OpenCL device OpenCL_C_Version:OpenCL C 1.2
Default OpenCL device OpenCL Version :OpenCL 1.2 CUDA
Default OpenCL device Driver Version :390.48
Default OpenCL device Version :OpenCL 1.2 CUDA

CudaEnabledDeviceCount 1
*** CUDA Device Query (Runtime API) version (CUDART static linking) ***

Device count: 1

Device 0: "GeForce GTX 750 Ti"
CUDA Driver Version / Runtime Version 9.10 / 9.10
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2001 MBytes (2098069504 bytes)
GPU Clock Speed: 1.11 GHz
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.10, CUDA Runtime Version = 9.10, NumDevs = 1

Test cv::cuda::cvtColor cicle count 150 cv::cuda::GpuMat 00:00:00.033
Test cv::cuda::threshold cicle count 150 cv::cuda::GpuMat 00:00:00.029
Test cv::cuda::bitwise_or cicle count 150 cv::cuda::GpuMat 00:00:00.021
Test cv::cuda::minMaxLoc cicle count 150 cv::cuda::GpuMat 00:00:00.128
Test cv::cuda::multiply cicle count 150 cv::cuda::GpuMat 00:00:00.165

Test cv::cvtColor cicle count 150 cv::UMat 00:00:00.044
Test cv::threshold cicle count 150 cv::UMat 00:00:00.004
Test cv::bitwise_or cicle count 150 cv::UMat 00:00:00.005
Test cv::minMaxLoc cicle count 150 cv::UMat 00:00:00.022
Test cv::multiply cicle count 150 cv::UMat 00:00:00.262

Why do the OpenCV functions implemented with OpenCL support (cv::UMat) perform better than their CUDA implementations (cv::cuda::GpuMat)?
Is there perhaps something I have not enabled when working with CUDA?
Can I put the graphics card into a performance mode?
What could be the reason for such poor results relative to OpenCL?

Aren’t those timestamps reported in h:m:s:ms?

In this case, OpenCL is slower than CUDA because it shows longer run times.

That would be expected, because CUDA receives more updates and optimizations than OpenCL (which seems to be something of an unwanted child at NVIDIA).

Yes, the timestamps have the format h:m:s.ms.
But when I use CUDA (for example cv::cuda::threshold), the time is 00:00:00.029 for 150 iterations, and when I use OpenCL (cv::UMat and cv::threshold (…)), the time is 00:00:00.004 for 150 iterations. Why does CUDA run slower than OpenCL? What is the reason for this?

Who wrote this benchmarking code, and how exactly is the kernel run time measured?

I wrote this benchmark. Here is the source code: https://github.com/2RoN4eG/Benchmark

Here is a representative timing routine from your code:

        int64 ticks = cv::getTickCount ();

        for (size_t number = 0; number < cicleNumber; ++ number)
            function (cudaFrame, cudaProcessedFrame);

        double time = ((double) cv::getTickCount () - ticks) / cv::getTickFrequency ();

I would rewrite it as shown further down. Even more accurate timings could be obtained using CUDA events (cudaEvent_t), as outlined here: https://devblogs.nvidia.com/how-implement-performance-metrics-cuda-cc/

I do not know offhand what the OpenCL equivalent of cudaDeviceSynchronize() is, but I am sure there is one.

        // "cook" (warm up) the kernel: the first run is usually slower, and we do not want to time it
        function (cudaFrame, cudaProcessedFrame);

        // wait until cooking is finished
        // (cudaDeviceSynchronize () is declared in <cuda_runtime.h>)
        cudaDeviceSynchronize ();

        int64 ticks = cv::getTickCount ();

        for (size_t number = 0; number < cicleNumber; ++ number)
            function (cudaFrame, cudaProcessedFrame);

        // make sure we're really done on the device before stopping the clock
        cudaDeviceSynchronize ();

        double time = ((double) cv::getTickCount () - ticks) / cv::getTickFrequency ();