Performance degradation on CUDA

Good day.

I ran the tests on an Intel® Core™ i7-6700 CPU @ 3.40GHz and an NVIDIA GeForce GTX 750 Ti video card.

When I run a self-written test that uses OpenCV library functions on a computer with Intel Core i7 and GeForce GTX 750 Ti, I get the following results:

OpenCL device Name :GeForce GTX 750 Ti
OpenCL device Available :1
OpenCL device ImageSupport :1
OpenCL device OpenCL C Version:OpenCL C 1.2
OpenCL device OpenCL Version :cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
OpenCL device Driver Version :390.48
OpenCL device Version :OpenCL 1.2 CUDA

Default OpenCL device Name :GeForce GTX 750 Ti
Default OpenCL device Available :1
Default OpenCL device ImageSupport :1
Default OpenCL device OpenCL_C_Version:OpenCL C 1.2
Default OpenCL device OpenCL Version :cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
Default OpenCL device Driver Version :390.48
Default OpenCL device Version :OpenCL 1.2 CUDA

Test cv::cvtColor cicle count 150 cv::UMat 00:00:00.003
Test cv::cvtColor cicle count 150 cv::Mat 00:00:00.064

Test cv::filter2D cicle count 150 cv::UMat 00:00:00.004
Test cv::filter2D cicle count 150 cv::Mat 00:00:03.046

Test cv::threshold cicle count 150 cv::UMat 00:00:00.002
Test cv::threshold cicle count 150 cv::Mat 00:00:00.013

Test cv::dilate cicle count 150 cv::UMat 00:00:00.541
Test cv::dilate cicle count 150 cv::Mat 00:00:00.244

Test cv::bitwise_or cicle count 150 cv::UMat 00:00:00.002
Test cv::bitwise_or cicle count 150 cv::Mat 00:00:00.037

Test cv::matchTemplate cicle count 150 cv::UMat 00:00:07.533
Test cv::matchTemplate cicle count 150 cv::Mat 00:00:25.727

Test cv::minMaxLoc cicle count 150 cv::UMat 00:00:00.025
Test cv::minMaxLoc cicle count 150 cv::Mat 00:00:00.206

Test cv::findContours cicle count 150 cv::UMat 00:00:00.119
Test cv::findContours cicle count 150 cv::Mat 00:00:00.044

Test cv::multiply cicle count 150 cv::UMat 00:00:00.270
Test cv::multiply cicle count 150 cv::Mat 00:00:00.376

Next, I run the same OpenCV functions in their GPU variants (the cv::cuda module, using CUDA) and get these results:

CudaEnabledDeviceCount 1
*** CUDA Device Query (Runtime API) version (CUDART static linking) ***

Device count: 1

Device 0: “GeForce GTX 750 Ti”
CUDA Driver Version / Runtime Version 9.10 / 9.10
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2001 MBytes (2098069504 bytes)
GPU Clock Speed: 1.11 GHz
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.10, CUDA Runtime Version = 9.10, NumDevs = 1

Test cv::cuda::cvtColor cicle count 150 cv::cuda::GpuMat 00:00:00.027 <- CPU 00:00:00.064, GPU (OpenCL cv::UMat) 00:00:00.003
Test cv::cuda::threshold cicle count 150 cv::cuda::GpuMat 00:00:00.025 <- CPU 00:00:00.013, GPU (OpenCL cv::UMat) 00:00:00.002
Test cv::cuda::bitwise_or cicle count 150 cv::cuda::GpuMat 00:00:00.020 <- CPU 00:00:00.037, GPU (OpenCL cv::UMat) 00:00:00.002
Test cv::cuda::minMaxLoc cicle count 150 cv::cuda::GpuMat 00:00:00.118 <- CPU 00:00:00.206, GPU (OpenCL cv::UMat) 00:00:00.025
Test cv::cuda::multiply cicle count 150 cv::cuda::GpuMat 00:00:00.150 <- CPU 00:00:00.376, GPU (OpenCL cv::UMat) 00:00:00.270

Question:
1 - Why do the CUDA tests (cv::cuda::GpuMat and cv::cuda::…) run slower than the OpenCL tests (cv::UMat) in my self-written benchmark?

Next, I run the test on the Jetson TX2 board running Ubuntu 16.04 (OpenCV without CUDA, installed from NVIDIA JetPack) and get the following results:

Test cv::cvtColor cicle count 150 cv::UMat 00:00:00.150
Test cv::cvtColor cicle count 150 cv::Mat 00:00:00.146

Test cv::filter2D cicle count 150 cv::UMat 00:00:09.198
Test cv::filter2D cicle count 150 cv::Mat 00:00:09.178

Test cv::threshold cicle count 150 cv::UMat 00:00:00.139
Test cv::threshold cicle count 150 cv::Mat 00:00:00.126

Test cv::dilate cicle count 150 cv::UMat 00:00:01.031
Test cv::dilate cicle count 150 cv::Mat 00:00:01.026

Test cv::bitwise_or cicle count 150 cv::UMat 00:00:00.173
Test cv::bitwise_or cicle count 150 cv::Mat 00:00:00.159

Test cv::matchTemplate cicle count 150 cv::UMat 00:01:18.369
Test cv::matchTemplate cicle count 150 cv::Mat 00:01:17.294

Test cv::minMaxLoc cicle count 150 cv::UMat 00:00:00.832
Test cv::minMaxLoc cicle count 150 cv::Mat 00:00:00.834

Test cv::findContours cicle count 150 cv::UMat 00:00:00.662
Test cv::findContours cicle count 150 cv::Mat 00:00:00.659

Test cv::multiply cicle count 150 cv::UMat 00:00:00.280
Test cv::multiply cicle count 150 cv::Mat 00:00:00.277

Next, I run the test on the Jetson TX2 board running Ubuntu 16.04 (OpenCV 3.4.1 compiled for Tegra with CUDA) and get the following results:

Have Open CL [ INFO:0] Initialize OpenCL runtime…
0

Use Open CL 0
Have AMD Blas 0
Have AMD FFT 0
Have SVM 0
OpenCL is not available…

CudaEnabledDeviceCount 1

*** CUDA Device Query (Runtime API) version (CUDART static linking) ***

CUDA Device count: 1

Device 0: “NVIDIA Tegra X2”
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.2
Total amount of global memory: 7846 MBytes (8227401728 bytes)
GPU Clock Speed: 1.30 GHz
Max Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072,65536), 3D=(16384,16384,16384)
Max Layered Texture Size (dim) x layers 1D=(32768) x 2048, 2D=(32768,32768) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 0 / 0
Compute Mode:
Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1

Test cv::cuda::cvtColor cicle count 150 cv::cuda::GpuMat 00:00:00.446 <- CPU (cv::Mat) 00:00:00.146, GPU (OpenCL cv::UMat) 00:00:00.150
Test cv::cuda::threshold cicle count 150 cv::cuda::GpuMat 00:00:00.193 <- CPU (cv::Mat) 00:00:00.126, GPU (OpenCL cv::UMat) 00:00:00.139
Test cv::cuda::bitwise_or cicle count 150 cv::cuda::GpuMat 00:00:00.172 <- CPU (cv::Mat) 00:00:00.159, GPU (OpenCL cv::UMat) 00:00:00.173
Test cv::cuda::minMaxLoc cicle count 150 cv::cuda::GpuMat 00:00:00.918 <- CPU (cv::Mat) 00:00:00.834, GPU (OpenCL cv::UMat) 00:00:00.832
Test cv::cuda::findContours cicle count 150 cv::cuda::GpuMat 00:00:00.000 <- CPU (cv::Mat) 00:00:00.659, GPU (OpenCL cv::UMat) 00:00:00.662
Test cv::cuda::multiply cicle count 150 cv::cuda::GpuMat 00:00:01.809 <- CPU (cv::Mat) 00:00:00.277, GPU (OpenCL cv::UMat) 00:00:00.280

Questions:
1 - Why do I get no performance gain, and in some cases a performance penalty, when I run my benchmark on the Jetson TX2 board using CUDA?
2 - Why is cv::UMat not processed on the GPU when I run my benchmark on the Jetson TX2 board using CUDA, so that I see no performance gain?

I also tested CUDA performance with the “boxFilter” program from the CUDA examples, started with the “benchmark” parameter. The results:

./boxFilter Starting…

Loaded ‘./data/lenaRGB.ppm’, 1024 x 1024 pixels
GPU Device 0: “NVIDIA Tegra X2” with compute capability 6.2

[runBenchmark]: [CUDA Iterative Box Filter]

Running BoxFilterGPU for 150 cycles…

boxFilter-texture, Throughput = 270.0057 M RGBA Pixels/s, Time = 0.00388 s, Size = 1048576 RGBA Pixels, NumDevsUsed = 1, Workgroup = 64

Time = 0.00388 s is the processing time of one frame using only CUDA (I checked the source code).

A similar function (cv::boxFilter) from the OpenCV library, computed with OpenCL on the GTX 750 Ti for the same image (the Lena picture from the boxFilter example, 1024x1024 px, 3 channels), takes 7.4729e-05 s per frame, which is about 52 times faster than CUDA on the NVIDIA Jetson TX2 board (0.00388 s per frame).

Test cv::boxFilter cicle count 150 cv::UMat 00:00:00.011 time per frame 7.4729e-05
Test cv::boxFilter cicle count 150 cv::Mat 00:00:00.491 time per frame 0.00327866

Test cv::blur cicle count 150 cv::UMat 00:00:00.003 time per frame 2.03031e-05
Test cv::blur cicle count 150 cv::Mat 00:00:00.477 time per frame 0.0031854
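
The arithmetic behind this comparison can be checked with a few lines of C++. This is only a sketch built from the per-frame times quoted above; the function names (`speedup`, `throughput_mpix`, `print_boxfilter_comparison`) are illustrative and are not part of OpenCV or the CUDA samples.

```cpp
#include <cassert>
#include <cstdio>

// How many times faster `fast_s` is than `slow_s` (both in seconds per frame).
double speedup(double slow_s, double fast_s) {
    assert(fast_s > 0.0);
    return slow_s / fast_s;
}

// Throughput in millions of pixels per second for one frame.
double throughput_mpix(double pixels, double seconds) {
    assert(seconds > 0.0);
    return pixels / seconds / 1e6;
}

// Prints the two figures discussed in the post, using the quoted timings.
void print_boxfilter_comparison() {
    const double cuda_tx2_s = 0.00388;         // boxFilter CUDA sample on the Jetson TX2
    const double opencl_750ti_s = 7.4729e-05;  // cv::boxFilter with cv::UMat on the GTX 750 Ti
    std::printf("OpenCL (GTX 750 Ti) vs CUDA (TX2): %.1fx\n",
                speedup(cuda_tx2_s, opencl_750ti_s));
    std::printf("TX2 throughput: %.1f M pixels/s\n",
                throughput_mpix(1024.0 * 1024.0, cuda_tx2_s));
}
```

Note that this ratio compares two different GPUs running two different implementations, so it says more about the overall setup than about CUDA versus OpenCL as such.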

Questions:
1 - What am I doing wrong?
2 - How can I speed up GPU processing using CUDA?

Hi,

We have trouble understanding your question.
Could you summarize your issue for us?

Here are some initial suggestions:
1. OpenCL may also use the GPU. What do you want to compare: CPU vs. GPU, or OpenCL vs. CUDA?
2. Please remember to maximize TX2 performance with this script:

sudo ./jetson_clocks.sh

Thanks.

I ran the test again on the Jetson TX2 board (Ubuntu 16.04, OpenCV 3.4.1 compiled for Tegra with CUDA) after first running the command “sudo ./jetson_clocks.sh”, and got the following results:

CudaEnabledDeviceCount 1

*** CUDA Device Query (Runtime API) version (CUDART static linking) *** 

Device count: 1

Device 0: "NVIDIA Tegra X2"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.2
  Total amount of global memory:                 7846 MBytes (8227401728 bytes)
  GPU Clock Speed:                               1.30 GHz
  Max Texture Dimension Size (x,y,z)             1D=(131072), 2D=(131072,65536), 3D=(16384,16384,16384)
  Max Layered Texture Size (dim) x layers        1D=(32768) x 2048, 2D=(32768,32768) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           0 / 0
  Compute Mode:
      Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) 

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version  = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1


/** Results before executing the command "sudo ./jetson_clocks.sh" */
Test cv::cuda::cvtColor     cicle count 150 cv::cuda::GpuMat 00:00:00.446 <- CPU (cv::Mat) 00:00:00.146, GPU (cv::UMat) 00:00:00.150
Test cv::cuda::threshold    cicle count 150 cv::cuda::GpuMat 00:00:00.193 <- CPU (cv::Mat) 00:00:00.126, GPU (cv::UMat) 00:00:00.139
Test cv::cuda::bitwise_or   cicle count 150 cv::cuda::GpuMat 00:00:00.172 <- CPU (cv::Mat) 00:00:00.159, GPU (cv::UMat) 00:00:00.173
Test cv::cuda::minMaxLoc    cicle count 150 cv::cuda::GpuMat 00:00:00.918 <- CPU (cv::Mat) 00:00:00.834, GPU (cv::UMat) 00:00:00.832
Test cv::cuda::findContours cicle count 150 cv::cuda::GpuMat 00:00:00.000 <- CPU (cv::Mat) 00:00:00.659, GPU (cv::UMat) 00:00:00.662
Test cv::cuda::multiply     cicle count 150 cv::cuda::GpuMat 00:00:01.809 <- CPU (cv::Mat) 00:00:00.277, GPU (cv::UMat) 00:00:00.280

/** Results after executing the command "sudo ./jetson_clocks.sh" */
Test cv::cuda::cvtColor     cicle count 150 cv::cuda::GpuMat 00:00:00.074 <- CPU (cv::Mat) 00:00:00.146, GPU (cv::UMat) 00:00:00.150 / 6.03 times faster than the result above
Test cv::cuda::threshold    cicle count 150 cv::cuda::GpuMat 00:00:00.068 <- CPU (cv::Mat) 00:00:00.126, GPU (cv::UMat) 00:00:00.139 / 2.84 times faster than the result above
Test cv::cuda::bitwise_or   cicle count 150 cv::cuda::GpuMat 00:00:00.046 <- CPU (cv::Mat) 00:00:00.159, GPU (cv::UMat) 00:00:00.173 / 3.74 times faster than the result above
Test cv::cuda::minMaxLoc    cicle count 150 cv::cuda::GpuMat 00:00:00.207 <- CPU (cv::Mat) 00:00:00.834, GPU (cv::UMat) 00:00:00.832 / 4.43 times faster than the result above
Test cv::cuda::multiply     cicle count 150 cv::cuda::GpuMat 00:00:00.490 <- CPU (cv::Mat) 00:00:00.277, GPU (cv::UMat) 00:00:00.280 / 3.69 times faster than the result above
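
The speed-up factors quoted on these lines are the elapsed time before the clock script divided by the elapsed time after it. A small sketch of that calculation, assuming the "HH:MM:SS.mmm" format printed by the benchmark; `to_seconds`, `speedup_from_stamps` and `print_speedups` are hypothetical helper names:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Convert a timing string of the form "HH:MM:SS.mmm" into seconds.
// Assumes exactly three fractional digits; returns -1.0 if parsing fails.
double to_seconds(const std::string& stamp) {
    int h = 0, m = 0, s = 0, ms = 0;
    if (std::sscanf(stamp.c_str(), "%d:%d:%d.%d", &h, &m, &s, &ms) != 4) {
        return -1.0;
    }
    return h * 3600.0 + m * 60.0 + s + ms / 1000.0;
}

// Ratio before/after; values above 1 mean the "after" run is faster.
double speedup_from_stamps(const std::string& before, const std::string& after) {
    const double b = to_seconds(before);
    const double a = to_seconds(after);
    assert(b >= 0.0 && a > 0.0);
    return b / a;
}

// Prints the ratio for each cv::cuda test, using the timings listed above.
void print_speedups() {
    const char* rows[][3] = {
        {"cvtColor",   "00:00:00.446", "00:00:00.074"},
        {"threshold",  "00:00:00.193", "00:00:00.068"},
        {"bitwise_or", "00:00:00.172", "00:00:00.046"},
        {"minMaxLoc",  "00:00:00.918", "00:00:00.207"},
        {"multiply",   "00:00:01.809", "00:00:00.490"},
    };
    for (const auto& r : rows) {
        std::printf("%-10s %.2fx\n", r[0], speedup_from_stamps(r[1], r[2]));
    }
}
```

With millisecond-resolution totals over 150 iterations, the shortest tests carry a noticeable rounding error, so the second decimal of these ratios should not be over-interpreted.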

/** Test on the Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz */
Test cv::cvtColor      cicle count 150  cv::Mat 00:00:00.063
Test cv::threshold     cicle count 150  cv::Mat 00:00:00.017
Test cv::bitwise_or    cicle count 150  cv::Mat 00:00:00.041
Test cv::matchTemplate cicle count 150  cv::Mat 00:00:26.008
Test cv::minMaxLoc     cicle count 150  cv::Mat 00:00:00.199
Test cv::findContours  cicle count 150  cv::Mat 00:00:00.044
Test cv::multiply      cicle count 150  cv::Mat 00:00:00.384

Question:
Do I understand correctly that the tests on the Jetson TX2 board lose to the equivalent calculations on the CPU (Intel® Core™ i7-6700 CPU @ 3.40GHz), or is there some other way to speed up processing on the Jetson TX2?

Thanks.

I don't understand: is there any way to increase performance on the Jetson TX2 board? Or do I already have maximum performance once I run the command “sudo ./jetson_clocks.sh”?

“jetson_clocks.sh” will max out the clocks within the allowed clock range. Different modes may enable different clock ranges. If you first run “sudo nvpmodel -m 0”, and then run “sudo ~/ubuntu/tegra_clocks.sh”, you will know that the Denver cores are also running and that the range of available clocks is set to max.

Someone may be able to offer advice on how to profile this to narrow down where it is slowing down.
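
One common way to narrow this down, sketched here under assumptions, is to time the host-to-device upload, the GPU computation, and the device-to-host download separately, after a few untimed warm-up iterations. The sketch uses only the C++ standard library; `PhaseTimes`, `time_call` and `profile_phases` are hypothetical names. In an OpenCV CUDA benchmark the three callables would typically wrap cv::cuda::GpuMat::upload, the cv::cuda function under test, and cv::cuda::GpuMat::download, and each callable is assumed to block until its work has finished.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Accumulated wall-clock time per phase, in seconds.
struct PhaseTimes {
    double upload_s = 0.0;
    double compute_s = 0.0;
    double download_s = 0.0;
};

// Wall-clock time of a single call, in seconds.
double time_call(const std::function<void()>& fn) {
    const auto t0 = std::chrono::steady_clock::now();
    fn();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Runs `warmup` untimed iterations (to absorb one-time initialization),
// then `iters` timed iterations, summing the time spent in each phase.
PhaseTimes profile_phases(const std::function<void()>& upload,
                          const std::function<void()>& compute,
                          const std::function<void()>& download,
                          int warmup, int iters) {
    assert(warmup >= 0 && iters > 0);
    for (int i = 0; i < warmup; ++i) {
        upload();
        compute();
        download();
    }
    PhaseTimes total;
    for (int i = 0; i < iters; ++i) {
        total.upload_s += time_call(upload);
        total.compute_s += time_call(compute);
        total.download_s += time_call(download);
    }
    return total;
}
```

If the upload and download totals dominate the compute total, the benchmark is mostly measuring memory transfers rather than the GPU function itself.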

Hi,

Could you monitor your program with tegrastats to check the GPU load?

sudo ~/tegrastats

By the way, we recommend using our official CUDA samples for benchmarking rather than a third-party implementation.
For example, matrixMul in the CUDA samples.

Thanks.

OpenCV is listed as an NVIDIA partner (https://developer.nvidia.com/gpu-accelerated-libraries). Does this mean that the OpenCV library is official? My testing is meant to show how much the OpenCV functions are accelerated with CUDA.

Hi,

The implementation is from OpenCV.
We also have some vision APIs, and we recommend benchmarking with them.

May I know what your target is?
Do you want to compare speed between processors, or do you want an optimized implementation for a vision problem?
Thanks.

My main goal is to compare performance on my PC (Intel i7-6700, NVIDIA GeForce GTX 750 Ti) with performance on the Jetson TX2 board using OpenCV.
On my PC, I get maximum performance when I use OpenCV with the OpenCL implementation. CUDA on my PC is slower than OpenCL.
After that I need to build a device for computer vision. I want to compare performance on these devices:

  • Intel i7-6700 (CPU processing)
  • NVIDIA GeForce GTX 750 Ti (OpenCV with OpenCL implementation)
  • NVIDIA GeForce GTX 750 Ti (OpenCV with CUDA implementation)
  • NVIDIA Jetson TX2 (OpenCV with CUDA implementation)

Also, my algorithm is already implemented with OpenCV, and at the moment I cannot find your Vision API.

Thanks.

Hi,

Here are two available vision APIs for your reference:

NPP: https://developer.nvidia.com/npp
VisionWorks: https://developer.nvidia.com/embedded/visionworks

Thanks.