Performance degradation on CUDA

Bohdanms · September 12, 2018, 7:21am

Good day.

I ran the tests on an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
and on an NVIDIA GeForce GTX 750 Ti video card.

When I run my own test, which uses OpenCV library functions, on a computer with the Intel Core i7 and the GeForce GTX 750 Ti, I get the following results:

OpenCL device Name :GeForce GTX 750 Ti
OpenCL device Available :1
OpenCL device ImageSupport :1
OpenCL device OpenCL C Version:OpenCL C 1.2
OpenCL device OpenCL Version :cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
OpenCL device Driver Version :390.48
OpenCL device Version :OpenCL 1.2 CUDA

Default OpenCL device Name :GeForce GTX 750 Ti
Default OpenCL device Available :1
Default OpenCL device ImageSupport :1
Default OpenCL device OpenCL_C_Version:OpenCL C 1.2
Default OpenCL device OpenCL Version :cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
Default OpenCL device Driver Version :390.48
Default OpenCL device Version :OpenCL 1.2 CUDA

Test cv::cvtColor cicle count 150 cv::UMat 00:00:00.003
Test cv::cvtColor cicle count 150 cv::Mat 00:00:00.064

Test cv::filter2D cicle count 150 cv::UMat 00:00:00.004
Test cv::filter2D cicle count 150 cv::Mat 00:00:03.046

Test cv::threshold cicle count 150 cv::UMat 00:00:00.002
Test cv::threshold cicle count 150 cv::Mat 00:00:00.013

Test cv::dilate cicle count 150 cv::UMat 00:00:00.541
Test cv::dilate cicle count 150 cv::Mat 00:00:00.244

Test cv::bitwise_or cicle count 150 cv::UMat 00:00:00.002
Test cv::bitwise_or cicle count 150 cv::Mat 00:00:00.037

Test cv::matchTemplate cicle count 150 cv::UMat 00:00:07.533
Test cv::matchTemplate cicle count 150 cv::Mat 00:00:25.727

Test cv::minMaxLoc cicle count 150 cv::UMat 00:00:00.025
Test cv::minMaxLoc cicle count 150 cv::Mat 00:00:00.206

Test cv::findContours cicle count 150 cv::UMat 00:00:00.119
Test cv::findContours cicle count 150 cv::Mat 00:00:00.044

Test cv::multiply cicle count 150 cv::UMat 00:00:00.270
Test cv::multiply cicle count 150 cv::Mat 00:00:00.376
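
The source of the test program is not included in this thread. As a rough sketch only, a harness that produces lines like those above might look like the following. The function names, the untimed warm-up call, and the placeholder workload are all assumptions made for illustration, not the original code.

```python
import time


def format_elapsed(seconds):
    """Format a duration as HH:MM:SS.mmm, the layout used in the log lines."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def run_benchmark(func, cycles=150, warmup=1):
    """Call func `warmup` times untimed, then time `cycles` calls.

    Returns (total_seconds, seconds_per_call).
    """
    for _ in range(warmup):
        func()  # one-time setup cost stays out of the measurement
    start = time.perf_counter()
    for _ in range(cycles):
        func()
    total = time.perf_counter() - start
    return total, total / cycles


if __name__ == "__main__":
    # Hypothetical workload standing in for an image-processing call.
    def workload():
        sum(i * i for i in range(10_000))

    total, per_call = run_benchmark(workload)
    print(f"Test workload cycle count 150 {format_elapsed(total)} "
          f"time per frame {per_call:g}")
```

The warm-up call is there so that one-time initialization does not dominate a short run. Whether the original test did this is not stated in the post.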

Next, I run the same test with the GPU (CUDA) versions of these OpenCV functions and get the following results:

CudaEnabledDeviceCount 1
*** CUDA Device Query (Runtime API) version (CUDART static linking) ***

Device count: 1

Device 0: “GeForce GTX 750 Ti”
CUDA Driver Version / Runtime Version 9.10 / 9.10
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2001 MBytes (2098069504 bytes)
GPU Clock Speed: 1.11 GHz
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.10, CUDA Runtime Version = 9.10, NumDevs = 1

Test cv::cuda::cvtColor cicle count 150 cv::cuda::GpuMat 00:00:00.027 ← CPU 00:00:00.064, GPU (OpenCL cv::UMat) 00:00:00.003
Test cv::cuda::threshold cicle count 150 cv::cuda::GpuMat 00:00:00.025 ← CPU 00:00:00.013, GPU (OpenCL cv::UMat) 00:00:00.002
Test cv::cuda::bitwise_or cicle count 150 cv::cuda::GpuMat 00:00:00.020 ← CPU 00:00:00.037, GPU (OpenCL cv::UMat) 00:00:00.002
Test cv::cuda::minMaxLoc cicle count 150 cv::cuda::GpuMat 00:00:00.118 ← CPU 00:00:00.206, GPU (OpenCL cv::UMat) 00:00:00.025
Test cv::cuda::multiply cicle count 150 cv::cuda::GpuMat 00:00:00.150 ← CPU 00:00:00.376, GPU (OpenCL cv::UMat) 00:00:00.270

Question:
1 - In my own benchmark, why do the CUDA tests (cv::cuda::GpuMat and cv::cuda::…) run slower than the OpenCL tests (cv::UMat)?

Next, I run the test on a Jetson TX2 board running Ubuntu 16.04 (OpenCV without CUDA, installed from NVIDIA JetPack) and get the following results:

Test cv::cvtColor cicle count 150 cv::UMat 00:00:00.150
Test cv::cvtColor cicle count 150 cv::Mat 00:00:00.146

Test cv::filter2D cicle count 150 cv::UMat 00:00:09.198
Test cv::filter2D cicle count 150 cv::Mat 00:00:09.178

Test cv::threshold cicle count 150 cv::UMat 00:00:00.139
Test cv::threshold cicle count 150 cv::Mat 00:00:00.126

Test cv::dilate cicle count 150 cv::UMat 00:00:01.031
Test cv::dilate cicle count 150 cv::Mat 00:00:01.026

Test cv::bitwise_or cicle count 150 cv::UMat 00:00:00.173
Test cv::bitwise_or cicle count 150 cv::Mat 00:00:00.159

Test cv::matchTemplate cicle count 150 cv::UMat 00:01:18.369
Test cv::matchTemplate cicle count 150 cv::Mat 00:01:17.294

Test cv::minMaxLoc cicle count 150 cv::UMat 00:00:00.832
Test cv::minMaxLoc cicle count 150 cv::Mat 00:00:00.834

Test cv::findContours cicle count 150 cv::UMat 00:00:00.662
Test cv::findContours cicle count 150 cv::Mat 00:00:00.659

Test cv::multiply cicle count 150 cv::UMat 00:00:00.280
Test cv::multiply cicle count 150 cv::Mat 00:00:00.277

Next, I run the test on the Jetson TX2 board running Ubuntu 16.04 (OpenCV with CUDA, version 3.4.1 compiled for Tegra with CUDA) and get the following results:

Have Open CL [ INFO:0] Initialize OpenCL runtime…
0

Use Open CL 0
Have AMD Blas 0
Have AMD FFT 0
Have SVM 0
OpenCL is not available…

CudaEnabledDeviceCount 1

*** CUDA Device Query (Runtime API) version (CUDART static linking) ***

CUDA Device count: 1

Device 0: “NVIDIA Tegra X2”
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.2
Total amount of global memory: 7846 MBytes (8227401728 bytes)
GPU Clock Speed: 1.30 GHz
Max Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072,65536), 3D=(16384,16384,16384)
Max Layered Texture Size (dim) x layers 1D=(32768) x 2048, 2D=(32768,32768) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 0 / 0
Compute Mode:
Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1

Test cv::cuda::cvtColor cicle count 150 cv::cuda::GpuMat 00:00:00.446 ← CPU (cv::Mat) 00:00:00.146, GPU (OpenCL cv::UMat) 00:00:00.150
Test cv::cuda::threshold cicle count 150 cv::cuda::GpuMat 00:00:00.193 ← CPU (cv::Mat) 00:00:00.126, GPU (OpenCL cv::UMat) 00:00:00.139
Test cv::cuda::bitwise_or cicle count 150 cv::cuda::GpuMat 00:00:00.172 ← CPU (cv::Mat) 00:00:00.159, GPU (OpenCL cv::UMat) 00:00:00.173
Test cv::cuda::minMaxLoc cicle count 150 cv::cuda::GpuMat 00:00:00.918 ← CPU (cv::Mat) 00:00:00.834, GPU (OpenCL cv::UMat) 00:00:00.832
Test cv::cuda::findContours cicle count 150 cv::cuda::GpuMat 00:00:00.000 ← CPU (cv::Mat) 00:00:00.659, GPU (OpenCL cv::UMat) 00:00:00.662
Test cv::cuda::multiply cicle count 150 cv::cuda::GpuMat 00:00:01.809 ← CPU (cv::Mat) 00:00:00.277, GPU (OpenCL cv::UMat) 00:00:00.280

Questions:
1 - Why does my benchmark show a slowdown rather than a speedup when it uses CUDA on the Jetson TX2 board?
2 - Why is cv::UMat not processed on the GPU when I run my benchmark on the Jetson TX2 board with the CUDA build, so that I see no performance gain?

I also tested CUDA performance with the “boxFilter” program from the CUDA samples, started with the “benchmark” parameter. The results:

./boxFilter Starting…

Loaded ‘./data/lenaRGB.ppm’, 1024 x 1024 pixels
GPU Device 0: “NVIDIA Tegra X2” with compute capability 6.2

[runBenchmark]: [CUDA Iterative Box Filter]

Running BoxFilterGPU for 150 cycles…

boxFilter-texture, Throughput = 270.0057 M RGBA Pixels/s, Time = 0.00388 s, Size = 1048576 RGBA Pixels, NumDevsUsed = 1, Workgroup = 64

Time = 0.00388 s is the processing time of one frame using CUDA only (I checked the source code).

The equivalent function (boxFilter) in the OpenCV library, computed on the GTX 750 Ti with OpenCL, takes 7.4729e-05 s per frame for the same image (Lena.ppm, 1024x1024 px, 3 channels, from the boxFilter example). That is about 52 times faster than CUDA on the NVIDIA Jetson TX2 board.

Test cv::boxFilter cicle count 150 cv::UMat 00:00:00.011 time per frame 7.4729e-05
Test cv::boxFilter cicle count 150 cv::Mat 00:00:00.491 time per frame 0.00327866

Test cv::blur cicle count 150 cv::UMat 00:00:00.003 time per frame 2.03031e-05
Test cv::blur cicle count 150 cv::Mat 00:00:00.477 time per frame 0.0031854
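
The comparison above is plain arithmetic on per-frame times, and it can be rechecked with a few lines. This is only a sketch: the variable names are made up for illustration, and the numbers are copied from the boxFilter output and the log lines above. The two GPU figures come from different hardware (Jetson TX2 versus GTX 750 Ti).

```python
def speedup(slow_seconds, fast_seconds):
    """How many times faster the second measurement is than the first."""
    return slow_seconds / fast_seconds


# Per-frame times in seconds, copied from the results above.
cuda_sample_tx2 = 0.00388       # boxFilter CUDA sample on the Jetson TX2
opencl_umat_750ti = 7.4729e-05  # cv::boxFilter with cv::UMat on the GTX 750 Ti
cpu_mat_i7 = 0.00327866         # cv::boxFilter with cv::Mat on the i7-6700

print(f"OpenCL (750 Ti) vs CUDA sample (TX2): "
      f"x{speedup(cuda_sample_tx2, opencl_umat_750ti):.1f}")
print(f"OpenCL (750 Ti) vs cv::Mat (i7):      "
      f"x{speedup(cpu_mat_i7, opencl_umat_750ti):.1f}")
```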

Questions:
1 - What am I doing wrong?
2 - How can I speed up GPU processing using CUDA?

AastaLLL · September 12, 2018, 7:53am

Hi,

We are having trouble understanding your question.
Could you summarize your issue for us?

Here are some suggestions to start with:
1. OpenCL may also use the GPU. What do you want to compare: CPU vs. GPU, or OpenCL vs. CUDA?
2. Please remember to maximize TX2 performance with this script:

sudo ./jetson_clocks.sh

Thanks.

Bohdanms · September 12, 2018, 10:44am

I ran the test again on the Jetson TX2 board (Ubuntu 16.04, OpenCV version 3.4.1 compiled for Tegra with CUDA) after first running the command “sudo ./jetson_clocks.sh”, and got the following results:

CudaEnabledDeviceCount 1

*** CUDA Device Query (Runtime API) version (CUDART static linking) *** 

Device count: 1

Device 0: "NVIDIA Tegra X2"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.2
  Total amount of global memory:                 7846 MBytes (8227401728 bytes)
  GPU Clock Speed:                               1.30 GHz
  Max Texture Dimension Size (x,y,z)             1D=(131072), 2D=(131072,65536), 3D=(16384,16384,16384)
  Max Layered Texture Size (dim) x layers        1D=(32768) x 2048, 2D=(32768,32768) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           0 / 0
  Compute Mode:
      Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) 

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version  = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1


/ ** Results before executing the command "sudo ./jetson_clocks.sh"*/
Test cv::cuda::cvtColor     cicle count 150 cv::cuda::GpuMat 00:00:00.446 <- CPU (cv::Mat) 00:00:00.146, GPU (cv::UMat) 00:00:00.150
Test cv::cuda::threshold    cicle count 150 cv::cuda::GpuMat 00:00:00.193 <- CPU (cv::Mat) 00:00:00.126, GPU (cv::UMat) 00:00:00.139
Test cv::cuda::bitwise_or   cicle count 150 cv::cuda::GpuMat 00:00:00.172 <- CPU (cv::Mat) 00:00:00.159, GPU (cv::UMat) 00:00:00.173
Test cv::cuda::minMaxLoc    cicle count 150 cv::cuda::GpuMat 00:00:00.918 <- CPU (cv::Mat) 00:00:00.834, GPU (cv::UMat) 00:00:00.832
Test cv::cuda::findContours cicle count 150 cv::cuda::GpuMat 00:00:00.000 <- CPU (cv::Mat) 00:00:00.659, GPU (cv::UMat) 00:00:00.662
Test cv::cuda::multiply     cicle count 150 cv::cuda::GpuMat 00:00:01.809 <- CPU (cv::Mat) 00:00:00.277, GPU (cv::UMat) 00:00:00.280

/ ** Results after executing the command "sudo ./jetson_clocks.sh"*/
Test cv::cuda::cvtColor     cicle count 150 cv::cuda::GpuMat 00:00:00.074 <- CPU (cv::Mat) 00:00:00.146, GPU (cv::UMat) 00:00:00.150 / 6.03 times faster than the results above
Test cv::cuda::threshold    cicle count 150 cv::cuda::GpuMat 00:00:00.068 <- CPU (cv::Mat) 00:00:00.126, GPU (cv::UMat) 00:00:00.139 / 2.84 times faster than the results above
Test cv::cuda::bitwise_or   cicle count 150 cv::cuda::GpuMat 00:00:00.046 <- CPU (cv::Mat) 00:00:00.159, GPU (cv::UMat) 00:00:00.173 / 3.74 times faster than the results above
Test cv::cuda::minMaxLoc    cicle count 150 cv::cuda::GpuMat 00:00:00.207 <- CPU (cv::Mat) 00:00:00.834, GPU (cv::UMat) 00:00:00.832 / 4.43 times faster than the results above
Test cv::cuda::multiply     cicle count 150 cv::cuda::GpuMat 00:00:00.490 <- CPU (cv::Mat) 00:00:00.277, GPU (cv::UMat) 00:00:00.280 / 3.69 times faster than the results above
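
The gain figures quoted on these lines are ratios of the "before" and "after" totals. A minimal sketch that recomputes them, and also compares the "after" CUDA totals with the cv::Mat totals measured on the same board, is shown here. The dictionary and function names are assumptions for illustration; the numbers are copied from the log lines above.

```python
# Totals in seconds for 150 cycles on the Jetson TX2.
before = {"cvtColor": 0.446, "threshold": 0.193, "bitwise_or": 0.172,
          "minMaxLoc": 0.918, "multiply": 1.809}
after = {"cvtColor": 0.074, "threshold": 0.068, "bitwise_or": 0.046,
         "minMaxLoc": 0.207, "multiply": 0.490}
cpu_mat = {"cvtColor": 0.146, "threshold": 0.126, "bitwise_or": 0.159,
           "minMaxLoc": 0.834, "multiply": 0.277}


def ratios(numerator, denominator):
    """Return numerator[name] / denominator[name] for every test name."""
    return {name: numerator[name] / denominator[name] for name in numerator}


clock_gain = ratios(before, after)  # effect of maxing the clocks
vs_cpu = ratios(cpu_mat, after)     # CUDA (after) relative to cv::Mat

for name in after:
    print(f"{name:<11} clocks gain x{clock_gain[name]:.2f}  "
          f"vs cv::Mat x{vs_cpu[name]:.2f}")
```

A ratio below 1 in the last column means the CUDA path is still slower than cv::Mat on the same board, which is the case for multiply with these numbers.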

/ ** Test on the Intel (R) Core (TM) i7-6700 CPU @ 3.40GHz * /
Test cv::cvtColor      cicle count 150  cv::Mat 00:00:00.063
Test cv::threshold     cicle count 150  cv::Mat 00:00:00.017
Test cv::bitwise_or    cicle count 150  cv::Mat 00:00:00.041
Test cv::matchTemplate cicle count 150  cv::Mat 00:00:26.008
Test cv::minMaxLoc     cicle count 150  cv::Mat 00:00:00.199
Test cv::findContours  cicle count 150  cv::Mat 00:00:00.044
Test cv::multiply      cicle count 150  cv::Mat 00:00:00.384

Question:
Do I understand correctly that the Jetson TX2 board is slower in these tests than the same computations on the CPU (Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz), or is there some other way to speed up processing on the Jetson TX2?

Thanks.

Bohdanms · September 13, 2018, 7:08am

I don't understand: is there any way to increase performance on the Jetson TX2 board? Or do I already have maximum performance once I run the command “sudo ./jetson_clocks.sh”?

linuxdev · September 13, 2018, 10:40pm

“jetson_clocks.sh” will max out the clocks within the currently allowed clock range. Different modes may enable different clock ranges. If you first run “sudo nvpmodel -m 0”, and then run “sudo ~/ubuntu/tegra_clocks.sh”, you will know that the Denver cores are also running and that the range of available clocks is set to max.

Someone may be able to offer advice on how to profile this to narrow down where the slowdown is.

AastaLLL · September 14, 2018, 6:33am

Hi,

Could you monitor your program with tegrastats to check the GPU load?

sudo ~/tegrastats

By the way, we recommend benchmarking with our official CUDA samples rather than a third-party implementation.
For example, matrixMul in the CUDA samples.

Thanks.

Bohdanms · September 14, 2018, 6:56am

OpenCV is listed as an NVIDIA partner (https://developer.nvidia.com/gpu-accelerated-libraries). Does this mean that the OpenCV library counts as official? The tests are meant to show how much the OpenCV functions are accelerated with CUDA.

AastaLLL · September 18, 2018, 7:36am

Hi,

The implementation is from OpenCV.
We also have our own vision APIs, and we recommend benchmarking with them.

May I know what your goal is?
Do you want to compare speed between processors, or do you want an optimized implementation for a vision problem?
Thanks.

Bohdanms · September 18, 2018, 10:28am

My main goal is to compare performance on my PC (Intel i7-6700, NVIDIA GeForce 750 Ti) with performance on the Jetson TX2 board using OpenCV.
On my PC I get maximum performance when I use OpenCV with the OpenCL implementation. CUDA on my PC is slower than OpenCL.
After that I need to build a device for computer vision. I want to compare performance on these devices:

Intel i7-6700 (CPU processing)
NVIDIA GeForce 750 Ti (OpenCV with OpenCL implementation)
NVIDIA GeForce 750 Ti (OpenCV with CUDA implementation)
NVIDIA Tegra TX2 (OpenCV with CUDA implementation)

Also, my algorithm is already implemented with OpenCV, and for now I cannot look at your vision API.

Thanks.

AastaLLL · September 20, 2018, 6:41am

Hi,

Here are two available vision APIs for your reference:

NPP: https://developer.nvidia.com/npp
VisionWorks: https://developer.nvidia.com/embedded/visionworks

Thanks.
