Good day.
I ran the tests on an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz.
The video card is an NVIDIA GeForce GTX 750 Ti.
When I run a self-written test that uses OpenCV library functions on this machine, I get the following results:
OpenCL device Name :GeForce GTX 750 Ti
OpenCL device Available :1
OpenCL device ImageSupport :1
OpenCL device OpenCL C Version:OpenCL C 1.2
OpenCL device OpenCL Version :cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
OpenCL device Driver Version :390.48
OpenCL device Version :OpenCL 1.2 CUDA
Default OpenCL device Name :GeForce GTX 750 Ti
Default OpenCL device Available :1
Default OpenCL device ImageSupport :1
Default OpenCL device OpenCL_C_Version:OpenCL C 1.2
Default OpenCL device OpenCL Version :cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
Default OpenCL device Driver Version :390.48
Default OpenCL device Version :OpenCL 1.2 CUDA
Test cv::cvtColor cicle count 150 cv::UMat 00:00:00.003
Test cv::cvtColor cicle count 150 cv::Mat 00:00:00.064
Test cv::filter2D cicle count 150 cv::UMat 00:00:00.004
Test cv::filter2D cicle count 150 cv::Mat 00:00:03.046
Test cv::threshold cicle count 150 cv::UMat 00:00:00.002
Test cv::threshold cicle count 150 cv::Mat 00:00:00.013
Test cv::dilate cicle count 150 cv::UMat 00:00:00.541
Test cv::dilate cicle count 150 cv::Mat 00:00:00.244
Test cv::bitwise_or cicle count 150 cv::UMat 00:00:00.002
Test cv::bitwise_or cicle count 150 cv::Mat 00:00:00.037
Test cv::matchTemplate cicle count 150 cv::UMat 00:00:07.533
Test cv::matchTemplate cicle count 150 cv::Mat 00:00:25.727
Test cv::minMaxLoc cicle count 150 cv::UMat 00:00:00.025
Test cv::minMaxLoc cicle count 150 cv::Mat 00:00:00.206
Test cv::findContours cicle count 150 cv::UMat 00:00:00.119
Test cv::findContours cicle count 150 cv::Mat 00:00:00.044
Test cv::multiply cicle count 150 cv::UMat 00:00:00.270
Test cv::multiply cicle count 150 cv::Mat 00:00:00.376
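The test program itself is not shown here, so the following is only a sketch of how a loop like the one behind the lines above could be timed. It assumes plain wall-clock timing with std::chrono. The names time_cycles and format_elapsed are made up for illustration and are not part of OpenCV.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <string>

// Format a duration as HH:MM:SS.mmm, the layout used in the log lines above.
std::string format_elapsed(std::chrono::milliseconds total) {
    assert(total.count() >= 0);
    const long long ms = total.count();
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%02lld:%02lld:%02lld.%03lld",
                  ms / 3600000, (ms / 60000) % 60, (ms / 1000) % 60, ms % 1000);
    return buf;
}

// Call `op` `cycles` times and return the wall-clock time of the whole loop.
// Only the time the host spends inside the calls is measured.
template <typename Op>
std::chrono::milliseconds time_cycles(int cycles, Op op) {
    assert(cycles > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < cycles; ++i) {
        op();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
}
```

With OpenCV available, a call such as time_cycles(150, [&] { cv::cvtColor(src, dst, cv::COLOR_BGR2GRAY); }) would time 150 conversions, where src and dst are declared as cv::Mat or cv::UMat depending on the variant under test.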
Next, I run the same OpenCV functions in their GPU versions (the cv::cuda module, using CUDA) and get these results:
CudaEnabledDeviceCount 1
*** CUDA Device Query (Runtime API) version (CUDART static linking) ***
Device count: 1
Device 0: “GeForce GTX 750 Ti”
CUDA Driver Version / Runtime Version 9.10 / 9.10
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2001 MBytes (2098069504 bytes)
GPU Clock Speed: 1.11 GHz
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.10, CUDA Runtime Version = 9.10, NumDevs = 1
Test cv::cuda::cvtColor cicle count 150 cv::cuda::GpuMat 00:00:00.027 ← CPU 00:00:00.064, GPU (OpenCL cv::UMat) 00:00:00.003
Test cv::cuda::threshold cicle count 150 cv::cuda::GpuMat 00:00:00.025 ← CPU 00:00:00.013, GPU (OpenCL cv::UMat) 00:00:00.002
Test cv::cuda::bitwise_or cicle count 150 cv::cuda::GpuMat 00:00:00.020 ← CPU 00:00:00.037, GPU (OpenCL cv::UMat) 00:00:00.002
Test cv::cuda::minMaxLoc cicle count 150 cv::cuda::GpuMat 00:00:00.118 ← CPU 00:00:00.206, GPU (OpenCL cv::UMat) 00:00:00.025
Test cv::cuda::multiply cicle count 150 cv::cuda::GpuMat 00:00:00.150 ← CPU 00:00:00.376, GPU (OpenCL cv::UMat) 00:00:00.270
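To put a number on the gap in each row, the time stamps can be converted to milliseconds and divided. This is a small sketch, assuming every stamp has the HH:MM:SS.mmm layout with exactly three fractional digits, as in the logs above. The helper names parse_ms and slowdown are hypothetical.

```cpp
#include <cassert>
#include <cstdio>

// Parse "HH:MM:SS.mmm" into milliseconds. Returns -1 if the stamp is malformed.
// Assumes exactly three fractional digits, as printed by the benchmark.
long long parse_ms(const char* stamp) {
    int h = 0, m = 0, s = 0, ms = 0;
    if (std::sscanf(stamp, "%d:%d:%d.%d", &h, &m, &s, &ms) != 4) {
        return -1;
    }
    return ((h * 60LL + m) * 60 + s) * 1000 + ms;
}

// How many times longer `candidate` took than `baseline`.
double slowdown(const char* candidate, const char* baseline) {
    const long long c = parse_ms(candidate);
    const long long b = parse_ms(baseline);
    assert(c >= 0 && b > 0);
    return static_cast<double>(c) / static_cast<double>(b);
}
```

For the cvtColor row, slowdown("00:00:00.027", "00:00:00.003") gives 9, meaning the cv::cuda::GpuMat run took nine times as long as the cv::UMat run. With totals of only a few milliseconds, the rounding in the log has a large effect on such ratios.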
Questions:
1 - Why, in my self-written benchmark, do the CUDA tests (cv::cuda::GpuMat and the cv::cuda::… functions) run slower than the OpenCL tests (cv::UMat)?
Next, I run the test on a Jetson TX2 board running Ubuntu 16.04 (OpenCV without CUDA, installed from NVIDIA JetPack) and get the following results:
Test cv::cvtColor cicle count 150 cv::UMat 00:00:00.150
Test cv::cvtColor cicle count 150 cv::Mat 00:00:00.146
Test cv::filter2D cicle count 150 cv::UMat 00:00:09.198
Test cv::filter2D cicle count 150 cv::Mat 00:00:09.178
Test cv::threshold cicle count 150 cv::UMat 00:00:00.139
Test cv::threshold cicle count 150 cv::Mat 00:00:00.126
Test cv::dilate cicle count 150 cv::UMat 00:00:01.031
Test cv::dilate cicle count 150 cv::Mat 00:00:01.026
Test cv::bitwise_or cicle count 150 cv::UMat 00:00:00.173
Test cv::bitwise_or cicle count 150 cv::Mat 00:00:00.159
Test cv::matchTemplate cicle count 150 cv::UMat 00:01:18.369
Test cv::matchTemplate cicle count 150 cv::Mat 00:01:17.294
Test cv::minMaxLoc cicle count 150 cv::UMat 00:00:00.832
Test cv::minMaxLoc cicle count 150 cv::Mat 00:00:00.834
Test cv::findContours cicle count 150 cv::UMat 00:00:00.662
Test cv::findContours cicle count 150 cv::Mat 00:00:00.659
Test cv::multiply cicle count 150 cv::UMat 00:00:00.280
Test cv::multiply cicle count 150 cv::Mat 00:00:00.277
Next, I run the test on the Jetson TX2 board running Ubuntu 16.04 (OpenCV 3.4.1 compiled for Tegra with CUDA) and get the following results:
Have Open CL [ INFO:0] Initialize OpenCL runtime…
0
Use Open CL 0
Have AMD Blas 0
Have AMD FFT 0
Have SVM 0
OpenCL is not available…
CudaEnabledDeviceCount 1
*** CUDA Device Query (Runtime API) version (CUDART static linking) ***
CUDA Device count: 1
Device 0: “NVIDIA Tegra X2”
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.2
Total amount of global memory: 7846 MBytes (8227401728 bytes)
GPU Clock Speed: 1.30 GHz
Max Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072,65536), 3D=(16384,16384,16384)
Max Layered Texture Size (dim) x layers 1D=(32768) x 2048, 2D=(32768,32768) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 0 / 0
Compute Mode:
Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Test cv::cuda::cvtColor cicle count 150 cv::cuda::GpuMat 00:00:00.446 ← CPU (cv::Mat) 00:00:00.146, GPU (OpenCL cv::UMat) 00:00:00.150
Test cv::cuda::threshold cicle count 150 cv::cuda::GpuMat 00:00:00.193 ← CPU (cv::Mat) 00:00:00.126, GPU (OpenCL cv::UMat) 00:00:00.139
Test cv::cuda::bitwise_or cicle count 150 cv::cuda::GpuMat 00:00:00.172 ← CPU (cv::Mat) 00:00:00.159, GPU (OpenCL cv::UMat) 00:00:00.173
Test cv::cuda::minMaxLoc cicle count 150 cv::cuda::GpuMat 00:00:00.918 ← CPU (cv::Mat) 00:00:00.834, GPU (OpenCL cv::UMat) 00:00:00.832
Test cv::cuda::findContours cicle count 150 cv::cuda::GpuMat 00:00:00.000 ← CPU (cv::Mat) 00:00:00.659, GPU (OpenCL cv::UMat) 00:00:00.662
Test cv::cuda::multiply cicle count 150 cv::cuda::GpuMat 00:00:01.809 ← CPU (cv::Mat) 00:00:00.277, GPU (OpenCL cv::UMat) 00:00:00.280
Questions:
1 - Why do I get no performance gain, and even a performance penalty, when I run my self-written benchmark on the Jetson TX2 board using CUDA?
2 - Why is cv::UMat not processed on the GPU when I run my self-written benchmark on the Jetson TX2 board using CUDA, so that I see no performance gain?
I also tested CUDA performance with the “boxFilter” program from the CUDA samples, started with the “benchmark” parameter. The results:
./boxFilter Starting…
Loaded ‘./data/lenaRGB.ppm’, 1024 x 1024 pixels
GPU Device 0: “NVIDIA Tegra X2” with compute capability 6.2
[runBenchmark]: [CUDA Iterative Box Filter]
Running BoxFilterGPU for 150 cycles…
boxFilter-texture, Throughput = 270.0057 M RGBA Pixels/s, Time = 0.00388 s, Size = 1048576 RGBA Pixels, NumDevsUsed = 1, Workgroup = 64
Time = 0.00388 is the processing time of one frame in seconds using only CUDA (I checked this in the source code).
For comparison, the similar function (boxFilter) in the OpenCV library, computed with OpenCL on the GTX 750 Ti, takes 7.4729e-05 s per frame for the same image (lenaRGB.ppm, 1024 x 1024 px, 3 channels, from the boxFilter example). That is about 52 times faster than CUDA on the NVIDIA Jetson TX2 board:
Test cv::boxFilter cicle count 150 cv::UMat 00:00:00.011 time per frame 7.4729e-05
Test cv::boxFilter cicle count 150 cv::Mat 00:00:00.491 time per frame 0.00327866
Test cv::blur cicle count 150 cv::UMat 00:00:00.003 time per frame 2.03031e-05
Test cv::blur cicle count 150 cv::Mat 00:00:00.477 time per frame 0.0031854
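The arithmetic behind the figures above can be checked with a few lines. This is only a sketch. The function names are made up for illustration, and the inputs are the numbers printed in the logs, so the results inherit their rounding.

```cpp
#include <cassert>

// Average time per frame in seconds for a loop of `cycles` iterations.
double per_frame_seconds(double total_seconds, int cycles) {
    assert(cycles > 0);
    return total_seconds / cycles;
}

// Throughput in millions of pixels per second.
double mpixels_per_second(long long pixels, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(pixels) / seconds / 1e6;
}

// How many times faster the `fast` time is than the `slow` time.
double speedup(double slow_seconds, double fast_seconds) {
    assert(fast_seconds > 0.0);
    return slow_seconds / fast_seconds;
}
```

With the logged values, speedup(0.00388, 7.4729e-05) comes out just under 52, and mpixels_per_second(1048576, 0.00388) is close to the 270 M pixels/s reported by the boxFilter sample. Note that the two times being compared come from different hardware: the Jetson TX2 for the CUDA sample and the GTX 750 Ti for the OpenCL run.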
Questions:
1 - What am I doing wrong?
2 - How can I speed up GPU processing using CUDA?