Slow performance with OpenCV on Jetson TX2

arseal35 · August 18, 2018, 9:04pm

I run the same code on my laptop (HP Notebook 250 G6, Intel Core i5-7200U/8GB/256GB SSD/15.6") with Xubuntu 16 and on an NVIDIA Jetson TX2. On the laptop the code runs smoothly with no problems:

import cv2
import numpy as np

RESCALE_FACTOR = 0.5

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, int(1280*RESCALE_FACTOR))
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, int(720*RESCALE_FACTOR))

for i in range(15):
    _, frame = cap.read()

r = cv2.selectROI(frame)

while 1:
    _, frame = cap.read()

    img_crop = frame[int(r[1]):int(r[1]+r[3]), int(r[0]):int(r[0]+r[2])]

    hsv = cv2.cvtColor(img_crop, cv2.COLOR_BGR2HSV)

    lower_red = np.array([0, 50, 0])
    upper_red = np.array([20, 255, 255])

    mask = cv2.inRange(hsv, lower_red, upper_red)
    res = cv2.bitwise_and(img_crop, img_crop, mask=mask)
    blur = cv2.GaussianBlur(res, (15, 15), 0)
    # cv2.imshow('Gaussian Blurring', blur)

    _, puck = cv2.threshold(blur, 70, 255, cv2.THRESH_BINARY)
    cv2.imshow('Puck', puck)

    xs = np.where(puck != 0)[1]
    ys = np.where(puck != 0)[0]

    xstd = np.std(xs)
    ystd = np.std(ys)

    x_init_avg = np.mean(xs)
    y_init_avg = np.mean(ys)

    # Keep only coordinates within one standard deviation of the mean.
    xs = [x for x in xs if x <= x_init_avg+xstd and x >= x_init_avg-xstd]
    ys = [y for y in ys if y <= y_init_avg+ystd and y >= y_init_avg-ystd]
    xavg = np.mean(xs)
    yavg = np.mean(ys)

    print(xavg, yavg)

    k = cv2.waitKey(5) & 0xFF
    if k == 27:
        break

This is the code on the Jetson TX2:

import cv2
import numpy as np

RESCALE_FACTOR = 0.5

gst = "nvcamerasrc ! video/x-raw(memory:NVMM), width=(int)640, height=(int)480, format=(string)I420, framerate=(fraction)30/1 ! nvvidconv ! video/x-raw, format=(string)BGRx ! videoconvert ! video/x-raw, format=(string)BGR ! appsink "
cap = cv2.VideoCapture(gst)

for i in range(15):
    _, frame = cap.read()

r = cv2.selectROI(frame)

while 1:
    _, frame = cap.read()

    img_crop = frame[int(r[1]):int(r[1]+r[3]), int(r[0]):int(r[0]+r[2])]

    hsv = cv2.cvtColor(img_crop, cv2.COLOR_BGR2HSV)

    lower_red = np.array([0, 50, 0])
    upper_red = np.array([20, 255, 255])

    mask = cv2.inRange(hsv, lower_red, upper_red)
    res = cv2.bitwise_and(img_crop, img_crop, mask=mask)
    blur = cv2.GaussianBlur(res, (15, 15), 0)
    # cv2.imshow('Gaussian Blurring', blur)

    _, puck = cv2.threshold(blur, 70, 255, cv2.THRESH_BINARY)
    cv2.imshow('Puck', puck)

    xs = np.where(puck != 0)[1]
    ys = np.where(puck != 0)[0]

    xstd = np.std(xs)
    ystd = np.std(ys)

    x_init_avg = np.mean(xs)
    y_init_avg = np.mean(ys)

    # Keep only coordinates within one standard deviation of the mean.
    xs = [x for x in xs if x <= x_init_avg+xstd and x >= x_init_avg-xstd]
    ys = [y for y in ys if y <= y_init_avg+ystd and y >= y_init_avg-ystd]
    xavg = np.mean(xs)
    yavg = np.mean(ys)

    print(xavg, yavg)

    k = cv2.waitKey(5) & 0xFF
    if k == 27:
        break

The problem is on the Jetson TX2: CPU usage sits at 99% and the rendered video is slow and choppy.

Is there some trick I need to speed up video rendering on the Jetson TX2?

Thank you!

WayneWWW · August 20, 2018, 3:30am

arseal35,

Could you dump the result of “sudo ./tegrastats” when running your app?

arseal35 · August 22, 2018, 3:12pm

Of course:

RAM 1396/7851MB (lfb 1412x4MB) cpu [0%@2035,0%@2035,0%@2035,0%@2033,0%@2034,0%@2034] EMC 6%@1866 APE 150 GR3D 0%@1300
RAM 1398/7851MB (lfb 1412x4MB) cpu [10%@2036,1%@2036,99%@2035,5%@2035,6%@2034,6%@2034] EMC 6%@1866 APE 150 GR3D 0%@1300
RAM 1396/7851MB (lfb 1412x4MB) cpu [13%@2010,3%@2035,98%@2035,10%@2008,11%@2013,21%@2008] EMC 6%@1866 APE 150 GR3D 0%@1300
RAM 1395/7851MB (lfb 1412x4MB) cpu [9%@2012,0%@2035,99%@2034,11%@2011,8%@2014,7%@2011] EMC 6%@1866 APE 150 GR3D 9%@1300
RAM 1397/7851MB (lfb 1412x4MB) cpu [8%@2034,1%@2035,99%@2034,8%@2034,10%@2034,7%@2034] EMC 6%@1866 APE 150 GR3D 1%@1300
RAM 1397/7851MB (lfb 1412x4MB) cpu [5%@2034,1%@2035,98%@2034,14%@2035,7%@2033,5%@2035] EMC 6%@1866 APE 150 GR3D 0%@1300
RAM 1397/7851MB (lfb 1412x4MB) cpu [9%@2036,0%@2034,99%@2035,2%@2034,5%@2036,3%@2035] EMC 6%@1866 APE 150 GR3D 10%@1300
RAM 1396/7851MB (lfb 1412x4MB) cpu [4%@2035,1%@2035,99%@2035,5%@2035,5%@2036,4%@2035] EMC 6%@1866 APE 150 GR3D 0%@1300
RAM 1396/7851MB (lfb 1412x4MB) cpu [1%@2034,1%@2036,100%@2036,5%@2035,2%@2035,6%@2034] EMC 6%@1866 APE 150 GR3D 0%@1300
RAM 1398/7851MB (lfb 1412x4MB) cpu [4%@2035,3%@2034,98%@2034,2%@2034,7%@2035,8%@2035] EMC 6%@1866 APE 150 GR3D 6%@1300
RAM 1398/7851MB (lfb 1412x4MB) cpu [4%@2034,2%@2035,99%@2036,4%@2035,4%@2034,7%@2035] EMC 6%@1866 APE 150 GR3D 0%@1300
RAM 1399/7851MB (lfb 1412x4MB) cpu [5%@2036,2%@2036,100%@2035,4%@2035,7%@2034,5%@2035] EMC 6%@1866 APE 150 GR3D 0%@1300
RAM 1397/7851MB (lfb 1412x4MB) cpu [6%@2034,0%@2034,98%@2035,4%@2035,3%@2034,3%@2035] EMC 6%@1866 APE 150 GR3D 0%@1300
RAM 1397/7851MB (lfb 1412x4MB) cpu [6%@2034,0%@2035,99%@2035,4%@2034,3%@2035,3%@2035] EMC 6%@1866 APE 150 GR3D 0%@1300
RAM 1397/7851MB (lfb 1412x4MB) cpu [4%@2033,0%@2035,99%@2035,3%@2035,6%@2035,3%@2036] EMC 6%@1866 APE 150 GR3D 0%@1300
RAM 1397/7851MB (lfb 1412x4MB) cpu [11%@2035,3%@2035,98%@2034,9%@2035,8%@2034,15%@2035] EMC 6%@1866 APE 150 GR3D 1%@1300
RAM 1397/7851MB (lfb 1412x4MB) cpu [8%@2035,3%@2035,99%@2035,8%@2036,7%@2036,8%@2034] EMC 6%@1866 APE 150 GR3D 0%@1300
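The per-core loads in the log above can be pulled out with a small parser. This is only a sketch: the function name `parse_cpu_loads` is made up for illustration, and it assumes every online core is printed as `load%@freq` inside the `cpu [...]` field, as in the lines pasted here.

```python
import re

# Matches the bracketed CPU field and each "load%@freq" entry inside it.
CPU_FIELD = re.compile(r"cpu \[([^\]]*)\]")
CORE_LOAD = re.compile(r"(\d+)%@(\d+)")


def parse_cpu_loads(line):
    """Return the per-core load percentages found in one tegrastats line."""
    field = CPU_FIELD.search(line)
    if field is None:
        return []
    return [int(load) for load, _freq in CORE_LOAD.findall(field.group(1))]


# Second line of the log above, split only to keep the source line short.
line = ("RAM 1398/7851MB (lfb 1412x4MB) cpu [10%@2036,1%@2036,99%@2035,"
        "5%@2035,6%@2034,6%@2034] EMC 6%@1866 APE 150 GR3D 0%@1300")

loads = parse_cpu_loads(line)
busiest = max(range(len(loads)), key=loads.__getitem__)
print(loads, busiest)  # [10, 1, 99, 5, 6, 6] 2
```

Run over the whole log, this shows the same core (index 2) at 98-100% on every sample after the first while the other five stay low.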

The problem is video rendering: it is very slow. I’ve uploaded a video demo of the issue.

When the camera receives more light, or when objects are added in front of it, the video slows down.

Thank you!

WayneWWW · August 23, 2018, 2:19am

arseal35,

Yes, one CPU core is at 100%, but you are not using the other cores or the GPU.

To improve the performance, I think you should modify the code. Your code does all of its computation on the CPU, and it is a single-threaded program.
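One CPU-side change that may be worth trying before reaching for the GPU: the posted loop filters the pixel coordinates with Python list comprehensions, which run once per nonzero pixel in the interpreter. A minimal sketch of the same step done with NumPy boolean indexing follows. It assumes a 2D mask and assumes the intent is to keep coordinates within one standard deviation of the mean on each axis; `robust_centroid` is a hypothetical helper name, and whether this step is the actual bottleneck has not been measured in this thread.

```python
import numpy as np


def robust_centroid(mask):
    """Mean (x, y) of nonzero pixels, ignoring per-axis outliers.

    A coordinate counts as an outlier when it lies more than one
    standard deviation from the mean of that axis.
    Returns None when the mask has no nonzero pixels.
    """
    ys, xs = np.nonzero(mask)  # row indices, column indices
    if xs.size == 0:
        return None
    xs_in = xs[np.abs(xs - xs.mean()) <= xs.std()]
    ys_in = ys[np.abs(ys - ys.mean()) <= ys.std()]
    return float(xs_in.mean()), float(ys_in.mean())


# Synthetic example: a 3x3 blob plus one stray pixel in the corner.
demo = np.zeros((10, 10), dtype=np.uint8)
demo[2:5, 3:6] = 255
demo[9, 9] = 255
print(robust_centroid(demo))  # (4.0, 3.0)
```

In the posted loop, `puck` has three channels, so it would first need reducing to 2D, for example with `puck.any(axis=2)`. That counts each pixel once rather than once per nonzero channel, so the numbers can differ slightly from the original.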

It sounds to me as if you expect a single CPU core on Tegra to be equivalent to an i5-7200U. What is the frame rate on Tegra, and what is it on the i5?
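To answer the frame-rate question with a number on both machines, the per-frame work can be timed directly. A minimal sketch, assuming the loop body is wrapped in a function; `measure_fps` and `process_frame` are illustrative names, not part of the posted code.

```python
import time


def measure_fps(process_frame, frames):
    """Return how many frames per second process_frame handles."""
    count = 0
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


# Stand-in workload: sleep about 1 ms per "frame" instead of doing image work.
fps = measure_fps(lambda frame: time.sleep(0.001), range(20))
print(f"{fps:.1f} frames/s")
```

With a real camera, `frames` could be a generator that yields the result of `cap.read()`, so that capture time is included in the measurement.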

arseal35 · August 23, 2018, 12:04pm

Thank you for the response.

Where can I find a guide or resource on using the GPU or several CPU cores in this code instead of a single core?
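On the "several CPU cores" part, one common pattern is to move frame capture into its own thread so the processing loop never waits on the camera. The sketch below is an assumption about what might help, not something suggested in this thread; `LatestFrameReader` is a hypothetical class, and `read_fn` stands for any callable that returns `(ok, frame)` the way `cap.read` does.

```python
import threading


class LatestFrameReader:
    """Calls read_fn in a background thread and keeps only the newest frame."""

    def __init__(self, read_fn):
        self._read_fn = read_fn
        self._lock = threading.Lock()
        self._frame = None
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def _run(self):
        # Keep reading until asked to stop or until the source reports failure.
        while not self._stopped.is_set():
            ok, frame = self._read_fn()
            if not ok:
                break
            with self._lock:
                self._frame = frame

    def latest(self):
        """Return the most recent frame, or None if none has arrived yet."""
        with self._lock:
            return self._frame

    def stop(self):
        self._stopped.set()
        self._thread.join(timeout=1.0)
```

Usage would look like `reader = LatestFrameReader(cap.read).start()`, with the main loop calling `reader.latest()` instead of `cap.read()`. Older frames are dropped on purpose, which trades completeness for lower latency.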

I need optimal performance for my factory.

Thank you so much!

Regards

Bohdanms · September 10, 2018, 7:32am

Hi all, I have a problem. I don't see a performance gain when I use OpenCV with CUDA (I built OpenCV from source), but I do see one when I use OpenCL (cv::UMat).

WayneWWW · September 10, 2018, 7:58am

Bohdanms,

Your description is not clear. Which functions did you use with OpenCV + CUDA?

Bohdanms · September 12, 2018, 6:59am

Good day.

I ran the tests on an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz with an NVIDIA GeForce GTX 750 Ti video card.

When I run my own test program, which calls OpenCV library functions, on this machine, I get the following results:

OpenCL device Name            :GeForce GTX 750 Ti
OpenCL device Available       :1
OpenCL device ImageSupport    :1
OpenCL device OpenCL C Version:OpenCL C 1.2 
OpenCL device OpenCL Version  :cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
OpenCL device Driver Version  :390.48
OpenCL device Version         :OpenCL 1.2 CUDA

Default OpenCL device Name            :GeForce GTX 750 Ti
Default OpenCL device Available       :1
Default OpenCL device ImageSupport    :1
Default OpenCL device OpenCL_C_Version:OpenCL C 1.2 
Default OpenCL device OpenCL Version  :cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
Default OpenCL device Driver Version  :390.48
Default OpenCL device Version         :OpenCL 1.2 CUDA


Test cv::cvtColor      cicle count 150 cv::UMat 00:00:00.003
Test cv::cvtColor      cicle count 150  cv::Mat 00:00:00.064

Test cv::filter2D      cicle count 150 cv::UMat 00:00:00.004
Test cv::filter2D      cicle count 150  cv::Mat 00:00:03.046

Test cv::threshold     cicle count 150 cv::UMat 00:00:00.002
Test cv::threshold     cicle count 150  cv::Mat 00:00:00.013

Test cv::dilate        cicle count 150 cv::UMat 00:00:00.541
Test cv::dilate        cicle count 150  cv::Mat 00:00:00.244

Test cv::bitwise_or    cicle count 150 cv::UMat 00:00:00.002
Test cv::bitwise_or    cicle count 150  cv::Mat 00:00:00.037

Test cv::matchTemplate cicle count 150 cv::UMat 00:00:07.533
Test cv::matchTemplate cicle count 150  cv::Mat 00:00:25.727

Test cv::minMaxLoc     cicle count 150 cv::UMat 00:00:00.025
Test cv::minMaxLoc     cicle count 150  cv::Mat 00:00:00.206

Test cv::findContours  cicle count 150 cv::UMat 00:00:00.119
Test cv::findContours  cicle count 150  cv::Mat 00:00:00.044

Test cv::multiply      cicle count 150 cv::UMat 00:00:00.270
Test cv::multiply      cicle count 150  cv::Mat 00:00:00.376
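The speedup of `cv::UMat` over `cv::Mat` can be read off these timings with a few lines of arithmetic. This is a sketch only: the helper names are made up, and just three rows from the table above are copied in to keep it short.

```python
def to_seconds(stamp):
    """Convert an 'HH:MM:SS.mmm' string into seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def speedup(umat_time, mat_time):
    """How many times faster the cv::UMat run was than the cv::Mat run."""
    return to_seconds(mat_time) / to_seconds(umat_time)


# (cv::UMat time, cv::Mat time) for 150 cycles, copied from the results above.
results = {
    "cv::filter2D": ("00:00:00.004", "00:00:03.046"),
    "cv::dilate": ("00:00:00.541", "00:00:00.244"),
    "cv::matchTemplate": ("00:00:07.533", "00:00:25.727"),
}

for name, (umat_time, mat_time) in results.items():
    print(f"{name}: {speedup(umat_time, mat_time):.2f}x")
```

A value below 1, as for `cv::dilate`, means the `cv::UMat` run was slower than the `cv::Mat` run.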

Next, I ran the same test with the GPU (CUDA) variants of those OpenCV functions and got these results:

CudaEnabledDeviceCount 1
*** CUDA Device Query (Runtime API) version (CUDART static linking) *** 

Device count: 1

Device 0: "GeForce GTX 750 Ti"
  CUDA Driver Version / Runtime Version          9.10 / 9.10
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2001 MBytes (2098069504 bytes)
  GPU Clock Speed:                               1.11 GHz
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           1 / 0
  Compute Mode:
      Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) 

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version  = 9.10, CUDA Runtime Version = 9.10, NumDevs = 1

Test cv::cuda::cvtColor   cicle count 150 cv::cuda::GpuMat 00:00:00.027 <- CPU 00:00:00.064, GPU (OpenCL cv::UMat) 00:00:00.003
Test cv::cuda::threshold  cicle count 150 cv::cuda::GpuMat 00:00:00.025 <- CPU 00:00:00.013, GPU (OpenCL cv::UMat) 00:00:00.002
Test cv::cuda::bitwise_or cicle count 150 cv::cuda::GpuMat 00:00:00.020 <- CPU 00:00:00.002, GPU (OpenCL cv::UMat) 00:00:00.002
Test cv::cuda::minMaxLoc  cicle count 150 cv::cuda::GpuMat 00:00:00.118 <- CPU 00:00:00.206, GPU (OpenCL cv::UMat) 00:00:00.025 
Test cv::cuda::multiply   cicle count 150 cv::cuda::GpuMat 00:00:00.150 <- CPU 00:00:00.376, GPU (OpenCL cv::UMat) 00:00:00.270

Question:
1 - In my own benchmark, why do the CUDA tests (cv::cuda::GpuMat and cv::cuda::…) run slower than the OpenCL tests (cv::UMat)?
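One thing that may be worth ruling out, although nothing in this thread confirms it: GPU back ends often pay a one-time initialization cost on the first call and may queue work asynchronously, so a timer around a bare loop can measure something other than the actual compute time. A minimal harness sketch with warm-up iterations and an optional synchronization hook follows; `bench` and its `sync` parameter are hypothetical names.

```python
import time


def bench(fn, iters=150, warmup=5, sync=None):
    """Return the average seconds per call of fn.

    warmup calls run first and are not timed. sync, if given, is called
    before the timer starts and again before it stops, so that queued
    asynchronous work is finished at both ends of the measurement.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters


# Stand-in workload; a real test would call the OpenCV function here.
per_call = bench(lambda: sum(range(1000)), iters=150, warmup=5)
print(f"{per_call * 1e6:.1f} microseconds per call")
```

For the OpenCL path, the hook I would expect to use is `cv::ocl::finish()`. Whether the CUDA timings above include upload and download of the images is also not stated in the post, and that would change the comparison.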

Next, I ran the test on the Jetson TX2 board with Ubuntu 16.04 (OpenCV without CUDA, as installed from NVIDIA JetPack) and got the following results:

Test cv::cvtColor      cicle count 150 cv::UMat 00:00:00.150
Test cv::cvtColor      cicle count 150  cv::Mat 00:00:00.146

Test cv::filter2D      cicle count 150 cv::UMat 00:00:09.198
Test cv::filter2D      cicle count 150  cv::Mat 00:00:09.178

Test cv::threshold     cicle count 150 cv::UMat 00:00:00.139
Test cv::threshold     cicle count 150  cv::Mat 00:00:00.126

Test cv::dilate        cicle count 150 cv::UMat 00:00:01.031
Test cv::dilate        cicle count 150  cv::Mat 00:00:01.026

Test cv::bitwise_or    cicle count 150 cv::UMat 00:00:00.173
Test cv::bitwise_or    cicle count 150  cv::Mat 00:00:00.159

Test cv::matchTemplate cicle count 150 cv::UMat 00:01:18.369
Test cv::matchTemplate cicle count 150  cv::Mat 00:01:17.294

Test cv::minMaxLoc     cicle count 150 cv::UMat 00:00:00.832
Test cv::minMaxLoc     cicle count 150  cv::Mat 00:00:00.834

Test cv::findContours  cicle count 150 cv::UMat 00:00:00.662
Test cv::findContours  cicle count 150  cv::Mat 00:00:00.659

Test cv::multiply      cicle count 150 cv::UMat 00:00:00.280
Test cv::multiply      cicle count 150  cv::Mat 00:00:00.277

Next, I ran the test on the Jetson TX2 board with Ubuntu 16.04 (OpenCV with CUDA, version 3.4.1 compiled for Tegra) and got the following results:

Have Open CL  [ INFO:0] Initialize OpenCL runtime...
0

Use  Open CL  0
Have AMD Blas 0
Have AMD FFT  0
Have SVM      0
OpenCL is not available...

CudaEnabledDeviceCount 1

*** CUDA Device Query (Runtime API) version (CUDART static linking) *** 

CUDA Device count: 1

Device 0: "NVIDIA Tegra X2"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.2
  Total amount of global memory:                 7846 MBytes (8227401728 bytes)
  GPU Clock Speed:                               1.30 GHz
  Max Texture Dimension Size (x,y,z)             1D=(131072), 2D=(131072,65536), 3D=(16384,16384,16384)
  Max Layered Texture Size (dim) x layers        1D=(32768) x 2048, 2D=(32768,32768) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           0 / 0
  Compute Mode:
      Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) 

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version  = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1

Test cv::cuda::cvtColor     cicle count 150 cv::cuda::GpuMat 00:00:00.446 <- CPU (cv::Mat) 00:00:00.146, GPU (OpenCL cv::UMat) 00:00:00.150
Test cv::cuda::threshold    cicle count 150 cv::cuda::GpuMat 00:00:00.193 <- CPU (cv::Mat) 00:00:00.126, GPU (OpenCL cv::UMat) 00:00:00.139
Test cv::cuda::bitwise_or   cicle count 150 cv::cuda::GpuMat 00:00:00.172 <- CPU (cv::Mat) 00:00:00.159, GPU (OpenCL cv::UMat) 00:00:00.173
Test cv::cuda::minMaxLoc    cicle count 150 cv::cuda::GpuMat 00:00:00.918 <- CPU (cv::Mat) 00:00:00.834, GPU (OpenCL cv::UMat) 00:00:00.832
Test cv::cuda::findContours cicle count 150 cv::cuda::GpuMat 00:00:00.000 <- CPU (cv::Mat) 00:00:00.659, GPU (OpenCL cv::UMat) 00:00:00.662
Test cv::cuda::multiply     cicle count 150 cv::cuda::GpuMat 00:00:01.809 <- CPU (cv::Mat) 00:00:00.277, GPU (OpenCL cv::UMat) 00:00:00.280

Questions:
1 - Why does my benchmark show no performance gain, and in fact a penalty, when it uses CUDA on the Jetson TX2 board?
2 - Why is cv::UMat not processed on the GPU when I run my benchmark on the Jetson TX2 board, so that I see no performance gain?

I also tested CUDA performance with the “boxFilter” program from the CUDA samples, started with the “benchmark” parameter. The results:

./boxFilter Starting...

Loaded './data/lenaRGB.ppm', 1024 x 1024 pixels
GPU Device 0: "NVIDIA Tegra X2" with compute capability 6.2

[runBenchmark]: [CUDA Iterative Box Filter]

Running BoxFilterGPU for 150 cycles...

boxFilter-texture, Throughput = 270.0057 M RGBA Pixels/s, Time = 0.00388 s, Size = 1048576 RGBA Pixels, NumDevsUsed = 1, Workgroup = 64

Time = 0.00388 s is the processing time of one frame using only CUDA (I checked the source code).

The equivalent function (cv::boxFilter) in the OpenCV library, computed on the GTX 750 Ti with OpenCL, takes 7.4729e-05 s per frame for the same image (the Lena picture from the boxFilter example, 1024x1024 px, 3 channels). That is roughly 52 times faster than CUDA on the NVIDIA Jetson TX2 board.

Test cv::boxFilter cicle count 150 cv::UMat 00:00:00.011 time per frame 7.4729e-05
Test cv::boxFilter cicle count 150  cv::Mat 00:00:00.491 time per frame 0.00327866

Test cv::blur cicle count 150 cv::UMat 00:00:00.003 time per frame 2.03031e-05
Test cv::blur cicle count 150  cv::Mat 00:00:00.477 time per frame 0.0031854
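The speed ratio quoted above follows from the two per-frame times. A short check of the arithmetic, using only numbers from the post (the variable names are illustrative):

```python
ITERATIONS = 150

# Per-frame times in seconds, copied from the outputs above.
cuda_tx2_per_frame = 0.00388        # boxFilter CUDA sample on the Jetson TX2
opencl_gtx_per_frame = 7.4729e-05   # cv::boxFilter with cv::UMat on the GTX 750 Ti

ratio = cuda_tx2_per_frame / opencl_gtx_per_frame
print(round(ratio, 1))  # 51.9

# Cross-check: the per-frame time times 150 cycles should match the reported total.
opencl_total = opencl_gtx_per_frame * ITERATIONS
print(round(opencl_total, 3))  # 0.011
```

Note that the two numbers come from different hardware (an embedded module and a desktop card) and different programs, so the ratio compares whole setups rather than CUDA against OpenCL alone.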

Question:
1 - What am I doing wrong?
2 - How can I speed up GPU processing using CUDA?

WayneWWW · September 17, 2018, 6:47am

Bohdanms,

Sorry in advance if I didn’t get your point. I think you wanted to point out some performance issue on Jetson TX2 in comparison with GTX750.

Have you read the information on performance mode settings in the L4T documentation? Did you run jetson_clocks.sh to raise the GPU frequency when testing?

Bohdanms · September 17, 2018, 8:22am

No. First of all, I am comparing OpenCV's OpenCL implementation with its CUDA implementation, and I don't understand why OpenCV with OpenCL is faster than OpenCV with CUDA. My task is to compare the performance of the two implementations. Also, I now know what the 'jetson_clocks.sh' script does. I included the Jetson TX2 tests only to compare the GTX 750 Ti with CUDA on the Jetson, because I saw no performance gain from OpenCV's CUDA implementation on the GeForce GTX 750 Ti.

WayneWWW · September 17, 2018, 9:50am

Bohdanms,

I think this should be answered by other forum users, or please post these topics on the OpenCV forum.
We don’t own those OpenCV modules, so we cannot give you an authoritative answer.

The only thing I can share with you is how to raise the performance mode.

Bohdanms · September 17, 2018, 10:35am

It would be great if you could share how to raise the performance mode.
Thank you.

WayneWWW · September 18, 2018, 1:55am

As I already said, please refer to our L4T documentation → Power Management for TX2/TX2i Devices.