OpenCV performance issue on TX1

Hi,

I’m using a TX1 development board that was set up with JetPack (full installation).
When I run the following code in Python 2.7, the performance is about 20 times worse than on my MacBook Pro.
Moreover, the GPU does not seem to be used, and only one CPU core is at 100% (checked with tegrastats).

import cv2

# imread returns the image in BGR channel order, so convert with COLOR_BGR2GRAY
img = cv2.imread('messi5.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

e1 = cv2.getTickCount()
for i in xrange(5, 45, 2):  # 20 iterations; i is only a progress counter
    print(i)
    img = cv2.morphologyEx(img, cv2.MORPH_CLOSE, cv2.getStructuringElement(cv2.MORPH_RECT, (11, 11)))
    img = cv2.morphologyEx(img, cv2.MORPH_OPEN, cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15)))
    # three return values: OpenCV 3.x signature
    im2, cnts, _ = cv2.findContours(img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
e2 = cv2.getTickCount()
t = (e2 - e1) / cv2.getTickFrequency()  # elapsed time in seconds
print(t)

Is that what we should expect?
Can we improve this code performance on TX1?

Thanks
Tal

Hi Tal,

If you are using the OpenCV that ships with JetPack directly, GPU acceleration is not enabled. We cannot guarantee that your app’s performance would improve, because we don’t know how “cv2.findContours” is implemented. Please check whether this function supports CUDA.
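One way to check what a given OpenCV build was compiled with is to search the text returned by cv2.getBuildInformation(). The following is a minimal sketch, not an official procedure: find_build_lines is a hypothetical helper, and the default keywords are assumptions about how the build summary is worded, which varies between OpenCV versions.

```python
from __future__ import print_function


def find_build_lines(build_info, keywords=("cuda", "parallel framework", "neon")):
    """Return the lines of an OpenCV build summary that mention any keyword.

    The keyword list is an assumption; adjust it to the wording of your build.
    """
    hits = []
    for line in build_info.splitlines():
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append(line.strip())
    return hits


if __name__ == "__main__":
    try:
        import cv2
    except ImportError:
        cv2 = None  # the helper can still be tried on a machine without OpenCV
    if cv2 is not None:
        # getBuildInformation() returns the CMake configuration summary as text
        for line in find_build_lines(cv2.getBuildInformation()):
            print(line)
```

If no CUDA line appears, or it reports that CUDA is off, the build is CPU-only and every cv2 call in the benchmark runs on the CPU.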

Hi,

Putting the GPU/CUDA optimisations aside for a second, does it make sense that the same CPU-based implementation would be 20 times slower on the TX1’s CPU than on the MacBook’s CPU, as Tal noted?

By the way, a large slowdown was also seen when using just the medianBlur() function. The following code took 8 s on a regular CPU, while on the TX1 it took ~22 s.

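import cv2

img1 = cv2.imread('messi5.jpg')
e1 = cv2.getTickCount()
for i in xrange(5, 15, 2):  # odd kernel sizes 5, 7, 9, 11, 13
    print(i)
    img1 = cv2.medianBlur(img1, i)  # each pass blurs the previous result
e2 = cv2.getTickCount()
t = (e2 - e1) / cv2.getTickFrequency()  # elapsed time in seconds
print(t)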

I’m just trying to understand whether the CPU differences we’re seeing make sense.

Thanks!
Koby

koby.aizer,

Have you pulled the CPU/EMC frequency up? You can do so by running “sudo ./jetson_clocks.sh”.

Hi Wayne,

Greatly appreciate your quick response.

Yes, the numbers I mentioned were measured after I had run jetson_clocks.sh.
I’m also adding the output of tegrastats in case it helps:

RAM 717/3983MB (lfb 663x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,0%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@34.5C CPU@30C GPU@28C PLL@25.5C Tdiode@32C PMIC@100C Tboard@31C thermal@29.25C
RAM 717/3983MB (lfb 663x4MB) IRAM 0/252kB(lfb 252kB) CPU [1%@1734,0%@1734,0%@1734,0%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35C CPU@30C GPU@28C PLL@25.5C Tdiode@32C PMIC@100C Tboard@31C thermal@29C
RAM 792/3983MB (lfb 651x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,37%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35C CPU@31C GPU@28C PLL@27C Tdiode@32C PMIC@100C Tboard@31C thermal@29.5C
RAM 804/3983MB (lfb 645x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 1%@1600 GR3D_FREQ 0%@998 APE 25 AO@35.5C CPU@32C GPU@28.5C PLL@27.5C Tdiode@32.25C PMIC@100C Tboard@31C thermal@30C
RAM 879/3983MB (lfb 630x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 1%@1600 GR3D_FREQ 0%@998 APE 25 AO@35.5C CPU@31.5C GPU@28.5C PLL@27.5C Tdiode@32.25C PMIC@100C Tboard@31C thermal@30C
RAM 891/3983MB (lfb 627x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 1%@1600 GR3D_FREQ 0%@998 APE 25 AO@35.5C CPU@31.5C GPU@28.5C PLL@28C Tdiode@32.5C PMIC@100C Tboard@31C thermal@30C
RAM 902/3983MB (lfb 624x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35.5C CPU@31.5C GPU@29C PLL@28C Tdiode@32.5C PMIC@100C Tboard@31C thermal@30C
RAM 913/3983MB (lfb 621x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35.5C CPU@31.5C GPU@29C PLL@27.5C Tdiode@32.5C PMIC@100C Tboard@32C thermal@30.25C
RAM 913/3983MB (lfb 621x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35.5C CPU@31.5C GPU@29C PLL@28C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.25C
RAM 876/3983MB (lfb 631x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,1%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35.5C CPU@31.5C GPU@29C PLL@27.5C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.5C
RAM 888/3983MB (lfb 627x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35.5C CPU@31.5C GPU@29C PLL@27.5C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.5C
RAM 900/3983MB (lfb 625x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35.5C CPU@31.5C GPU@29C PLL@28C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.25C
RAM 912/3983MB (lfb 621x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35.5C CPU@32C GPU@29C PLL@27.5C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.75C
RAM 913/3983MB (lfb 621x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@36C CPU@32C GPU@29C PLL@28C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.5C
RAM 875/3983MB (lfb 628x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@36C CPU@31.5C GPU@29C PLL@28C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.25C
RAM 887/3983MB (lfb 626x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,1%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@36C CPU@31.5C GPU@29C PLL@28C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.25C
RAM 899/3983MB (lfb 625x4MB) IRAM 0/252kB(lfb 252kB) CPU [1%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@36C CPU@32C GPU@29C PLL@27.5C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.25C
RAM 911/3983MB (lfb 621x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@36C CPU@32C GPU@29C PLL@28C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.5C
RAM 913/3983MB (lfb 621x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35.5C CPU@31.5C GPU@29C PLL@28C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.5C
RAM 874/3983MB (lfb 631x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@36C CPU@31.5C GPU@29C PLL@28C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.5C
RAM 886/3983MB (lfb 628x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35.5C CPU@31.5C GPU@29.5C PLL@28C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.75C
RAM 898/3983MB (lfb 625x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35.5C CPU@31.5C GPU@29.5C PLL@28C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.25C
RAM 911/3983MB (lfb 622x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,1%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@36C CPU@32C GPU@29C PLL@28C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.5C
RAM 913/3983MB (lfb 621x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@36C CPU@32C GPU@29C PLL@28C Tdiode@32.75C PMIC@100C Tboard@32C thermal@30.5C
RAM 717/3983MB (lfb 663x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,46%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35.5C CPU@31C GPU@29C PLL@26.5C Tdiode@32.25C PMIC@100C Tboard@32C thermal@30C
RAM 717/3983MB (lfb 663x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,0%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35C CPU@30.5C GPU@28.5C PLL@26C Tdiode@32C PMIC@100C Tboard@32C thermal@29.5C
RAM 717/3983MB (lfb 663x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,0%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35C CPU@30.5C GPU@28.5C PLL@26C Tdiode@32.25C PMIC@100C Tboard@32C thermal@29.5C
RAM 717/3983MB (lfb 663x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,0%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@35C CPU@30C GPU@28C PLL@25.5C Tdiode@32C PMIC@100C Tboard@32C thermal@29.5C
RAM 717/3983MB (lfb 663x4MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1734,0%@1734,0%@1734,0%@1734] EMC_FREQ 0%@1600 GR3D_FREQ 0%@998 APE 25 AO@34.5C CPU@30C GPU@28C PLL@25.5C Tdiode@32C PMIC@100C Tboard@31C thermal@29.25C
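
The per-core load can be pulled out of such a log with a few lines of Python. This is a minimal sketch that assumes the `CPU [load%@MHz,...]` field format seen in the output above; cpu_loads is a hypothetical helper, and other tegrastats versions may print the field differently.

```python
from __future__ import print_function

import re


def cpu_loads(line):
    """Return the per-core load percentages found in one tegrastats line."""
    match = re.search(r"CPU \[([^\]]*)\]", line)
    if match is None:
        return []
    loads = []
    for core in match.group(1).split(","):
        percent = core.split("%")[0]
        # Skip entries that are not a plain number (an assumption about
        # how other tegrastats versions might report a core).
        if percent.isdigit():
            loads.append(int(percent))
    return loads


# One line copied from the log above, shortened to the relevant fields
sample = "RAM 804/3983MB CPU [0%@1734,0%@1734,0%@1734,100%@1734] EMC_FREQ 1%@1600"
loads = cpu_loads(sample)
print(loads)  # -> [0, 0, 0, 100]
print("busy cores:", sum(1 for load in loads if load >= 90))  # -> busy cores: 1
```

For the run logged here, this shows a single core saturated while the other three stay idle, with all cores clocked at 1734 MHz.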

Thanks,
Koby

koby.aizer,

It looks like one CPU core is being used up to 100%… I think this may be a limitation of the app if it is still running on a single thread.
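
To see how many threads OpenCV itself is allowed to use, one could query cv2.getNumThreads() and cv2.getNumberOfCPUs(). The snippet below is only a sketch: thread_report is a hypothetical helper, and a thread count above 1 does not guarantee that a particular function such as medianBlur or findContours is parallelised in a given build.

```python
from __future__ import print_function


def thread_report(cv):
    """Return (threads OpenCV may use, CPUs OpenCV sees) for a cv2-like module."""
    return cv.getNumThreads(), cv.getNumberOfCPUs()


if __name__ == "__main__":
    try:
        import cv2
    except ImportError:
        cv2 = None  # nothing to report without OpenCV installed
    if cv2 is not None:
        threads, cpus = thread_report(cv2)
        print("OpenCV threads:", threads, "CPUs seen:", cpus)
```

Note also that in both benchmarks in this thread each iteration takes the previous iteration's image as input, so the loop itself cannot be spread across cores by the caller.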

Hi Wayne,

Yes, this app is CPU bound, so it consumes 100%. The question is whether it makes sense that a regular CPU can process 3 times more frames (or, in Tal’s example, 20 times more frames).

Just to make sure I understand your answer correctly: does it mean that the TX1 CPU performance is simply very different from that of a regular CPU, and that there isn’t some configuration we’re missing?

Thanks,
Koby

You could try building your own version of OpenCV and retesting. You may find some good links for this here.