What optimization solutions/tools are available for my Python code?

Hello

I am trying to use an open-source framework (OpenFace), which uses computer vision and neural nets, on the Jetson TX2. I wrote Python code that makes use of this framework. My current implementation needs between 3.679 s and 4.805 s per result on my laptop’s CPU, while the same code needs around 43 seconds on the Jetson!

I tried following the steps suggested at this link, i.e. using Anaconda: https://developer.nvidia.com/how-to-cuda-python . But it seems the Jetson’s architecture isn’t supported. When running the .sh file I downloaded from Anaconda’s website, I get:

"cannot execute binary file: Exec format error. ERROR: cannot execute native linux-64 binary, output from ‘uname -a’ is:

Linux tegra-ubuntu 4.4.15-tegra #1 SMP PREEMPT Wed Feb 8 18:06:32 PST 2017 aarch64 GNU/Linux"

It makes no difference whether I use the 32-bit or the 64-bit installer; the issue stays the same.

I heard there are some CUDA compilers/profilers available that allow some sort of significant “automatic” optimization without my having to rewrite the code, if I understood correctly.

What options are there for speeding up my Python code without necessarily having to rewrite everything?

As a side note, I am using Python 2.7.12.

Thanks!

No, nvcc does not “automatically” re-write your code for GPUs.

Python is popular because it’s easy-ish to learn, easy-ish to develop smaller programs in, and it has a lot of features built in.

However, Python is very slow at things that actually require heavy number crunching. This is why libraries like OpenCV and NumPy use plug-ins written in C that expose a Python API: you call their functions from Python, but the heavy number crunching happens in C. Thus, the first step to improving your performance would be to write the inner core of your algorithm in C. You can easily get a 10x speed-up in execution time that way (more, if the algorithm is somehow pessimal in Python).
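
As a tiny illustration (nothing OpenFace-specific, just a sum of squares), compare the same arithmetic done in an interpreted Python loop versus handed to NumPy’s compiled C code:

```python
import time
import numpy as np

N = 1000000

# Pure Python: every multiply and add goes through the interpreter.
t0 = time.time()
total = 0
for x in range(N):
    total += x * x
print("pure Python loop:        %.3f s" % (time.time() - t0))

# NumPy: the same sum of squares runs in compiled C under the hood.
arr = np.arange(N, dtype=np.int64)
t0 = time.time()
total_np = int(np.dot(arr, arr))
print("NumPy (C under the hood): %.3f s" % (time.time() - t0))
```

On typical hardware the NumPy version is one to two orders of magnitude faster; your exact numbers will vary.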

Also, Python is not truly multi-threaded for computation. It cannot spread computation across all six cores on the TX2; it will end up computing on a single core. You can use threads to make I/O asynchronous, but not to parallelize computation, because of the global interpreter lock (“GIL”) and the way Python’s data structures are implemented internally. Python: easy to learn, very slow, and it breaks down if you try to build really large applications in it.
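
If you want to see the GIL in action, a throwaway experiment like this makes the point; the exact timings depend on the machine, but for CPU-bound work the threaded version will not be meaningfully faster than the serial one:

```python
import threading
import time

def burn(n):
    # CPU-bound loop; the thread running it holds the GIL the whole time.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 5000000

t0 = time.time()
burn(N)
burn(N)
print("serial, one core: %.2f s" % (time.time() - t0))

t0 = time.time()
workers = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print("two threads:      %.2f s  (no real speed-up: the GIL serializes them)" % (time.time() - t0))
```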

Then there’s the matter of using GPUs. The nvcc tool does let you build code that runs on the GPU, but only for the right kind of algorithm, and only if it’s implemented correctly. You can write CUDA code, compile it with nvcc, and then expose it to Python by building a shim C library. If your problem has the right kind of parallelism, you can get a 100x speed-up this way, but it requires that you know all three layers: CUDA/nvcc, C plug-ins, and Python.
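
The Python side of such a shim is the easy part; the hard part is the CUDA/C code behind it. Here is a minimal sketch of the Python end, assuming a hypothetical libgpu_ops.so with a process_batch() function (the library name, function, and signature are all made up for illustration; you would design and build the shim yourself with nvcc):

```python
import ctypes
import numpy as np

# Hypothetical shim: you would write process_batch() yourself in a .cu file
# (copy data to the GPU, launch a kernel, copy results back) and build it
# with something like:
#   nvcc -shared -Xcompiler -fPIC gpu_ops.cu -o libgpu_ops.so
lib = ctypes.CDLL("./libgpu_ops.so")
lib.process_batch.argtypes = [
    ctypes.POINTER(ctypes.c_float),  # input pixels, flattened
    ctypes.POINTER(ctypes.c_float),  # output buffer
    ctypes.c_int,                    # number of images in the batch
]
lib.process_batch.restype = None

# Example: 20 images of 96x96 floats packed into one contiguous batch.
batch = np.random.rand(20, 96 * 96).astype(np.float32)
out = np.empty_like(batch)

lib.process_batch(
    batch.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
    out.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
    ctypes.c_int(batch.shape[0]),
)
```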

Hi,

First of all thanks a lot for your answer.

The thing is, the framework’s interface is in Python (unless I missed something?). Would that mean the solution is to entirely rewrite their interface/framework (which talks to dlib, OpenCV and Torch) in C/C++? My current draft Python code (https://pastebin.com/DqbkhJNW) is, to some extent, very close to this script they made available: https://github.com/cmusatyalab/openface/blob/master/demos/classifier.py

Since I run the same algorithm on a series of independent pictures (e.g. 20) all at once (e.g. on an interrupt basis), I was considering using some parallel solution, like CUDA, to do those 20 operations in parallel. That way, rather than waiting 20 * time_to_process_1_pic, I would only have to wait roughly 1/20th of that time.
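
To make my plan concrete, here is a rough CPU-only sketch of the “20 at once” idea using processes instead of threads (to get around the GIL you mentioned); classify_image() is just a stand-in for my per-picture processing, not actual OpenFace code:

```python
from multiprocessing import Pool

def classify_image(path):
    # Placeholder for what I do per picture today
    # (roughly the inference step of OpenFace's classifier.py demo).
    return path, "unknown"

if __name__ == "__main__":
    paths = ["img_%02d.png" % i for i in range(20)]
    # One worker process per core sidesteps the GIL; with 6 cores on the
    # TX2 the 20 images overlap about 6 at a time instead of running
    # back to back.
    pool = Pool(processes=6)
    results = pool.map(classify_image, paths)
    pool.close()
    pool.join()
```

I realize this only spreads the work across the six CPU cores; the CUDA route is what I was hoping would go further than that.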

Would it be possible for you to share your views on my planned approach?

Looking forward to hearing from you