What optimization solutions/tools are available for my Python code?


I am trying to use an open-source framework (OpenFace), which uses computer vision and neural nets, on the Jetson TX2. I wrote Python code that makes use of this framework. My current implementation needs between 3.679 sec and 4.805 sec to produce a result on my laptop’s CPU, while on the Jetson the same code needs around 43 seconds!

I tried following the steps suggested at this link, i.e. using Anaconda: https://developer.nvidia.com/how-to-cuda-python . But it seems the Jetson’s architecture isn’t supported. When running the .sh file I downloaded from Anaconda’s website, I get:

"cannot execute binary file: Exec format error. ERROR: cannot execute native linux-64 binary, output from ‘uname -a’ is:

Linux tegra-ubuntu 4.4.15-tegra #1 SMP PREEMPT Wed Feb 8 18:06:32 PST 2017 aarch64 GNU/Linux"

The issue stays the same whether I use the 32-bit or the 64-bit installer.

I heard there are CUDA compilers/profilers that allow some sort of significant “automatic” optimization without my having to rewrite my code, if I understood correctly.

What solutions are there for speeding up my Python code without necessarily having to rewrite everything?

On a side note, I am using Python 2.7.12.


No, nvcc does not “automatically” re-write your code for GPUs.

Python is popular because it’s easy-ish to learn, and easy-ish to develop smaller programs in, and it has a lot of features already built in.

However, Python is very slow on things that actually require heavy number crunching. This is why libraries like OpenCV and NumPy use extension modules built in C or C++ that expose a Python API. You call their functions from Python, but the heavy number crunching happens in compiled code. Thus, the first step to improving your performance would be to write the inner core of your algorithm in C. You can easily get a 10x speed-up in program execution time that way (more, if the algorithm is somehow pessimal in Python.)
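A minimal sketch of that idea, using only the standard library so it runs anywhere: the same dot product written once as an interpreted loop and once so that the loop is driven by C-implemented built-ins. The function names are made up for illustration, and the actual speed-up depends on the data and the machine. `numpy.dot` takes the same idea further by also storing the data in a compact C array.

```python
import operator
import timeit


def dot_python(a, b):
    """Dot product where every multiply-add is executed by the interpreter."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_builtin(a, b):
    """Same result, but sum() and operator.mul do the looping in C.

    On Python 2, itertools.imap would avoid building an intermediate list.
    """
    return sum(map(operator.mul, a, b))


if __name__ == "__main__":
    a = [float(i) for i in range(100000)]
    b = [float(i % 7) for i in range(100000)]
    print("interpreted loop:", timeit.timeit(lambda: dot_python(a, b), number=20))
    print("C built-ins:     ", timeit.timeit(lambda: dot_builtin(a, b), number=20))
```

Timing the two on the target board, rather than on the laptop, shows how much of the run time is interpreter overhead there.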

Also, Python threads do not run in parallel. A Python program cannot spread its computation across all six cores on the TX2 using threads; it will end up computing on a single core. You can use threads to overlap I/O, but not for computation, because of the global interpreter lock (“GIL”) and the way the Python data structures are implemented internally. Python – easy to learn, very slow, breaks down if you try to build really large applications in it.

Then, there’s the matter of using GPUs. The nvcc tool does let you build code that runs on the GPU, and for the right kind of algorithm, implemented correctly, that code can be very fast. You can write CUDA code, compile it with nvcc, and then expose it to Python by building a shim C library. If you have proper parallelism, you can get 100x speed-up this way. But it requires that you know all three systems: CUDA/nvcc, C plugins, and Python.


First of all, thanks a lot for your answer.

The issue is that the framework’s interface is in Python (unless I missed something?). Would that mean the solution is to entirely rewrite their interface/framework (which communicates with Dlib, OpenCV and Torch) in C/C++? My current draft Python code (https://pastebin.com/DqbkhJNW) is, to some extent, very close to this script they made available: https://github.com/cmusatyalab/openface/blob/master/demos/classifier.py

I perform the same algorithm on a series of independent pictures (e.g. 20) all at once (e.g. on an interrupt basis), so I was considering using some parallel solution, like CUDA, to do those 20 operations in parallel. That way, rather than waiting 20*time_to_process_1_pic, I’d only have to wait 1/20th of this time.
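For comparison, a CPU-side version of this "same operation on independent pictures" idea can be sketched with Python’s multiprocessing module, where each worker is a separate process with its own interpreter and GIL. This is only a sketch: `process_image` is a hypothetical stand-in for the per-picture work, the file names are made up, and the best case is a speed-up of roughly the number of CPU cores (six on the TX2, per the answer above), not 20x. Each worker would also need its own copy of the loaded models, which costs memory.

```python
from multiprocessing import Pool


def process_image(path):
    """Hypothetical stand-in for the per-picture work.

    In the real script this would align the face, run the network and
    classify; here it only returns the path and its length so the sketch
    runs without OpenFace installed.
    """
    return (path, len(path))


def process_batch(paths, workers=4):
    """Apply process_image to every path, in worker processes if workers > 1."""
    if workers <= 1:
        return [process_image(p) for p in paths]
    pool = Pool(processes=workers)
    try:
        # map() returns results in the same order as the input paths.
        return pool.map(process_image, paths)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    paths = ["pic_%02d.jpg" % i for i in range(20)]
    for path, result in process_batch(paths, workers=4):
        print(path, result)
```

Running it once with `workers=1` and once with a higher count gives a quick measurement of how much the batch actually gains from extra processes.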

Would it be possible for you to share your views on my planned approach?

Looking forward to hearing from you.