Beginner Question (Xavier/TX2/Nano): Python, PyCuda, TensorRT memory allocations vs C++

Hi everyone,

I am very new to machine learning and GPU programming. I am developing an application that uses pre-trained models (.caffemodel, .prototxt, .uff) that I would like to optimize and run in real time using TensorRT. I am getting confused while trying to determine the best way to develop this application.

When I look at the TensorRT examples from NVIDIA, I see that it is possible to use TensorRT from Python with libraries like PyCUDA. However, when I look into the “Hello AI World” tutorial, I see that the author writes his own C++ CUDA programs and wraps them to make them available to Python through an API.

For a professional application, what is the best/recommended approach: developing the whole application in C++, developing partially in C++ with the higher-level network connections in Python, or developing only in Python?

What are good resources for getting started not only with machine learning, but also with building deployable applications for the Jetson platform?

On the Jetson platform the GPU shares memory with the CPU, so how does PyCUDA allocate memory on the device if the memory is shared?

I know these are a lot of questions, and the answers may vary depending on the application, but I would greatly appreciate any help or references I can use to learn the best way to create deployable machine learning applications.


Please note that all GPU jobs need to be implemented with CUDA.
We provide many libraries with C++ APIs, which call our own CUDA implementations.

For your use case, the C++ interface is recommended, since the Python API is another wrapper around our C++ library and may introduce a small amount of extra latency.
You can start the application from our jetson_inference sample:
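To give a feel for the C++ interface, here is a minimal classification sketch in the style of the jetson-inference samples. Note that the exact headers, class names (`imageNet`, `loadImage`), and function signatures are assumptions based on that repository and have changed between releases, so check the version you build against:

```cpp
// Hedged sketch of a jetson-inference style classifier in C++.
// Assumes the jetson-inference and jetson-utils libraries are installed;
// API names and signatures may differ in your release of the repo.
#include <jetson-inference/imageNet.h>
#include <jetson-utils/loadImage.h>
#include <cstdio>

int main(int argc, char** argv)
{
    // Create (and, on first run, build/cache a TensorRT engine for)
    // a pre-trained classification network
    imageNet* net = imageNet::Create(argc, argv);
    if (!net)
        return 1;

    // Load the input image into CUDA-accessible memory
    // (signature varies across jetson-utils versions)
    uchar3* image = nullptr;
    int width = 0, height = 0;
    if (!loadImage("input.jpg", &image, &width, &height))
        return 1;

    // Run TensorRT inference and print the top class
    float confidence = 0.0f;
    const int classID = net->Classify(image, width, height, &confidence);
    if (classID >= 0)
        printf("class %d, confidence %.3f\n", classID, confidence);

    delete net;
    return 0;
}
```

The same network can then be called from Python through the project's bindings if you want a mixed C++/Python design.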

Jetson uses a virtual memory system.
In general, the CPU and GPU each have their own memory addresses, but both map to the same physical memory.
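The mapping above is easiest to see with CUDA's mapped page-locked ("zero-copy") allocations. This is a minimal sketch using standard CUDA runtime calls; on Jetson the two pointers it prints are different virtual addresses backed by the same physical DRAM:

```cpp
// Minimal zero-copy sketch (requires the CUDA toolkit to build with nvcc).
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Allow the GPU to map page-locked host allocations into its address space
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float* hostPtr = nullptr;
    float* devPtr  = nullptr;

    // One physical allocation, pinned and mapped for both processors
    cudaHostAlloc(&hostPtr, 1024 * sizeof(float), cudaHostAllocMapped);

    // Ask CUDA for the device-side virtual address of the same memory
    cudaHostGetDevicePointer(&devPtr, hostPtr, 0);

    // On Jetson, a kernel reading devPtr sees what the CPU wrote through
    // hostPtr without any explicit cudaMemcpy, since the DRAM is shared.
    printf("host %p / device %p\n", (void*)hostPtr, (void*)devPtr);

    cudaFreeHost(hostPtr);
    return 0;
}
```

PyCUDA's device allocations work the same way underneath: `cuMemAlloc` returns a device virtual address, and on Jetson the physical pages simply come from the shared DRAM.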