Deploying a Caffe model on Jetson TX2 with TensorRT

Hello everyone,
I have a TensorRT engine built from a Caffe model. Can I deploy this engine directly on a Jetson TX2, or do I have to make any modifications?

Hi shubham7494, the TensorRT engine should be built on the TX2 itself, because TensorRT performs GPU-specific profiling and optimizations during the build. But yes, once the TensorRT CUDA engine is built for your caffemodel, you can deploy it at runtime without further modification.

See here for a TensorRT code example which runs on Jetson TX1/TX2:

Hello dusty_nv,
Do I have to copy the caffemodel to the TX2 and build the engine there?
Also, does the TX2 support the Python API for building the engine from a caffemodel, or do I have to use the C++ API?
Thank you.

Hi shubh, you’ll need to copy the caffemodel to a TX2 and build the engine there.
You can then save the engine to your own file and load it again later, which saves build time, or copy it to run on other TX2s.

Jetson/ARM does not currently support the TensorRT Python API, so for now this has to be done through the C++ API.
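
For reference, saving the engine to a file and loading it again later could look roughly like the sketch below. It only covers the file I/O part and uses no TensorRT headers; the helper names saveEngineBlob and loadEngineBlob are made up for illustration and are not part of TensorRT.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: write a serialized engine blob to disk.
// Assumption: with the TensorRT 3.x C++ API, `data` and `size` would come
// from the IHostMemory object returned by ICudaEngine::serialize().
void saveEngineBlob(const std::string& path, const void* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out)
        throw std::runtime_error("cannot open " + path + " for writing");
    out.write(static_cast<const char*>(data), static_cast<std::streamsize>(size));
    if (!out)
        throw std::runtime_error("failed to write " + path);
}

// Hypothetical helper: read the whole blob back into memory.
std::vector<char> loadEngineBlob(const std::string& path)
{
    // Open at the end of the file so tellg() reports the file size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in)
        throw std::runtime_error("cannot open " + path + " for reading");
    const std::streamoff size = in.tellg();
    if (size < 0)
        throw std::runtime_error("cannot determine size of " + path);
    in.seekg(0, std::ios::beg);
    std::vector<char> blob(static_cast<std::size_t>(size));
    if (size > 0 && !in.read(blob.data(), static_cast<std::streamsize>(size)))
        throw std::runtime_error("failed to read " + path);
    return blob;
}
```

In a TensorRT 3.x application, the bytes passed to saveEngineBlob would presumably be the data() and size() of the serialized engine, and the vector returned by loadEngineBlob would be handed to IRuntime::deserializeCudaEngine(blob.data(), blob.size(), nullptr). Treat that wiring as an assumption and check it against the headers of your TensorRT version.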

Thank you dusty_nv.

Hi dusty_nv,
When I try to build the engine with the TensorRT C++ API, I get the following error at the buildCudaEngine call:

*** Error in `./sample_CIFAR10_debug’: free(): invalid next size (fast): 0x00002afdef67ca20 ***
Aborted (core dumped)
make: *** [test_debug] Error 134

Please help.
(For the same caffemodel and prototxt file, I am able to build the engine using the Python API.)


It looks like something is handled incorrectly in your application.
We recommend starting from one of our standard samples and modifying it step by step to isolate the problem.




Hi AastaLLL,
I am working only with the sample programs; even sampleInt8 gives me the same error.


INT8 is only available on the 6.1 GPU architecture (compute capability 6.1), not on the TX2, which is 6.2.
Do you see this error in other samples?



Yes, I get the same error with sampleGoogleNet too. Also, if I use another dataset and model and modify the sampleMNIST file accordingly, it still gives the same error.


The giexec command line gives the same error too.

shubham@pas-lab-server5:~/TensorRT-3.0.2/bin$ ./giexec --deploy=/home/shubham/TensorRT-3.0.2/data/mnist/mnist.prototxt --model=/home/shubham/TensorRT-3.0.2/data/mnist/mnist.caffemodel --output=prob --half2 --engine=/home/shubham/TensorRT-3.0.2/new.engine
deploy: /home/shubham/TensorRT-3.0.2/data/mnist/mnist.prototxt
model: /home/shubham/TensorRT-3.0.2/data/mnist/mnist.caffemodel
output: prob
engine: /home/shubham/TensorRT-3.0.2/new.engine
Input “data”: 1x28x28
Output “prob”: 10x1x1
Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
*** Error in `./giexec’: free(): invalid next size (fast): 0x00007f5f2f3b1660 ***
Aborted (core dumped)


Looks like you are not on a Jetson platform.
Could you share the deviceQuery information with us first?

/usr/local/cuda-9.0/bin/ .
cd NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery


Yeah, I am not working on the TX2 yet.
This is the result of ./deviceQuery:

shubham@pas-lab-server5:~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “Tesla K20Xm”
CUDA Driver Version / Runtime Version 9.0 / 8.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 5700 MBytes (5976424448 bytes)
(14) Multiprocessors, (192) CUDA Cores/MP: 2688 CUDA Cores
GPU Max Clock rate: 732 MHz (0.73 GHz)
Memory Clock rate: 2600 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 132 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla K20Xm
Result = PASS


Your CUDA driver and CUDA runtime are different versions:
>> CUDA Driver Version / Runtime Version 9.0 / 8.0

Please set up your environment with matching CUDA versions first:

For CUDA 8.0, it should be
>> CUDA Driver Version / Runtime Version 8.0 / 8.0

TensorRT 3.0.2 for Ubuntu 1604 and CUDA 8.0 DEB local repo packages

For CUDA 9.0, it should be
>> CUDA Driver Version / Runtime Version 9.0 / 9.0

TensorRT 3.0.2 for Ubuntu 1604 and CUDA 9.0 DEB local repo packages
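
If you want to check for this mismatch from code rather than by reading the deviceQuery output, a minimal sketch could look like the one below. It assumes the two integers come from cudaDriverGetVersion() and cudaRuntimeGetVersion(), which report versions encoded as 1000 * major + 10 * minor (for example 9000 for 9.0). They are plain parameters here so the sketch compiles without the CUDA toolkit, and the function names formatCudaVersion and sameCudaVersion are made up for illustration.

```cpp
#include <cassert>
#include <string>

// Turn an encoded CUDA version (1000 * major + 10 * minor) into "major.minor",
// the same form deviceQuery prints, e.g. 9000 -> "9.0".
std::string formatCudaVersion(int version)
{
    assert(version >= 0);
    const int major = version / 1000;
    const int minor = (version % 1000) / 10;
    return std::to_string(major) + "." + std::to_string(minor);
}

// True when driver and runtime report the same major.minor version,
// which is the condition the reply above asks for.
// Assumption: in a real program the arguments would be filled in by
// cudaDriverGetVersion(&driverVersion) and cudaRuntimeGetVersion(&runtimeVersion).
bool sameCudaVersion(int driverVersion, int runtimeVersion)
{
    return formatCudaVersion(driverVersion) == formatCudaVersion(runtimeVersion);
}
```

With the values from the deviceQuery output in this thread, sameCudaVersion(9000, 8000) returns false, while a matching install such as sameCudaVersion(9000, 9000) returns true.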