TensorRT support for Caffe model layers

Does TensorRT support the slice, PReLU, batchnorm, eltwise, … layers of a Caffe model?
Could we get the TensorRT source and add the layers ourselves?

Thank you,


Thanks for your question.

TensorRT supports the following layer types:

Convolution: 2D
Activation: ReLU, tanh and sigmoid
Pooling: max and average
ElementWise: sum, product or max of two tensors
LRN: cross-channel only
Fully-connected: with or without bias
SoftMax: cross-channel only

Read more at: https://devblogs.nvidia.com/parallelforall/production-deep-learning-nvidia-gpu-inference-engine/

TensorRT does not support custom layers. However, you can add your own layer in the TensorRT flow.
For example

IExecutionContext *contextA = engineA->createExecutionContext();
IExecutionContext *contextB = engineB->createExecutionContext();
contextA->enqueue(batchSize, buffersA, stream, nullptr);
myLayer(outputFromA, inputToB, stream);
contextB->enqueue(batchSize, buffersB, stream, nullptr);

Hi AastaLLL,

Thank you for your prompt reply. We will test it.

Hi AastaLLL,

Thank you for your reply. Do I add the ROI-Pooling layer in the giexec.cpp file at /usr/src/gie_samples/samples/giexec ? Also, is there an example on faster RCNN that I can learn from?

I tried to refer to https://devtalk.nvidia.com/default/topic/990426/?comment=5114680 but it is based on YOLO.

Lastly, do I have to add anything to the ConvertCaffeToGieModel_main.cpp file?

Thank you

Does NVIDIA intend to release a new version of TensorRT that supports more Caffe layers, such as BN and PReLU? If so, when? Thanks.

Hi leoncss92,

You can refer to our TensorRT samples, which are located at ‘/usr/src/gie_samples/’.

For example,
split your network into: input -> networkA -> networkSelf -> networkB -> output

NetworkA and networkB can run inference directly via TensorRT.
NetworkSelf needs to be implemented in CUDA.

So, the flow will be:

IExecutionContext *contextA = engineA->createExecutionContext(); // create the context for networkA
IExecutionContext *contextB = engineB->createExecutionContext(); // create the context for networkB
contextA->enqueue(batchSize, buffersA, stream, nullptr);  // run inference on networkA
myLayer(outputFromA, inputToB, stream);                   // run networkSelf: your CUDA code goes here
contextB->enqueue(batchSize, buffersB, stream, nullptr);  // run inference on networkB
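
As a minimal sketch of what the networkSelf stage might compute, assume the unsupported layer is a Caffe-style per-channel PReLU (one of the layers asked about earlier in this thread). The function name preluReference and its parameters are hypothetical and not part of TensorRT. This is a CPU reference only; a real myLayer would run the same math as a CUDA kernel on the stream, reading networkA's output buffer and writing networkB's input buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for a per-channel PReLU over an NCHW float tensor:
//   y = x             if x > 0
//   y = slope[c] * x  otherwise
// Hypothetical helper for illustration; not a TensorRT API.
std::vector<float> preluReference(const std::vector<float>& input,
                                  const std::vector<float>& slopes,
                                  std::size_t batch, std::size_t channels,
                                  std::size_t height, std::size_t width)
{
    // One slope per channel, and the buffer must match the NCHW shape.
    assert(slopes.size() == channels);
    assert(input.size() == batch * channels * height * width);

    std::vector<float> output(input.size());
    const std::size_t plane = height * width;
    for (std::size_t n = 0; n < batch; ++n) {
        for (std::size_t c = 0; c < channels; ++c) {
            // Offset of the first element of channel c in image n.
            const std::size_t base = (n * channels + c) * plane;
            for (std::size_t i = 0; i < plane; ++i) {
                const float x = input[base + i];
                output[base + i] = x > 0.0f ? x : slopes[c] * x;
            }
        }
    }
    return output;
}
```

Comparing a CUDA kernel's output against a CPU reference like this on a few small tensors is one way to check the custom stage before wiring it between the two engines.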


Hi maoxiuping,

Faster R-CNN and ResNet are in our implementation plan, but we can’t disclose the schedule.
Please watch for our announcements and updates.


Hi AastaLLL,
Thanks for your reply. I want to use TensorRT for scene classification work, but I can only get TensorRT 1.0 from the NVIDIA official website. How can I get 2.0?


TensorRT 2.0 is not for the Jetson platform; it was only released to desktop GPU users.
Currently, the latest TensorRT for Jetson is v1.0. For a newer version, please wait for an announcement.


Hi AastaLLL,

Thanks very much for your reply. I have just downloaded and installed TensorRT 1.0. My GPU is a GTX 1080 with CUDA 8.0, and I do not have a Jetson platform. Can I use TensorRT 1.0 on the GTX 1080?

I have also read the sampleGoogleNet sample for TensorRT 1.0. I can run the sample and display the per-layer time, but the code runs inference with null data to time network performance. How can I test my own picture?

By the way, can I download TensorRT 2.0? Where can I get it?



Make sure you download the desktop version of TensorRT, which supports GTX* GPUs.
The TensorRT 2.0 early access is over. Please wait for our next release (v3.0).

For inference, we have a sample that demonstrates how to use TensorRT:

(Note: you need to add your GPU architecture to the CMakeLists.txt, since it only contains the Jetson GPU.)
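
On the question of testing your own picture: a Caffe-parsed network takes a planar CHW float buffer as input, while decoded images are usually interleaved HWC bytes. The sketch below shows that layout conversion with per-channel mean subtraction. The function name hwcToChw is hypothetical, and the channel order (BGR or RGB), the mean values, and any scaling are assumptions that must match how your model was trained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert an interleaved HWC 8-bit image into a planar CHW float buffer,
// subtracting a per-channel mean. Hypothetical helper for illustration;
// not part of TensorRT or its samples.
std::vector<float> hwcToChw(const std::vector<std::uint8_t>& pixels,
                            std::size_t height, std::size_t width,
                            std::size_t channels,
                            const std::vector<float>& channelMean)
{
    assert(pixels.size() == height * width * channels);
    assert(channelMean.size() == channels);

    std::vector<float> planar(pixels.size());
    for (std::size_t c = 0; c < channels; ++c) {
        for (std::size_t y = 0; y < height; ++y) {
            for (std::size_t x = 0; x < width; ++x) {
                // Source index: pixel (y, x), channel c, interleaved.
                const std::size_t src = (y * width + x) * channels + c;
                // Destination index: plane c, row y, column x.
                const std::size_t dst = (c * height + y) * width + x;
                planar[dst] = static_cast<float>(pixels[src]) - channelMean[c];
            }
        }
    }
    return planar;
}
```

The resulting host buffer would then be copied into the engine's input device buffer (for example with cudaMemcpy) in place of the null data, before the inference call.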

By the way, this is the embedded board. You can get better support if you post your question on the correct board.


I tried optimizing a network by splitting it into 2 parts, but my code fails when building the second engine. However, when optimizing the network as a whole, I do not encounter any problems.

Specifically, it fails at this particular call (within the caffetoGIEModel call for the 2nd engine) which does not return anything and the code just stops.

const IBlobNameToTensor* blobNameToTensor = parser->parse(locateFile(deployFile).c_str(), locateFile(modelFile).c_str(), *network, DataType::kFLOAT);

I switched the order of the prototxts, used different .caffemodel files, first serialized both and deserialized both, first serialized and deserialized one and then the other, tried literally copy pasting what I do for the first engine to avoid any unwanted communication between the 2 engines, …, but the issue remains.

Any clues as to what might be going wrong? Thanks!

Hi qww,

Could you share the error log so we can check it? (Please enable verbose output.)

Which mode do you use: float or half (FP16)?


Hi AastaLLL,

Here is the error log, but unfortunately it doesn’t really help. I included print statements of my own as well, to show where we are in the code.

We are using float mode.

creating A
1CaffetoGIE: called correctly
2CaffetoGIE: builder created
3CaffetoGIE: network defined
4CaffetoGIE: network parsed
5CaffetoGIE: blob converted to tensor
6CaffetoGIE: right before setting builder size 
7CaffetoGIE: right after setting builder size 
Original: 5 layers
After dead-layer removal: 5 layers
After scale fusion: 5 layers
After conv-act fusion: 5 layers
After tensor merging: 5 layers
After concat removal: 5 layers
Region scale: NCHW_F32
Region conv1: NCHW_F32
Region pool1: NCHW_F32
Region conv2: NCHW_F32
Region data: NCHW_F32
Region scale: NCHW_F32
Region conv1: NCHW_F32
Region pool1: NCHW_F32
Region conv2: NCHW_F32
Region pool2: NCHW_F32

Node scale: NCHW_F32
Node conv1: NCHW_F32
Node pool1: NCHW_F32
Node conv2: NCHW_F32
Node pool2: NCHW_F32

After reformat layers: 5 layers
Block size 1048576
Block size 46080
Block size 11520
Total Activation Memory: 1106176

--------------- Timing scale(10)
Tactic 0 is the only option, timing skipped

--------------- Timing conv1(3)
Tactic 0 time 0.017408

--------------- Timing conv1(2)
Tactic 5 time 0.018432
Tactic 18 time 0.024096
Tactic 23 time 0.024576
Tactic 72 time 0.034528
Tactic 73 time 0.018304
Tactic 77 time 0.017056
Tactic 99 time 0.01024
Tactic 100 time 0.017216
Tactic 141 time 0.011264
Tactic 142 time 0.01488
Tactic 147 time 0.012288

--------------- Timing conv1(1)
Tactic 0 time 0.05904
Tactic 1 time 0.028608
Tactic 2 time 0.036864
Tactic 4 time 0.071296
Tactic 5 time 0.131744
--------------- Chose 2 (99)

--------------- Timing pool1(8)
Tactic 5505281 time 0.012288
Tactic 5570817 time 0.007168
Tactic 5636353 time 0.006656
Tactic 5701889 time 0.00784
Tactic 5767425 time 0.007712
Tactic 5832961 time 0.007808
Tactic 5898497 time 0.007744
Tactic 5964033 time 0.008192
Tactic 6029569 time 0.006112
Tactic 6095105 time 0.006144
Tactic 6160641 time 0.006144
Tactic 6226177 time 0.007072
Tactic 6291713 time 0.007168
Tactic 6357249 time 0.007168
Tactic 6422785 time 0.008192
Tactic 6488321 time 0.00768

--------------- Timing conv2(3)
Tactic 0 time 0.060416

--------------- Timing conv2(2)
Tactic 5 time 0.027488
Tactic 13 time 0.036864
Tactic 16 time 0.054176
Tactic 18 time 0.070528
Tactic 23 time 0.0512
Tactic 57 time 0.018432
Tactic 58 time 0.02048
Tactic 62 time 0.027648
Tactic 67 time 0.0256
Tactic 71 time 0.027232
Tactic 73 time 0.033792
Tactic 77 time 0.03264
Tactic 80 time 0.041216
Tactic 92 time 0.025312
Tactic 99 time 0.026624
Tactic 100 time 0.025088
Tactic 113 time 0.024576
Tactic 116 time 0.022528
Tactic 120 time 0.018368
Tactic 133 time 0.024256
Tactic 140 time 0.019456
Tactic 141 time 0.021504
Tactic 142 time 0.028416
Tactic 146 time 0.016384
Tactic 147 time 0.048128
Tactic 148 time 0.021504
Tactic 154 time 0.023552
Tactic 161 time 0.018432
Tactic 165 time 0.03072

--------------- Timing conv2(1)
Tactic 0 time 0.086656
Tactic 1 time 0.08192
Tactic 2 time 0.099232
Tactic 4 scratch requested: 2450080, available: 1048576
Tactic 5 scratch requested: 4681216, available: 1048576
--------------- Chose 2 (146)

--------------- Timing pool2(8)
Tactic 5505281 time 0.009216
Tactic 5570817 time 0.007168
Tactic 5636353 time 0.006144
Tactic 5701889 time 0.006976
Tactic 5767425 time 0.007072
Tactic 5832961 time 0.008192
Tactic 5898497 time 0.008192
Tactic 5964033 time 0.008896
Tactic 6029569 time 0.006144
Tactic 6095105 time 0.006144
Tactic 6160641 time 0.006784
Tactic 6226177 time 0.006752
Tactic 6291713 time 0.00768
Tactic 6357249 time 0.007872
Tactic 6422785 time 0.008704
Tactic 6488321 time 0.008192
created A
creating B
2-1CaffetoGIE: called correctly
2-2CaffetoGIE: builder created
2-3CaffetoGIE: network defined
2-4CaffetoGIE: network parsed



Could you try TensorRT 2.1 first, since we have fixed several issues?
Please install via JetPack 3.1:


We are already using TensorRT 2.1. We want to implement this approach instead of a plugin approach because it should be much easier; it is very unclear how to actually define what the plugins should do in TensorRT at runtime.


There is a sampleFasterRCNN you could refer to in /usr/src/tensorrt/samples after installing the JetPack 3.1 package pointed out by AastaLLL. Thanks.

Hi Chijen,

We already looked at this example but all of the work happens in the obscure


function, of which we are not provided with the source code. Therefore, it is very unclear how to actually make plugins work because the examples don’t explain the majority of the work.



For plugin details, please check the FCPlugin class in samplePlugin.

Currently, I am trying to run inference on the SSD model (https://github.com/weiliu89/caffe/tree/ssd) using TensorRT, but several layers are missing, such as “PriorBox” and “Permute”. Do you have any plans to fix this? Or can you provide more flexible APIs so I can implement more types of layers?