TensorRT's nvinfer1::INetworkDefinition::addFullyConnected() does not work as expected for C3D network

Thanks a lot for your feedback.

Regarding “add a flatten layer”: it looks like the nvinfer1::INetworkDefinition class has no such API. Do I need to implement it myself with addPluginV2(), or can I just call addConvolutionNd() with kernelSize (1,1,1)?


No. A flatten layer is just a special case of a reshape (Shuffle) layer:

You can find an example in the onnx2trt source:


Thanks a lot! Following your guidance, I added a shuffle layer and used the shuffle layer’s setReshapeDimensions() to reshape the output tensor of the fc5 layer to (4096,1,1); network inference now works.

Question on two points:

  1. Compared with the original video-caffe network C++ code running on the Nano, my network implemented with the TensorRT API shows a small precision degradation (other ONNX models show a similar problem when parsed into a TensorRT engine and run with it), even though this network uses FP32, not FP16 or INT8;
  2. There is also some performance degradation. It looks like TensorRT doesn’t use cuDNN by default, so the 3D convolution performance is not as good as that of the original video-caffe network, whose 3D convolution is implemented with the cuDNN API.

Do you have any suggestions for improving the inference precision and 3D-conv performance of the network implemented with the TensorRT API? Thanks.


1. There are two possible reasons for this:

  • Different input image:
    A common issue is from the data preprocessing step.
    Please check if they are using the same data format, mean subtraction and normalization procedure.

  • Some difference between the TensorRT API and Caffe.
    It’s recommended to check whether the pooling approximation discussed above causes any precision issue.

2. TensorRT is implemented mainly on top of cuDNN.
Do you have any extra memcpy when deploying a layer with the TensorRT API?
That may cause some performance loss compared to pure cuDNN.


Thanks for your feedback. Both video-caffe and my network use the same input image data: they use the same images and share the same processInput() (see below). My inference code for this network is as follows:

vector<float> TRTC3D::infer(vector<Mat>* p_cvImgs, int nClasses)
{
    // Create RAII buffer manager object
    samplesCommon::BufferManager buffers(mEngine, mParams.batchSize);

    auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
    if (!context)
        return vector<float>();

    // Read the input data into the managed buffers
    // There should be just 1 input tensor
    assert(mParams.inputTensorNames.size() == 1);
    if (!processInput(buffers, p_cvImgs))
        return vector<float>();
#if 1
    // Create CUDA stream for the execution of this inference
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Asynchronously copy data from host input buffers to device input buffers
    buffers.copyInputToDeviceAsync(stream);

    // Asynchronously enqueue the inference work
    if (!context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), stream, nullptr))
        return vector<float>();

    // Asynchronously copy data from device output buffers to host output buffers
    buffers.copyOutputToHostAsync(stream);

    // Wait for the work in the stream to complete
    cudaStreamSynchronize(stream);

    // Release stream
    cudaStreamDestroy(stream);
#else
    // Synchronous path: copy input, execute, copy output
    buffers.copyInputToDevice();
    bool status = context->execute(mParams.batchSize, buffers.getDeviceBindings().data());
    if (!status)
        return vector<float>();
    buffers.copyOutputToHost();
#endif

    vector<float> result = processOutput(buffers, nClasses);
    return result;
}

To use cuDNN, do I need to replace the cudaxxx() APIs called above with cudnnxxx() APIs? If yes, is there a good/complete sample source? I found some code in /usr/src/tensorrt/samples/samplePlugin/fcPlugin.h, but it looks incomplete.


To find the root cause of the difference, we will need to investigate the output layer by layer.
A common issue is that some parameter is missing or not set correctly,
e.g. the coefficient in an eltwise layer.

Not sure which API you mean.
For inference, the TensorRT API is built on top of cuDNN, and it should be nvinfer1::....

Some basic APIs, like cudaStreamCreate, are used for controlling the workflow.
You don’t need to convert them into cudnn* calls.


My network, which I posted on 23/Sep, is constructed entirely with the nvinfer1:: API; it is built in this function:

bool TRTC3D::constructNetwork(
    SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser, SampleUniquePtr<nvinfer1::INetworkDefinition>& network)

and the code which call this function to build the network is as following:

    auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger()));
    if (!builder)
        return false;

    auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetwork());
    if (!network)
        return false;

    auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    if (!config)
        return false;

    auto parser = SampleUniquePtr<nvcaffeparser1::ICaffeParser>(nullptr);   

    if (!constructNetwork(parser, network))
        return false;

As for the inference code, which I posted above on 15/Oct, it is almost entirely copied from your example code: when doing inference, it first processes the input data, then enqueues it into the GPU device buffers, then executes and processes the output data.

By observing memory usage, I can see that only when a cudnnxxx() API is forcibly called in my TRTC3D::infer(), such as “cudnnCreate(&cudnn);”, does memory occupation grow by 700+ MB compared with when “cudnnCreate(&cudnn);” is commented out. So, judging from memory occupation, I guess the cuDNN library is not used by default unless a cudnnxxx() API is called.

Do you mean that when “nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger())” is called, the cuDNN library is loaded automatically?


Most of our libraries are loaded on demand.
So TensorRT does not load the cuDNN library until inference time, rather than at build time.
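(A verification idea, not from the thread: rather than inferring cuDNN loading from total memory use, you can check the process’s memory map directly. The process name bright is the app from this thread; libcudnn is the library of interest.)

```shell
# A shared library appears in /proc/<pid>/maps once the dynamic loader
# has actually mapped it. To check a running app (e.g. ./bright from
# this thread), you could use:
#   grep -o 'libcudnn[^ ]*' /proc/$(pidof bright)/maps | sort -u
#
# The same technique demonstrated on the current process with libc,
# which is always mapped:
grep -o 'libc[^ ]*' /proc/self/maps | sort -u
```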


But I have never seen memory occupation increase very much when doing inference with this network implemented with the TensorRT API. I did see about 700 MB more memory occupied when I forcibly added a cudnnxxx() call in the inference code of this network, just as video-caffe does when it calls cudnnxxx() in its convolution layers.
I observed the memory increase with jtop and by adding the code suggested here:


Sorry for the late reply.
Do you mean that the TensorRT API doesn’t occupy a similar amount of memory?
If so, would you mind sharing a sample comparing the TensorRT API vs. the cuDNN API with us?

Please note that the library is not loaded at engine build time.
It is not loaded until inference.



My inference code was posted above on 15/Oct. To trigger cuDNN, as in your example code, I only added:

cudnnHandle_t handle_;

and then I saw the program take much more memory; the increase amounted to 700+ MB.



Yes, that memory is used for loading cuDNN, and it takes 600+ MB.

It should be similar in TensorRT:
for layers with a cuDNN implementation, the same library will be loaded at inference time.


But when doing inference with my network implemented with the TensorRT API, I didn’t see memory occupation increase much (observed with jtop).
Besides coding the network with the TensorRT API, is there anything else that needs to be configured? Thanks.


Could you first run nvprof on your implementation to see the detailed backend APIs?

$ sudo /usr/local/cuda-10.2/bin/nvprof [your app]

Although TensorRT basically leverages cuDNN, some operations might use other libraries instead.
Here is my profiling result for sample_mnist; you can see it mainly uses cuBLAS (gemm) and cuDNN:

==20592== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   11.73%  4.9390ms       296  16.685us     448ns  194.47us  [CUDA memcpy HtoD]
                   10.66%  4.4887ms       149  30.125us     416ns  330.29us  [CUDA memset]
                    3.77%  1.5858ms         8  198.23us  151.85us  247.60us  trt_volta_sgemm_128x128_relu_nn_v1
                    3.66%  1.5408ms        23  66.990us  14.368us  143.85us  void cudnn::cnn::conv2d_grouped_direct_kernel<float, float, float, float, float, float, bool=1, bool=0, int=0, int=0, int=0>(cudnnTensorStruct, float const *, cudnnFilterStruct, float const *, cudnnConvolutionStruct, cudnn::cnn::conv2d_grouped_direct_kernel<float, float, float, float, float, float, bool=1, bool=0, int=0, int=0, int=0>, float*, float, float*, cudnn::reduced_divisor, float, float, float, float, int, cudnnConvolutionStruct const *, float const *, cudnnActivationStruct)
                    3.11%  1.3091ms         8  163.63us  83.940us  245.68us  trt_volta_sgemm_64x64_relu_nn_v1


I tried running nvprof with our app on the Jetson Nano, where our app usually runs, but didn’t get output like yours; errors occurred instead. Please see the messages in the picture.


Could you run the app (./bright) as root but without nvprof?
There might be an issue if you launch DeepStream as root but remotely, since it usually requires a DISPLAY connection.


Hello, we often run ./bright as root; a crash has never been seen.
But when running ./bright under nvprof, it always ran for a while, the GUI appeared and object detection was working, then the app crashed with the error shown in the screenshot.
Just now I tried again locally on the Jetson Nano board and got the same result; please see the attached image, thanks.


Unfortunately, it seems there is an issue in the nvprof of CUDA 10.2.
As an alternative, would you mind checking whether you can run Nsight Systems for the app?


I have been trying to download the Nsight installer .deb package with the latest SDK Manager, but SDK Manager could never fetch the configuration file from your cloud server in Step 2. I don’t think it is a network problem on my side, as I can access Google, Twitter, YouTube, etc. I’ll update you once your server works again.