TensorRT's nvinfer1::INetworkDefinition::addFullyConnected() does not work as expected for C3D network

Hello, Dear NVIDIA teams,

My hardware/software info is :
• Hardware Platform (Jetson Nano or NX)
• DeepStream Version 5.0
• JetPack Version (4.3)
• TensorRT Version (7.1.3)
• NVIDIA GPU Driver Version (valid for GPU only)

I’m trying to implement the video-caffe network (https://github.com/chuckcho/video-caffe), which is based on Facebook’s C3D and is used for action video classification.

The network definition/structure is simple, please see https://github.com/chuckcho/video-caffe/blob/master/examples/c3d_ucf101/c3d_ucf101_deploy.prototxt

The input data is 16 color images (CHW is 3x112x112). I coded the network as follows (some details are omitted because they are proprietary, but the key statements are listed):

nvinfer1::ITensor* data = network->addInput(mParams.inputTensorNames[0].c_str(), nvinfer1::DataType::kFLOAT, nvinfer1::Dims{4,{3, 16, 112, 112},{}});

data = createLayerBlock(network,data,"conv1a","relu1a","pool1",64,dims_kernel,dims_sp,dims_pooling,mWeightsMap["conv1a"][0],mWeightsMap["conv1a"][1]);

data = createLayerBlock(network,data,"conv5a","relu5a","pool5",256,dims_kernel,dims_sp,dims_pooling,mWeightsMap["conv5a"][0],mWeightsMap["conv5a"][1]);
nvinfer1::IFullyConnectedLayer* fcLayer1 = network->addFullyConnected(*data,2048,mWeightsMap["fc6"][0],mWeightsMap["fc6"][1]);

So, as defined in c3d_ucf101_deploy.prototxt, the “conv5a” convolution-relu-pooling block has 256 output channels and the “fc6” FullyConnected layer’s num_output is 2048. Considering that the data is 16 images, the parameter count should be 256x16x2048 = 8388608, but TensorRT requires a parameter count of 18432 = 2048 x 9. If I try to copy all 8388608 parameters in mWeightsMap["fc6"][0] into the “fc6” layer, this error occurs:

fc6: kernel weights has count 8388608 but 18432 was expected
Could not compute dimensions for (Unnamed Layer* 15) [Fully Connected]_output, because the network is invalid.
Network validation failed.

mWeightsMap["fc6"][0] has 8388608 parameters, parsed from the weights file c3d_ucf101_iter_20000.caffemodel, which was saved by video-caffe model training on the UCF101 dataset.

Please note that the other layers have no such issue: for every layer except “fc6”, the parameter count required by TensorRT is exactly the same as the count parsed from the .caffemodel. I don’t know why TensorRT calculates it wrongly for the “fc6” layer.

If I forcibly copy only 18432 parameters for the “fc6” layer, the network can be created successfully, and the final log message is:

Detected 1 input and 1 output network tensors.

I also printed the shape of the network’s output tensor. It is (256,101,1,1), so it looks like the FullyConnected layer does not merge the 256 channels’ data when doing the inner product.

So, although the network can be created successfully by copying only 18432 kernel-weight parameters for the “fc6” layer, the classification accuracy is very bad. Each time I run inference, the classification probabilities and the corresponding class label index in the network’s output tensor are almost the same, even when I feed data belonging to different classes. For example, whether I fill the input with 16 walking images or 16 swimming images, the network always outputs walking. I guess this low accuracy is related to the wrong number of kernel-weight parameters in the “fc6” layer and to TensorRT’s FullyConnected layer not merging all 256 channels when doing the inner product.

We do want to use the TRT API to implement the C3D network on Nano and ship products for large-scale deployment, because a network implemented with the TRT API saves a lot of memory, which is a big benefit. But now we are blocked here.

Dear NVIDIA guys, could you please check why TensorRT’s FC layer cannot support a 3D network correctly? Thanks in advance.


We have a question about your environment setting.
Do you really use JetPack 4.3 with Deepstream 5.0?

We only support Deepstream 4.0 for JetPack4.3 environment.
Please help to double check it first.


Sorry, I really do work on a Nano; I mixed up my PC environment with the Nano. On my Nano, L4T 32.4.3 and DeepStream 5.0 are used. Thanks.

I compiled and ran my video-caffe network code, implemented with the TensorRT API, on my Nano.


Thanks for the clarification.

This looks like a TensorRT issue rather than a DeepStream problem.
Would you mind sharing a simple reproducible source so we can check it in more depth?


Thanks a lot. Please tell me your email address and I’ll send you our source code and weights file.


Would you mind sharing it via private message directly?

Hi, I have done more investigation and found that the difference between my network implemented with the TensorRT API and video-caffe is the shape of the output tensors of the pool5 and fc6 layers (for the network definition, please see https://github.com/chuckcho/video-caffe/blob/master/examples/c3d_ucf101/c3d_ucf101_deploy.prototxt):

  1. The shape of the output tensor of pool5 layer is not the same as that in video-caffe:
    The input tensor of the pool5 layer has shape (256,2,7,7) in both my TensorRT network and video-caffe. In my TensorRT network, however, the output tensor of pool5 has shape (256,1,3,3) (which explains why TensorRT requires a parameter count of 18432 = 2048 x 1 x 3 x 3 for the fc6 layer), while in video-caffe it is (256,1,4,4). I think this is because video-caffe’s pooling applies a 3D padding of (0,1,1) by default when each of the 256 channels has 3D shape (2,7,7), so I added this workaround for pool5 in my network:

    if (strcmp(pooling_name, "pool5") == 0) {
        nvinfer1::Dims dims_padding{3,{0,1,1},{}};
        layer_pool_conv->setPaddingNd(dims_padding); // apply the padding to pool5
    }

    With this change, the output tensor of the pool5 layer also becomes (256,1,4,4), the same as in video-caffe. So the remaining problem is:

  2. The shape of the output tensor of fc6 layer is not right:
    In my network implemented with the TensorRT API, the output tensor of the fc6 layer has shape (256,2048), while in video-caffe it is (1,2048). So it looks like video-caffe merges the 256 channels’ data when doing the inner product but TensorRT does not: TensorRT does the inner product for each channel in parallel and finally outputs probability/score data with shape (256,101), while video-caffe outputs probability/score data with shape (1,101) because its fc6 layer has merged the 256 channels’ data. This is the root cause.

    So now, with the above workaround applied, TensorRT requires 2048x1x4x4 = 32768 kernel-weight parameters for the fc6 layer in my network, but the number of kernel-weight parameters parsed from the .caffemodel saved by the video-caffe training process is 2048x256x1x4x4 = 8388608. The error message is now:

     fc6: kernel weights has count 8388608 but 32768 was expected
    Could not compute dimensions for (Unnamed Layer* 15) [Fully Connected]_output, 
    because the network is invalid.
    Network validation failed.  

If I forcibly copy only 32768 parameters for the “fc6” layer, the network can be created successfully, but the inference result is still very bad.

For your reference, here I paste the shape data of output tensor of each layer in my latest network implemented with TensorRT API:

conv layer: conv1a outputDims n:4,shape:(64,16,112,112)
relu layer: relu1a outputDims n:4,shape:(64,16,112,112)
pooling layer: pool1 outputDims n:4,shape:(64,16,56,56)
conv layer: conv2a outputDims n:4,shape:(128,16,56,56)
relu layer: relu2a outputDims n:4,shape:(128,16,56,56)
pooling layer: pool2 outputDims n:4,shape:(128,8,28,28)
conv layer: conv3a outputDims n:4,shape:(256,8,28,28)
relu layer: relu3a outputDims n:4,shape:(256,8,28,28)
pooling layer: pool3 outputDims n:4,shape:(256,4,14,14)
conv layer: conv4a outputDims n:4,shape:(256,4,14,14)
relu layer: relu4a outputDims n:4,shape:(256,4,14,14)
pooling layer: pool4 outputDims n:4,shape:(256,2,7,7)
conv layer: conv5a outputDims n:4,shape:(256,2,7,7)
relu layer: relu5a outputDims n:4,shape:(256,2,7,7)
pooling layer: pool5 outputDims n:4,shape:(256,1,4,4)
Layer: fc6 outputDims n:4,shape:(256,2048,1,1)
Layer: relu6 outputDims n:4,shape:(256,2048,1,1)
Layer: fc7 outputDims n:4,shape:(256,2048,1,1)
Layer: relu7 outputDims n:4,shape:(256,2048,1,1)
Layer: fc8 outputDims n:4,shape:(256,101,1,1)
Layer: prob outputDims n:4,shape:(256,101,1,1)

and the corresponding shape data info in video-caffe is :

conv layer: conv1a, shape:(64,16,112,112)
relu layer: relu1a, shape:(64,16,112,112)
pooling layer: pool1, shape:(64,16,56,56)
conv layer: conv2a, shape:(128,16,56,56)
relu layer: relu2a, shape:(128,16,56,56)
pooling layer: pool2, shape:(128,8,28,28)
conv layer: conv3a, shape:(256,8,28,28)
relu layer: relu3a,shape:(256,8,28,28)
pooling layer: pool3,shape:(256,4,14,14)
conv layer: conv4a,shape:(256,4,14,14)
relu layer: relu4a,shape:(256,4,14,14)
pooling layer: pool4,shape:(256,2,7,7)
conv layer: conv5a,shape:(256,2,7,7)
relu layer: relu5a,shape:(256,2,7,7)
pooling layer: pool5,shape:(256,1,4,4)
Layer: fc6,shape:(1,2048)
Layer: relu6,shape:(1,2048)
Layer: fc7,shape:(1,2048)
Layer: relu7,shape:(1,2048)
Layer: fc8,shape:(1,101)
Layer: prob,shape:(1,101)

Please note I ignored all the dropout layers in video-caffe as dropout is not supported in TensorRT.

Here is a brief excerpt of my source code for your reference:

nvinfer1::ITensor* TRTC3D::createLayerBlock(SampleUniquePtr<nvinfer1::INetworkDefinition>& network,
    nvinfer1::ITensor* input, const char* conv_name, const char* relu_name, const char* pooling_name, int32_t nbFilters,
    nvinfer1::Dims dims_kernel, nvinfer1::Dims dims_sp, nvinfer1::Dims dims_pooling,
    nvinfer1::Weights weight, nvinfer1::Weights bias) {
    // conv -> relu -> max-pool block
    nvinfer1::IConvolutionLayer* layer_conv = network->addConvolutionNd(*input, nbFilters, dims_kernel, weight, bias);
    nvinfer1::IActivationLayer* layer_relu_conv = network->addActivation(*layer_conv->getOutput(0), nvinfer1::ActivationType::kRELU);
    nvinfer1::IPoolingLayer* layer_pool_conv = network->addPoolingNd(*layer_relu_conv->getOutput(0), nvinfer1::PoolingType::kMAX, dims_pooling);
    // workaround: simulate video-caffe's max-pooling behaviour: (2,7,7) -> max pooling -> (1,4,4)
    if (strcmp(pooling_name, "pool5") == 0) {
        nvinfer1::Dims dims_padding{3,{0,1,1},{}};
        layer_pool_conv->setPaddingNd(dims_padding); // apply the padding to pool5
    }
    return layer_pool_conv->getOutput(0);
}

bool TRTC3D::constructNetwork(
    SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser, SampleUniquePtr<nvinfer1::INetworkDefinition>& network)
{
    nvinfer1::ITensor* data = network->addInput(mParams.inputTensorNames[0].c_str(), nvinfer1::DataType::kFLOAT, nvinfer1::Dims{4,{3, 16, 112, 112},{}}); //data
    nvinfer1::Dims dims_kernel{3,{3,3,3},{}};
    nvinfer1::Dims dims_sp{3,{1,1,1},{}};
    nvinfer1::Dims dims_pooling{3,{1,2,2},{}};
    data = createLayerBlock(network,data,"conv1a","relu1a","pool1",64,dims_kernel,dims_sp,dims_pooling,mWeightsMap["conv1a"][0],mWeightsMap["conv1a"][1]);
    data = createLayerBlock(network,data,"conv2a","relu2a","pool2",128,dims_kernel,dims_sp,dims_pooling,mWeightsMap["conv2a"][0],mWeightsMap["conv2a"][1]);
    data = createLayerBlock(network,data,"conv3a","relu3a","pool3",256,dims_kernel,dims_sp,dims_pooling,mWeightsMap["conv3a"][0],mWeightsMap["conv3a"][1]);
    data = createLayerBlock(network,data,"conv4a","relu4a","pool4",256,dims_kernel,dims_sp,dims_pooling,mWeightsMap["conv4a"][0],mWeightsMap["conv4a"][1]);
    data = createLayerBlock(network,data,"conv5a","relu5a","pool5",256,dims_kernel,dims_sp,dims_pooling,mWeightsMap["conv5a"][0],mWeightsMap["conv5a"][1]);
    nvinfer1::IFullyConnectedLayer* fcLayer1 = network->addFullyConnected(*data,2048,mWeightsMap["fc6"][0],mWeightsMap["fc6"][1]);
    auto* fc1_relu = network->addActivation(*fcLayer1->getOutput(0),nvinfer1::ActivationType::kRELU);
    auto* fcLayer2 = network->addFullyConnected(*fc1_relu->getOutput(0),2048,mWeightsMap["fc7"][0],mWeightsMap["fc7"][1]);
    auto* fc2_relu = network->addActivation(*fcLayer2->getOutput(0),nvinfer1::ActivationType::kRELU);
    auto* fcLayer3 = network->addFullyConnected(*fc2_relu->getOutput(0),101,mWeightsMap["fc8"][0],mWeightsMap["fc8"][1]);
    auto* softmax_layer = network->addSoftMax(*fcLayer3->getOutput(0));
    data = softmax_layer->getOutput(0);
    // ...
}


Sorry for keeping you waiting.

1. NdPooling with stride=[2x2x2], kernel=[2x2x2]

TensorRT: [256x2x7x7] -> [256x1x3x3]
Caffe: [256x2x7x7] -> [256x1x4x4]

The difference in output dimensions comes from the rounding mode of the pooling layer.

Caffe uses ceil mode by default, while TensorRT uses floor mode.
Adding the padding parameter can approximate ceil mode in TensorRT.
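
The rounding difference can be checked with a few lines of arithmetic. The following is a minimal standalone sketch, not TensorRT or Caffe code: `pooledExtent` and `windowStart` are hypothetical helper names, the formula is the usual pooling output-size formula, and Caffe's extra clipping of the last window when padding is used is left out.

```cpp
#include <cassert>

// Output extent of one pooling dimension.
// ceilMode = false models floor rounding, ceilMode = true models ceil rounding.
int pooledExtent(int in, int kernel, int stride, int pad, bool ceilMode) {
    assert(kernel > 0 && stride > 0 && pad >= 0);
    const int span = in + 2 * pad - kernel;  // distance the window can travel
    assert(span >= 0);
    const int steps = ceilMode ? (span + stride - 1) / stride : span / stride;
    return steps + 1;
}

// Index of the first input element covered by output position outIndex
// (negative values fall into the padded border).
int windowStart(int outIndex, int stride, int pad) {
    return outIndex * stride - pad;
}
```

For the 7-wide spatial dimensions of pool5 with kernel 2 and stride 2, floor rounding gives 3, ceil rounding gives 4, and floor rounding with padding 1 also gives 4. The extents match, but the windows do not: with padding 1 they start at -1, 1, 3, 5, while ceil mode without padding starts at 0, 2, 4, 6, so the padded layer is only an approximation of the Caffe layer.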

2. Fully connected layer

TensorRT: [256x1x4x4] -> [256x2048], expected weight count 32768
Caffe: [256x1x4x4] -> [1x2048], expected weight count 8388608

Based on your description, it seems that Caffe applies a flatten step automatically before the fully connected layer.
So we guess the workflow of your use case should look like this:

[256x1x4x4] -> flatten -> [1x4096] -> fully connected -> [1x2048]

So the expected weight count is 4096x2048=8388608.
Please try adding a flatten layer first to see if it helps.
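
The two weight counts can be reproduced with a short standalone sketch. It assumes the behaviour the numbers in this thread point to: the fully connected layer multiplies only the last three dimensions together as its input vector and treats all leading dimensions as independent rows. `FcShape` and `fullyConnectedShape` are hypothetical names used for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct FcShape {
    std::int64_t rows;          // product of the leading dimensions
    std::int64_t inputsPerRow;  // product of the last three dimensions
    std::int64_t weightCount;   // inputsPerRow * numOutputs
};

// Shape bookkeeping for a fully connected layer that flattens only the
// last three dimensions of its input (an assumption, see the text above).
FcShape fullyConnectedShape(const std::vector<std::int64_t>& dims, std::int64_t numOutputs) {
    assert(dims.size() >= 3);
    const std::size_t split = dims.size() - 3;
    std::int64_t rows = 1;
    std::int64_t inputsPerRow = 1;
    for (std::size_t i = 0; i < dims.size(); ++i) {
        if (i < split) {
            rows *= dims[i];
        } else {
            inputsPerRow *= dims[i];
        }
    }
    return {rows, inputsPerRow, inputsPerRow * numOutputs};
}
```

With dims (256,1,4,4) and 2048 outputs this yields 256 rows and 32768 weights, matching the error message. After reshaping to (4096,1,1) it yields a single row and 8388608 weights, matching the count stored in the .caffemodel.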


Thanks a lot for your feedback.

WRT “add a flatten layer”: it looks like the nvinfer1::INetworkDefinition class has no such API. Do I need to implement it myself with addPluginV2(), or just call addConvolutionNd() with kernelSize (1,1,1)?


No. A flatten layer is a special case of the reshape (Shuffle) layer:

You can find an example in the onnx2trt source:


Thanks a lot! Following your guidance, I added a shuffle layer and reshaped the output tensor of the pool5 layer to (4096,1,1) with the shuffle layer’s setReshapeDimensions(), and now network inference works.

Questions on two points:

  1. Compared with the original video-caffe network C++ code running on Nano, this network implemented with the TensorRT API has a small accuracy degradation, although it uses fp32, not fp16 or int8. (P.S. other ONNX models have a similar problem when they are parsed into a TensorRT engine and run with that engine.)
  2. There is also some performance degradation. It looks like TensorRT doesn’t use cuDNN by default, so the 3D convolution performance is not as good as that of the original video-caffe network, whose 3D convolution is implemented with the cuDNN API.

Do you have any suggestions for improving the inference accuracy and 3D convolution performance of the network implemented with the TensorRT API? Thanks.


1. There are two possible reasons for this:

  • Different input image:
    A common issue is from the data preprocessing step.
    Please check if they are using the same data format, mean subtraction and normalization procedure.

  • Some difference between TensorRT API and Caffe.
    It’s recommended to check if the pooling approximation discussed above causes any precision issue or not.

2. TensorRT is implemented mainly with cuDNN.
Do you have any extra memcpy when deploying a layer with TensorRT API?
This may cause some performance issue compared to the cuDNN.
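
For the preprocessing check in point 1, one thing to verify is how the 16 frames are packed into the 3x16x112x112 input. The sketch below is only an illustration under stated assumptions: frames arrive as interleaved HWC bytes, the network expects planar C x D x H x W floats, and a per-channel mean is subtracted. `Frame`, `packClip` and the mean values are hypothetical; the real mean and channel order must match whatever video-caffe used during training.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One decoded frame in interleaved HWC byte layout.
struct Frame {
    int height;
    int width;
    std::vector<std::uint8_t> hwc;
};

// Pack D frames into a planar C x D x H x W float buffer and subtract a
// per-channel mean (hypothetical values supplied by the caller).
std::vector<float> packClip(const std::vector<Frame>& frames, const float mean[3]) {
    assert(!frames.empty());
    const std::size_t D = frames.size();
    const std::size_t H = static_cast<std::size_t>(frames[0].height);
    const std::size_t W = static_cast<std::size_t>(frames[0].width);
    std::vector<float> out(3 * D * H * W);
    for (std::size_t d = 0; d < D; ++d) {
        const Frame& f = frames[d];
        assert(f.hwc.size() == 3 * H * W);
        for (std::size_t c = 0; c < 3; ++c) {
            for (std::size_t y = 0; y < H; ++y) {
                for (std::size_t x = 0; x < W; ++x) {
                    const std::size_t src = (y * W + x) * 3 + c;
                    const std::size_t dst = ((c * D + d) * H + y) * W + x;
                    out[dst] = static_cast<float>(f.hwc[src]) - mean[c];
                }
            }
        }
    }
    return out;
}
```

Dumping this buffer from both the TensorRT path and the video-caffe path and comparing them element by element rules out the input as the source of the difference.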


Thanks for your feedback. Both video-caffe and this network use the same input image data, as they use the same images and the same processInput() (see below). My inference code for this network is as follows:

vector<float> TRTC3D::infer(vector<Mat>* p_cvImgs, int nClasses)
{
    // Create RAII buffer manager object
    samplesCommon::BufferManager buffers(mEngine, mParams.batchSize);

    auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
    if (!context)
        return vector<float>();

    // Read the input data into the managed buffers
    // There should be just 1 input tensor
    assert(mParams.inputTensorNames.size() == 1);
    if (!processInput(buffers, p_cvImgs))
        return vector<float>();
#if  1
    // Create CUDA stream for the execution of this inference.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronously copy data from host input buffers to device input buffers
    buffers.copyInputToDeviceAsync(stream);

    // Asynchronously enqueue the inference work
    if (!context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), stream, nullptr))
        return vector<float>();
    // Asynchronously copy data from device output buffers to host output buffers
    buffers.copyOutputToHostAsync(stream);

    // Wait for the work in the stream to complete
    cudaStreamSynchronize(stream);

    // Release stream
    cudaStreamDestroy(stream);
#else
    bool status = context->execute(mParams.batchSize, buffers.getDeviceBindings().data());
    if (!status) {
        return vector<float>();
    }
#endif
    vector<float> result = processOutput(buffers, nClasses);
    return result;
}

To use cuDNN, do I need to replace the cudaxxx() APIs called above with cudnnxxx() APIs? If yes, is there a good, complete sample? I found some code in /usr/src/tensorrt/samples/samplePlugin/fcPlugin.h, but it looks incomplete.


Figuring out the root cause of the difference will require investigating the output layer by layer.
A common issue is that some parameter is missing or not set correctly,
e.g. the coefficient in the eltwise layer.
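
A layer-by-layer check boils down to dumping the same tensor from both frameworks and measuring how far apart the values are. The helper below is a minimal standalone sketch with hypothetical names (`DiffStats`, `compareOutputs`); it assumes both dumps are already available as flat float arrays in the same element order. On the TensorRT side an intermediate tensor can be exposed for dumping by marking it as a network output with INetworkDefinition::markOutput().

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct DiffStats {
    float maxAbs;        // largest absolute element-wise difference
    std::size_t argMax;  // flat index where that difference occurs
};

// Compare two dumps of the same layer output, element by element.
DiffStats compareOutputs(const std::vector<float>& reference, const std::vector<float>& candidate) {
    assert(reference.size() == candidate.size());
    DiffStats stats{0.0f, 0};
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const float diff = std::fabs(reference[i] - candidate[i]);
        if (diff > stats.maxAbs) {
            stats.maxAbs = diff;
            stats.argMax = i;
        }
    }
    return stats;
}
```

Running this from conv1a onwards and stopping at the first layer whose maximum difference jumps narrows the search to one layer; given the pooling approximation discussed earlier, pool5 is a natural first candidate.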

Not sure which API you mean.
For inference, the TensorRT API is built on cuDNN, and the calls should be nvinfer1::....

Some basic APIs, like cudaStreamCreate, are used for controlling the workflow.
You don’t need to convert them into cudnn_ calls.


My network, which I pasted on 23 Sep, is constructed entirely with the nvinfer1:: API. It is built in this function:

bool TRTC3D::constructNetwork(
    SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser, SampleUniquePtr<nvinfer1::INetworkDefinition>& network)

and the code which calls this function to build the network is as follows:

    auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger()));
    if (!builder)
        return false;

    auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetwork());
    if (!network)
        return false;

    auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    if (!config)
        return false;

    auto parser = SampleUniquePtr<nvcaffeparser1::ICaffeParser>(nullptr);   

    if (!constructNetwork(parser, network))
        return false;

As for the inference code, as the code I pasted on 15 Oct shows (it is also almost copied from your example code), inference first processes the input data, then enqueues the data into the GPU device’s buffers, then executes and processes the output data.

By observing memory, I can see that only when a cudnnxxx() API is forcibly called in my TRTC3D::infer(), such as “cudnnCreate(&cudnn);”, memory usage is 700+ MB higher than when “cudnnCreate(&cudnn);” is commented out. So, judging from memory usage, I guess the cuDNN library is not used by default unless a cudnnxxx() API is called.

Do you mean that when “nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger())” is called, the cuDNN library is loaded automatically?


Most of our libraries are loaded on demand.
So TensorRT loads the cuDNN library at inference time rather than at build time.


But I never saw memory usage increase much when I ran inference with this network implemented with the TensorRT API. I saw about 700 MB more memory in use if I forcibly added a cudnnxxx() call to the inference code of this network, just as video-caffe does when it calls cudnnxxx() in its convolution layers.
I observed the memory increase with jtop or by adding the code suggested here: