Caffe model with Concatenation layer gives wrong results using TensorRT 3/4 on TX2

Initially, my Caffe model, built from “Input”, “Convolution”, “BatchNorm”, “Scale”, “ReLU” and “Pooling” layers, worked fine on my TX2 with JetPack 3.2.1 and TensorRT 3.0.4-1.
Then I modified the model to contain additional “Concatenation” layers. The modified model works fine on my host PC when tested with pyCaffe. But when I deploy it to the TX2, it gives wrong results, with every output being “0.0”.
I suspect the newly added “Concatenation” layers. I found this topic, which says the problem was reported and should be fixed in TensorRT 3.0 GA. I also noticed that “Concatenation” is not listed among the layer types supported by NvCaffeParser in the Developer Guide (NvCaffeParser); instead, it appears in the TensorRT 4 guide (3.1. Supported Operations).
So I installed JetPack 3.3, which contains TensorRT 4.0.2-1, but the wrong results persist.
Has anyone tested TensorRT on a TX2 with a Caffe model containing a “Concatenation” layer?
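For reference, the added layers follow the standard Caffe form (the layer and blob names below are illustrative, not copied from my model; note the prototxt type string is “Concat”, while the TensorRT docs call the operation “Concatenation”):

```
layer {
  name: "concat1"            # illustrative name
  type: "Concat"             # Caffe's prototxt type string
  bottom: "conv1"            # illustrative input blobs
  bottom: "conv2"
  top: "concat1"
  concat_param { axis: 1 }   # concatenate along the channel axis
}
```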


Could you share your caffemodel with us for checking?

@AstaLLL, sorry for the delay. I have sent them to you via private message.


Thanks for your update.

We can launch your model with TensorRT.
We will check this in detail and get back to you.


Any progress?
Were you able to reproduce the all-zero outputs with the model containing “Concatenation” layers?


Sorry for the late reply.

We also get zero output from both models.
What is your input data range? [0, 255] or [0,1]?

Could you share a simple sample to run for your model?


Sorry for the late reply, too.

The input data is normalized by BGR/255 to [0, 1].

Sorry, I can’t share a sample.

The results given by Caffe and TensorRT shouldn’t differ noticeably. I’ll try to build the model with the TensorRT API later; if that works, then it’s nvCaffeParser that causes the problem.
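For reference, building the concatenation directly with the TensorRT network-definition API (bypassing nvCaffeParser) would look roughly like the sketch below; the `network` object and the tensors `t1`/`t2` are assumptions standing in for an existing `INetworkDefinition*` and the outputs of earlier layers, not names from my code:

```cpp
// Sketch only: assumes an existing nvinfer1::INetworkDefinition* network
// and two nvinfer1::ITensor* outputs t1, t2 from earlier layers.
nvinfer1::ITensor* inputs[] = {t1, t2};
nvinfer1::IConcatenationLayer* concat =
    network->addConcatenation(inputs, 2);  // concatenates along the channel axis by default
concat->getOutput(0)->setName("concat1"); // illustrative tensor name
```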

Did you mean that both the models with and without “Concatenation” layers give zero outputs?
The model without “Concatenation” layers works fine for me; only the model with “Concatenation” layers gives zero outputs.


We tested your model with a [0, 255] input image.
We will try to reproduce this with an image normalized to [0, 1].

It would also help if you could extract a simple source for us to reproduce the issue.

Thanks for your effort.
Here is the part of my code that does the inference (similar to sampleMNIST in the TensorRT release).

// runs the TensorRT inference engine
// allocates the buffers, sets inputs, executes the engine, and decodes the output
std::vector<Detection> Detector::detect(const cv::Mat &image)
{
	// create buffer manager object
	mBuffers = std::make_unique<tensorrt::BufferManager>(mEngine, mParams.batchSize);
	if (!mBuffers) {
		std::cout << "Error creating buffer manager.\n";
		return {};
	}

	// create execution context
	mContext = std::move(UniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext()));
	if (!mContext) {
		std::cout << "Error creating execution context.\n";
		return {};
	}

	// Create CUDA stream for the execution of this inference.
	mStream = std::make_unique<cudaStream_t>();
	cudaStreamCreate(mStream.get());

	const int inputC = mInputDims.d[0];
	const int inputH = mInputDims.d[1];
	const int inputW = mInputDims.d[2];

	cv::Mat resized;
	cv::resize(image, resized, cv::Size(inputW, inputH));

	// sets input: interleaved HWC (BGR) -> planar CHW, normalized to [0, 1]
	float* hostInputBuffer = static_cast<float*>(mBuffers->getHostBuffer(mParams.inputTensorName));
	for (int indxC = 0; indxC < inputC; indxC++) {
		for (int indxH = 0; indxH < inputH; indxH++) {
			for (int indxW = 0; indxW < inputW; indxW++) {
				long long indxImg = indxH*inputW*inputC + indxW*inputC + indxC;
				long long indxInput = indxC*inputH*inputW + indxH*inputW + indxW;
				hostInputBuffer[indxInput] = float(resized.data[indxImg]) / 255.0f;
			}
		}
	}

	// Asynchronously copy data from host input buffers to device input buffers
	mBuffers->copyInputToDeviceAsync(*mStream);

	// Asynchronously enqueue the inference work
	if (!mContext->enqueue(mParams.batchSize, mBuffers->getDeviceBindings().data(), *mStream, nullptr)) exit(-1);

	// Asynchronously copy data from device output buffers to host output buffers
	mBuffers->copyOutputToHostAsync(*mStream);

	// Wait for the work in the stream to complete
	cudaStreamSynchronize(*mStream);

	// Get the output of the inference
	const float* netout = static_cast<const float*>(mBuffers->getHostBuffer(mParams.outputTensorName));

	return decode_netout(netout);
}

BTW, I tried the model with “Concatenation” layers using the new TensorRT 5 on Windows 10 x64 with Visual Studio 2015, and it works fine! I’ll try TensorRT 5 on the TX2 later. A lot of work would be needed to move our project to JetPack 4 with TensorRT 5, so it would be a big help if you could find a solution with TensorRT 3. Thanks a lot.

I tried TensorRT 3/4/5 on a host PC with Ubuntu 16.04 using the same Caffe model with “Concatenation” layers; only TensorRT 5 gives correct results.


There are several fixes in TensorRT 5.0 GA, but this package is currently only available for desktop users.
Please wait for our announcement of the next JetPack release.