cuDNN 6 slow and problematic on TX2, JetPack 3.1

Hi,

I’ve recently switched to JetPack 3.1 (TensorRT 2.1, cuDNN 6) from the previous version, which had TensorRT 2.0 and cuDNN 5. I’m experiencing two problems:

  • The CNN that I’m running is about 15% slower than before. I wasn’t expecting the full 2x speedup that was advertised, but I did expect at least some improvement.

The problem seems to happen both when running through Caffe (as someone else reported here: cuDNN 6 slower than 5.1 for Caffe CNN net · Issue #5490 · BVLC/caffe · GitHub) and in my experiments with TensorRT. I’m using a slightly modified TensorRT MNIST sample to run the conversion. I can share the code if it helps, but it’s the most basic conversion code possible.

The network is based on VGG, with heavy convolutions after the initial feature extraction. These convolutions have filters of size 7x7 (pad 3) and 1x1.

  • TensorRT cannot be used at the same time as a cuDNN-enabled Caffe instance. I’m getting an error in initializeCommonContext in TensorRT when calling serializeCudaEngine. Disabling cuDNN in Caffe fixes the problem.

Are there any workarounds? I’d like to stick to TensorRT 2.1 (it supports more layers) but use cuDNN 5 - is this possible? Should I wait for cuDNN 7?

Andrei STOIAN
R&D Eng. at Thales

Hi,

There was no TensorRT 2.0 for Tegra previously.
We guess you were using TensorRT 1.0 + cuDNN v5 on JetPack 3.0.

1.
Here is the benchmark comparison of JetPack 3.0 vs. JetPack 3.1:
https://devblogs.nvidia.com/parallelforall/jetpack-doubles-jetson-inference-perf/
These results were measured on public GoogLeNet inference and show a 2x speedup with JetPack 3.1.
We will recheck the performance, and if possible it would be hugely helpful to have your model.

2.
Caffe and TensorRT should be able to run at the same time as of TensorRT v2.1.
Do you hit this error when using TensorRT 2.1?

3.
TensorRT 2.1 requires cuDNN v6, so using it with cuDNN 5 is not possible.

By the way, please remember to maximize the TX2 performance via the following commands:

sudo ./jetson_clocks.sh
sudo nvpmodel -m 0

Thanks.

Yes, sorry, I meant TensorRT 1 + cuDNN v5 on JetPack 3.0.

  1. The network I’m benchmarking uses a VGG feature extractor followed by refinement blocks from the OpenPose architecture: openpose/pose_deploy_linevec.prototxt at master · CMU-Perceptual-Computing-Lab/openpose · GitHub. Note the Concatenate layers, which are not supported by TensorRT 1. I split the network on these layers and benchmark the refinement blocks in between (5 blocks of 2 parallel chains of 5 7x7 convolutions followed by 2 1x1 convolutions) individually, then sum up the inference times for each block (see the timing sketch after this list).

    I ran the commands you gave to get maximum performance. Both TensorRT 1 + cuDNN 5 and TensorRT 2.1 + cuDNN 6 get performance boosts, but the slowdown still exists: I’m seeing a 6% slowdown going from the former to the latter. Note that this is in fp16 mode with batch size 2. In fp32 mode with batch size 1, I’m seeing a minor 1% speedup going from TensorRT 1 + cuDNN 5 to TensorRT 2.1 + cuDNN 6. I’m working on testing Caffe with cuDNN 5 vs. 6 without TensorRT.

    Note: I’m also calling setMinFindIterations(5) and setAverageFindIterations(5).

  2. With respect to TensorRT and Caffe+cuDNN running at the same time: the problem arises with TensorRT 2.1/cuDNN 6 and not with TensorRT 1/cuDNN 5. It seems to me that it’s a problem with cuDNN initialization; TensorRT initializes cuDNN after Caffe initializes its own cuDNN context.
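Regarding point 1, here is roughly how the per-block timing is done - a minimal sketch, assuming one execution context per refinement block; contexts, bindings and batchSize are placeholders for my actual setup:

#include <cuda_runtime.h>
#include <NvInfer.h>
#include <vector>

// Sketch only: time each refinement block with CUDA events and sum the results.
float timeBlocks(std::vector<nvinfer1::IExecutionContext*>& contexts,
                 std::vector<void**>& bindings, int batchSize, cudaStream_t stream)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float totalMs = 0.0f;
    for (size_t i = 0; i < contexts.size(); ++i)
    {
        cudaEventRecord(start, stream);
        contexts[i]->enqueue(batchSize, bindings[i], stream, nullptr); // run one refinement block
        cudaEventRecord(stop, stream);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        totalMs += ms; // per-block time, summed over all blocks
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return totalMs;
}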

Hi,

Thanks for your feedback.

Both issues are important to us.
We are checking these issues internally and will share information with you later.

Hi,

We tried to reproduce the Caffe + TensorRT failure in initializeCommonContext, but both work correctly in our environment.
We launched Caffe (BVLC branch) for MNIST training and ran TensorRT for GoogLeNet inference; no error occurred.

Could you share more details on how to reproduce this error? Do you launch Caffe and TensorRT in the same application?

Thanks for looking into this.

With respect to the Caffe and TensorRT incompatibility: yes, I’m running both in the same application; sorry for not being explicit in my initial post.

I’m basically linking libcaffe.so (built with cuDNN support), creating a caffe::Net object, loading weights, and running inference (net->forward()), and then I deserialize a TensorRT engine from a file - that’s when I get the error. The error in TensorRT is reported in a file named something like cudnnEngine.cpp, so I suspect a problem at cuDNN initialization inside TensorRT.
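Roughly, the structure is the following (a sketch, not my exact code; the file paths and the logger are placeholders):

#include <caffe/caffe.hpp>
#include <NvInfer.h>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

void reproduce(const std::string& prototxt, const std::string& caffemodel,
               const std::string& enginePath, nvinfer1::ILogger& logger)
{
    // 1. Caffe (built with cuDNN) creates its own cuDNN context here.
    caffe::Caffe::set_mode(caffe::Caffe::GPU);
    caffe::Net<float> net(prototxt, caffe::TEST);
    net.CopyTrainedLayersFrom(caffemodel);
    net.Forward();

    // 2. Deserializing the TensorRT engine afterwards is where the
    //    initializeCommonContext error appears.
    std::ifstream file(enginePath, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                            std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr); // error happens here
}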

Hi,

Thanks for your feedback.
Could you share the source code so we can reproduce this error?

Thanks.

Here’s the code I’m using to build the TensorRT model; it crashes on the last line. This happens when I create a caffe::Net object from an existing prototxt/caffemodel and run forward on it before running this TensorRT code. It only happens when Caffe is built with cuDNN support.

IBuilder* builder = createInferBuilder(gLogger);
const char* prototxt = modelProtoFile;
const char* caffemodel = caffeModelFile;

// Create a 16-bit model if FP16 is natively supported
nvinfer1::DataType modelDataType = mEnableFP16 ? nvinfer1::DataType::kHALF : nvinfer1::DataType::kFLOAT;

// Parse the Caffe model to populate the network, then set the outputs and create an engine
INetworkDefinition* network = builder->createNetwork();
ICaffeParser* parser = createCaffeParser();
const IBlobNameToTensor* blobNameToTensor =
    parser->parse(prototxt,      // caffe deploy file
                  caffemodel,    // caffe model file
                  *network,      // network definition that the parser will populate
                  modelDataType);

assert(blobNameToTensor != nullptr);

// The caffe file has no notion of outputs,
// so we need to manually say which tensors the engine should generate
for (size_t i = 0; i < outputs.size(); ++i)
    network->markOutput(*blobNameToTensor->find(outputs[i].c_str()));

// The maximum batch size which can be used at execution time,
// and also the batch size for which the engine will be optimized
builder->setMaxBatchSize(max_batch_size);

// The maximum GPU temporary memory which the engine can use at execution time
builder->setMaxWorkspaceSize(16 << 20);

// Set up the network for paired-fp16 (half2) format
if (mEnableFP16)
    builder->setHalf2Mode(true);

// Eliminate the side effect of the delay in GPU frequency boost
builder->setMinFindIterations(5);
builder->setAverageFindIterations(5);

builder->setDebugSync(true);

// Build the engine - this is where the crash happens
ICudaEngine* engine = builder->buildCudaEngine(*network);
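If the build succeeded, the rest of the (MNIST-sample-derived) code would just serialize the engine to a file and clean up, roughly like this (a sketch; the plan file name is a placeholder):

// Serialize the built engine to disk and release the builder objects.
assert(engine != nullptr);

nvinfer1::IHostMemory* serialized = engine->serialize();
std::ofstream planFile("engine.plan", std::ios::binary);
planFile.write(static_cast<const char*>(serialized->data()), serialized->size());

serialized->destroy();
engine->destroy();
parser->destroy();
network->destroy();
builder->destroy();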

Hi,

An update on the performance issue:

We found that nvpmodel resets the CPU/GPU clocks back to their defaults, which causes poor performance.
We are still clarifying the root cause.

Currently, the workaround is to run the following commands in this order:

sudo nvpmodel -m 0         # This enables the two Denver CPUs
sudo ./jetson_clocks.sh    # This maximizes CPU/GPU clock frequencies

With this, we get the 2x acceleration with standard GoogLeNet.
Please check whether it also helps your use case.

Thanks.

Hi,

For nvpmodel details, please check this comment:
[url]https://devtalk.nvidia.com/default/topic/1023671/jetson-tx2/low-frame-rate-with-flir-camera-on-tx2-when-using-cudafilters-library-from-opencv/post/5208659/#5208659[/url]

Hi,

I think the problem with TensorRT and Caffe being incompatible in the same executable is fixed; I reflashed the board and it’s working OK now.

I’m still having problems with running in half2 fp16 mode: the network is openpose/pose_deploy_linevec.prototxt at master · CMU-Perceptual-Computing-Lab/openpose · GitHub (get the caffemodel with openpose/getModels.sh at master · CMU-Perceptual-Computing-Lab/openpose · GitHub).

It seems that some convolutional layers, even though they compile, give invalid results (all 0s) in fp16 mode, while they work well in fp32 mode. Could you list the types of convolution that are supported in fp16 mode, or try to run the network I linked in fp16?
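For what it’s worth, this is how I’m detecting the problem - a minimal sketch, assuming the fp32 and fp16 outputs of the same layer have already been copied back to the host as float arrays:

#include <algorithm>
#include <cmath>
#include <cstdio>

// Compare the same output blob from an fp32 engine and a half2 fp16 engine.
void compareOutputs(const float* fp32Out, const float* fp16Out, int count)
{
    float maxAbsDiff = 0.0f;
    float maxAbsFp16 = 0.0f;
    for (int i = 0; i < count; ++i)
    {
        maxAbsDiff = std::max(maxAbsDiff, std::fabs(fp32Out[i] - fp16Out[i]));
        maxAbsFp16 = std::max(maxAbsFp16, std::fabs(fp16Out[i]));
    }
    // maxAbsFp16 coming out as exactly 0 for some convolution outputs
    // is what I mean by "all 0s".
    printf("max |fp32 - fp16| = %f, max |fp16| = %f\n", maxAbsDiff, maxAbsFp16);
}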

Hi,

Please check here for the supported layers:
NVIDIA Documentation Center | NVIDIA Developer

Thanks.

Yes, I’m aware of that list, but it’s not very precise: are all convolution filter sizes supported in half2 fp16 mode? 7x7? 1x1? (I’m working on the TX2 with TensorRT 2.1/cuDNN 6/JetPack 3.1.) Could you take a look at the prototxt I linked and check whether there is any possible incompatibility in fp16?

So either:

  • I'm doing something wrong in the conversion process to fp16 - unlikely, since the exact same code works in fp32 mode when I don't call setHalf2Mode(true);
  • or some of the convolutions are not supported by TensorRT;
  • or there is a bug in TensorRT for some of these convolution filter sizes.

I could also be wrong on all counts, but I’d like to eliminate some possible causes first.

Just chiming in that I’ve also seen a 10-20% performance drop after upgrading to cuDNN 6 in my Caffe network (I’m also running a model based on VGG).

Have you fixed the performance problems yet, or are you still working on resolving your TensorRT problems?

Hi, and thanks for the information. I re-flashed my TX2 with JetPack 3.1, so I no longer have cuDNN 5 and I’m afraid I can’t compare it to cuDNN 6 anymore. But I do think the slowdown still exists; maybe it was fixed in TensorRT 3, but I haven’t tried that out yet.

Hi,

Sorry. The document shared in comment #12 is not the latest.
Please check the document here:
https://developer.nvidia.com/compute/machine-learning/tensorrt/secure/3.0/rc1TensorRT3-Release-Notes-RC-pdf

When using reduced precision, either INT8 or FP16, on platforms with hardware
support for those types, pooling with window sizes other than 1,2,3,5 or 7 will fail.

Thanks for this document. However, I’m still having problems, and I have narrowed them down to the Concatenate layer. Note that this is in fp16 half2 mode.

If I let TensorRT optimize the whole network, which contains several Concatenate layers, each with multiple input blobs, the output of the network is wrong.

However, if I do the concatenation myself in custom layers, everything works fine. I don’t see a big difference performance-wise (about 0.5%), so it’s not a major problem.
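For reference, what the manual concatenation amounts to (for NCHW fp32 tensors; in half2 mode the element type and packing would differ) is just copying each input blob into the right channel offset of the output, one batch item at a time - a sketch with placeholder names:

#include <cuda_runtime.h>
#include <vector>

struct InputBlob { const float* data; int channels; }; // device pointer + channel count

void concatChannels(const std::vector<InputBlob>& inputs, float* output,
                    int batch, int height, int width, cudaStream_t stream)
{
    int totalChannels = 0;
    for (const InputBlob& b : inputs)
        totalChannels += b.channels;

    const size_t plane = (size_t)height * width; // elements per channel
    for (int n = 0; n < batch; ++n)
    {
        int channelOffset = 0;
        for (const InputBlob& b : inputs)
        {
            const float* src = b.data + (size_t)n * b.channels * plane;
            float* dst = output + ((size_t)n * totalChannels + channelOffset) * plane;
            cudaMemcpyAsync(dst, src, b.channels * plane * sizeof(float),
                            cudaMemcpyDeviceToDevice, stream);
            channelOffset += b.channels;
        }
    }
}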

However, this was a very difficult problem to track down - could you check whether it is indeed a problem in TensorRT (2.1, fp16 mode)? Maybe it only shows up for some Concatenate layer configurations (more than two input blobs)? In my experiments it doesn’t occur in fp32 mode.

Were there any changes with respect to the Concatenate layer going to TensorRT 3?

Hi,

Could you share your Concatenate layer definition so we can check it?
Thanks.

The network definition is this one:

[url]https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/models/pose/coco/pose_deploy_linevec.prototxt[/url]

Hi,

From the documentation:

Concatenation
The concatenation layer links together multiple tensors of the same height and width across the channel dimension.

The axis parameter is not functional; all blobs are concatenated along the channel dimension.
Does your use case also concatenate all blobs along the channel dimension?

Thanks.