cuDNN 6 slow and problematic on TX2, JetPack 3.1

Hi,

I’ve recently switched to JetPack 3.1 (TensorRT 2.1, cuDNN 6) from the previous version, which had TensorRT 2.0 and cuDNN 5. I’m experiencing two problems:

  • The CNN that I’m running is about 15% slower than before. I wasn’t expecting the full 2x speedup that was advertised, but I did expect at least some improvement.

The problem seems to happen both when running through Caffe (as someone else reported here: cuDNN 6 slower than 5.1 for Caffe CNN net · Issue #5490 · BVLC/caffe · GitHub) and in my experiments with TensorRT. I’m using a slightly modified TensorRT MNIST sample to run the conversion. I can share the code if it helps, but it’s the most basic conversion code possible.

The network is based on VGG, with heavy convolutions after the initial feature extraction. These convolutions have filters of size 7x7 (pad 3) and 1x1.

  • TensorRT cannot be used at the same time as a cuDNN-enabled Caffe instance. I’m getting an error in initializeCommonContext in TensorRT when calling serializeCudaEngine. Disabling cuDNN in Caffe fixes the problem.

Are there any workarounds? I’d like to stick to TensorRT 2.1 (it supports more layers) but use cuDNN 5 - is this possible? Should I wait for cuDNN 7?

Andrei STOIAN
R&D Eng. at Thales

Hi,

There was no TensorRT 2.0 for Tegra previously.
We guess you were using TensorRT 1.0 + cuDNN v5 on JetPack 3.0.

1.
Here is the benchmark comparison of JetPack 3.0 vs. JetPack 3.1:
https://devblogs.nvidia.com/parallelforall/jetpack-doubles-jetson-inference-perf/
These results were measured on public GoogLeNet inference and show a 2x speedup with JetPack 3.1.
We will recheck the performance, and if possible it would be hugely helpful to have your model.

2.
Caffe and TensorRT should be able to run at the same time as of TensorRT v2.1.
Do you hit this error when using TensorRT 2.1?

3.
TensorRT 2.1 requires cuDNN v6, so using it with cuDNN 5 is not possible.

By the way, please remember to maximize the TX2 performance via the following commands:

sudo ./jetson_clocks.sh
sudo nvpmodel -m 0

Thanks.

Yes, sorry, I meant TensorRT 1 + cuDNN v5 on JetPack 3.0.

  1. The network I’m benchmarking uses a VGG feature extractor followed by refinement blocks from the OpenPose architecture: openpose/pose_deploy_linevec.prototxt at master · CMU-Perceptual-Computing-Lab/openpose · GitHub. Note the Concatenate layers, which are not supported by TensorRT 1. I split the network on these layers and benchmark the refinement blocks in between (5 blocks of 2 parallel chains of 5 7x7 convolutions followed by 2 1x1 convolutions) individually, then sum up the inference times for each block (see the timing sketch after this list).

    I ran the commands you gave to get maximum performance. Both TensorRT 1 + cuDNN 5 and TensorRT 2.1 + cuDNN 6 get performance boosts, but the slowdown still exists: I’m seeing a 6% slowdown going from the former to the latter. Note that this is in fp16 mode with batch size 2. In fp32 mode with batch size 1, I’m seeing a minor 1% speedup going from TensorRT 1 + cuDNN 5 to TensorRT 2.1 + cuDNN 6. I’m working on testing Caffe with cuDNN 5 vs. 6 without TensorRT.

    Note: I’m also calling setMinFindIterations(5) and setAverageFindIterations(5).

  2. With respect to TensorRT and Caffe+cuDNN running at the same time: the problem arises with TensorRT 2.1/cuDNN 6 and not with TensorRT 1/cuDNN 5. It seems to me that it’s a problem with cuDNN initialization; TensorRT initializes cuDNN after Caffe initializes its own cuDNN context.
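Regarding point 1, here is roughly how the per-block timing is done - a minimal sketch, assuming one execution context per refinement block; contexts, bindings and batchSize are placeholders for my actual setup:

#include <cuda_runtime.h>
#include <NvInfer.h>
#include <vector>

// Sketch only: time each refinement block with CUDA events and sum the results.
float timeBlocks(std::vector<nvinfer1::IExecutionContext*>& contexts,
                 std::vector<void**>& bindings, int batchSize, cudaStream_t stream)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float totalMs = 0.0f;
    for (size_t i = 0; i < contexts.size(); ++i)
    {
        cudaEventRecord(start, stream);
        contexts[i]->enqueue(batchSize, bindings[i], stream, nullptr); // run one refinement block
        cudaEventRecord(stop, stream);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        totalMs += ms; // per-block time, summed over all blocks
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return totalMs;
}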

Hi,

Thanks for your feedback.

Both issues are important to us.
We are checking these issues internally and will share information with you later.

Hi,

We tried to reproduce the Caffe + TensorRT failure in initializeCommonContext, but both work correctly in our environment.
We launched Caffe (BVLC branch) for MNIST training and ran TensorRT for GoogLeNet inference; no error occurred.

Could you share more details on how to reproduce this error? Do you launch Caffe and TensorRT in the same application?

Thanks for looking into this.

With respect to the Caffe and TensorRT incompatibility: yes, I’m running both in the same application; sorry for not being explicit in my initial post.

I’m basically linking libcaffe.so (built with cuDNN support), creating a caffe::Net object, loading weights, and running inference (net->forward()), and then I deserialize a TensorRT engine from a file - that’s when I get the error. The error in TensorRT is reported in a file named something like cudnnEngine.cpp, so I suspect a problem at cuDNN initialization inside TensorRT.
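Roughly, the structure is the following (a sketch, not my exact code; the file paths and the logger are placeholders):

#include <caffe/caffe.hpp>
#include <NvInfer.h>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

void reproduce(const std::string& prototxt, const std::string& caffemodel,
               const std::string& enginePath, nvinfer1::ILogger& logger)
{
    // 1. Caffe (built with cuDNN) creates its own cuDNN context here.
    caffe::Caffe::set_mode(caffe::Caffe::GPU);
    caffe::Net<float> net(prototxt, caffe::TEST);
    net.CopyTrainedLayersFrom(caffemodel);
    net.Forward();

    // 2. Deserializing the TensorRT engine afterwards is where the
    //    initializeCommonContext error appears.
    std::ifstream file(enginePath, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                            std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr); // error happens here
}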

Hi,

Thanks for your feedback.
Could you share the source code so we can reproduce this error?

Thanks.

Here’s the code I’m using to build the TensorRT model; it crashes on the last line. This happens when I create a caffe::Net object from an existing prototxt/caffemodel and run forward on it before running this TensorRT code. It only happens when Caffe is built with cuDNN support.

IBuilder* builder = createInferBuilder(gLogger);
const char* prototxt = modelProtoFile;
const char* caffemodel = caffeModelFile;

// Create a 16-bit model if FP16 is natively supported
nvinfer1::DataType modelDataType = mEnableFP16 ? nvinfer1::DataType::kHALF : nvinfer1::DataType::kFLOAT;

// Parse the Caffe model to populate the network, then set the outputs and create an engine
INetworkDefinition* network = builder->createNetwork();
ICaffeParser* parser = createCaffeParser();
const IBlobNameToTensor* blobNameToTensor =
    parser->parse(prototxt,      // caffe deploy file
                  caffemodel,    // caffe model file
                  *network,      // network definition that the parser will populate
                  modelDataType);

assert(blobNameToTensor != nullptr);

// The caffe file has no notion of outputs,
// so we need to manually say which tensors the engine should generate
for (size_t i = 0; i < outputs.size(); ++i)
    network->markOutput(*blobNameToTensor->find(outputs[i].c_str()));

// The maximum batch size which can be used at execution time,
// and also the batch size for which the engine will be optimized
builder->setMaxBatchSize(max_batch_size);

// The maximum GPU temporary memory which the engine can use at execution time
builder->setMaxWorkspaceSize(16 << 20);

// Set up the network for paired-fp16 (half2) format
if (mEnableFP16)
    builder->setHalf2Mode(true);

// Eliminate the side effect of the delay in GPU frequency boost
builder->setMinFindIterations(5);
builder->setAverageFindIterations(5);

builder->setDebugSync(true);

// Build the engine - this is where the crash happens
ICudaEngine* engine = builder->buildCudaEngine(*network);
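If the build succeeded, the rest of the (MNIST-sample-derived) code would just serialize the engine to a file and clean up, roughly like this (a sketch; the plan file name is a placeholder):

// Serialize the built engine to disk and release the builder objects.
assert(engine != nullptr);

nvinfer1::IHostMemory* serialized = engine->serialize();
std::ofstream planFile("engine.plan", std::ios::binary);
planFile.write(static_cast<const char*>(serialized->data()), serialized->size());

serialized->destroy();
engine->destroy();
parser->destroy();
network->destroy();
builder->destroy();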

Hi,

An update on the performance issue:

We found that nvpmodel resets the CPU/GPU clocks back to their defaults, which causes poor performance.
We are still clarifying the root cause.

Currently, the workaround is to run the following commands in this order:

sudo nvpmodel -m 0         # This enables the two Denver CPUs
sudo ./jetson_clocks.sh    # This maximizes CPU/GPU clock frequencies

With this, we get the 2x acceleration with standard GoogLeNet.
Please check whether it also helps your use case.

Thanks.

Hi,

For nvpmodel details, please check this comment:
[url]https://devtalk.nvidia.com/default/topic/1023671/jetson-tx2/low-frame-rate-with-flir-camera-on-tx2-when-using-cudafilters-library-from-opencv/post/5208659/#5208659[/url]

Hi,

I think the problem with TensorRT and Caffe being incompatible in the same executable is fixed; I reflashed the board and it’s working OK now.

I’m still having problems with running in half2 fp16 mode: the network is openpose/pose_deploy_linevec.prototxt at master · CMU-Perceptual-Computing-Lab/openpose · GitHub (get the caffemodel with openpose/getModels.sh at master · CMU-Perceptual-Computing-Lab/openpose · GitHub).

It seems that some convolutional layers, even though they compile, give invalid results (all 0s) in fp16 mode, while they work well in fp32 mode. Could you list the types of convolution that are supported in fp16 mode, or try to run the network I linked in fp16?
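For what it’s worth, this is how I’m detecting the problem - a minimal sketch, assuming the fp32 and fp16 outputs of the same layer have already been copied back to the host as float arrays:

#include <algorithm>
#include <cmath>
#include <cstdio>

// Compare the same output blob from an fp32 engine and a half2 fp16 engine.
void compareOutputs(const float* fp32Out, const float* fp16Out, int count)
{
    float maxAbsDiff = 0.0f;
    float maxAbsFp16 = 0.0f;
    for (int i = 0; i < count; ++i)
    {
        maxAbsDiff = std::max(maxAbsDiff, std::fabs(fp32Out[i] - fp16Out[i]));
        maxAbsFp16 = std::max(maxAbsFp16, std::fabs(fp16Out[i]));
    }
    // maxAbsFp16 coming out as exactly 0 for some convolution outputs
    // is what I mean by "all 0s".
    printf("max |fp32 - fp16| = %f, max |fp16| = %f\n", maxAbsDiff, maxAbsFp16);
}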

Hi,

Please check here for the supported layers:
NVIDIA Documentation Center | NVIDIA Developer

Thanks.

Yes, I’m aware of that list, but it’s not very precise: are all convolution filter sizes supported in half2 fp16 mode? 7x7? 1x1? (I’m working on the TX2 with TensorRT 2.1/cuDNN 6/JetPack 3.1.) Could you take a look at the prototxt I linked and check whether there is any possible incompatibility in fp16?

So either:

  • I'm doing something wrong in the conversion process to fp16 - unlikely, since the exact same code works in fp32 mode when I don't call setHalf2Mode(true);
  • or some of the convolutions are not supported by TensorRT;
  • or there is a bug in TensorRT for some of these convolution filter sizes.

I could also be wrong on all counts, but I’d like to eliminate some possible causes first.

Just chiming in that I’ve also seen a 10-20% performance drop after upgrading to cuDNN 6 in my Caffe network (I’m also running a model based on VGG).

Have you fixed the performance problems yet, or are you still working on resolving your TensorRT problems?

Hi, and thanks for the information. I re-flashed my TX2 with JetPack 3.1, so I no longer have cuDNN 5 and I’m afraid I can’t compare it to cuDNN 6 anymore. But I do think the slowdown still exists; maybe it was fixed in TensorRT 3, but I haven’t tried that out yet.

Hi,

Sorry. The document shared in comment #12 is not the latest.
Please check the document here:
https://developer.nvidia.com/compute/machine-learning/tensorrt/secure/3.0/rc1TensorRT3-Release-Notes-RC-pdf

When using reduced precision, either INT8 or FP16, on platforms with hardware
support for those types, pooling with window sizes other than 1,2,3,5 or 7 will fail.

Thanks for this document. However, I’m still having problems, and I have narrowed them down to the Concatenate layer. Note that this is in fp16 half2 mode.

If I let TensorRT optimize the whole network, which contains several Concatenate layers, each with multiple input blobs, the output of the network is wrong.

However, if I do the concatenation myself in custom layers, everything works fine. I don’t see a big difference performance-wise (about 0.5%), so it’s not a major problem.
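For reference, what the manual concatenation amounts to (for NCHW fp32 tensors; in half2 mode the element type and packing would differ) is just copying each input blob into the right channel offset of the output, one batch item at a time - a sketch with placeholder names:

#include <cuda_runtime.h>
#include <vector>

struct InputBlob { const float* data; int channels; }; // device pointer + channel count

void concatChannels(const std::vector<InputBlob>& inputs, float* output,
                    int batch, int height, int width, cudaStream_t stream)
{
    int totalChannels = 0;
    for (const InputBlob& b : inputs)
        totalChannels += b.channels;

    const size_t plane = (size_t)height * width; // elements per channel
    for (int n = 0; n < batch; ++n)
    {
        int channelOffset = 0;
        for (const InputBlob& b : inputs)
        {
            const float* src = b.data + (size_t)n * b.channels * plane;
            float* dst = output + ((size_t)n * totalChannels + channelOffset) * plane;
            cudaMemcpyAsync(dst, src, b.channels * plane * sizeof(float),
                            cudaMemcpyDeviceToDevice, stream);
            channelOffset += b.channels;
        }
    }
}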

However, this was a very difficult problem to track down - could you check whether it is indeed a problem in TensorRT (2.1, fp16 mode)? Maybe it only shows up for some Concatenate layer configurations (more than two input blobs)? In my experiments it doesn’t occur in fp32 mode.

Were there any changes with respect to the Concatenate layer going to TensorRT 3?

Hi,

Could you share your Concatenate layer definition so we can check it?
Thanks.

The network definition is this one:

[url]https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/models/pose/coco/pose_deploy_linevec.prototxt[/url]

Hi,

From the documentation:

Concatenation
The concatenation layer links together multiple tensors of the same height and width across the channel dimension.

The axis parameter is not functional; all blobs are concatenated along the channel dimension.
Does your use case also concatenate all blobs along the channel dimension?

Thanks.