getDimensions() and getBindingDimensions() different on host and on Jetson AGX Xavier

Hi,
I’m having an interesting issue. I have an ONNX network that I want to load and parse in TensorRT. Previously I used the onnx2trt utility, but now I parse it directly.
Some of the code I can’t post here since it’s from work; I’ve modified it to hopefully reflect my issue, but if you need more info please ask:
bool profileOnnx(const std::string& network_path, std::ostream& gie_model_stream) {

  auto builder = tensorUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(gLogger));
  if (!builder) {
    return false;
  }

  auto network = tensorUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(0U));
  if (!network) {
    return false;
  }

  auto config = tensorUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
  if (!config) {
    return false;
  }

  config->setMinTimingIterations(3);
  config->setMaxWorkspaceSize(16 << 20);
  config->setAvgTimingIterations(2);

  auto parser =
      tensorUniquePtr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, gLogger));
  if (!parser) {
    return false;
  }

  std::stringstream concatenated_path_stream;
  concatenated_path_stream << INSTALL_PREFIX << network_path;
  std::string onnx_path = concatenated_path_stream.str();

  int verbosity = (int)nvinfer1::ILogger::Severity::kERROR;
  if (!parser->parseFromFile(onnx_path.c_str(), verbosity)) {
    return false;
  }

  builder->setMaxBatchSize(params.batch_size);
  if (strToDeviceType(params.inference_device) == deviceType::DEVICE_DLA) {
    // Enabling DLA
    config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
    // config->setDLACore(sys.DLACore);
    // Allow GPU fallback
    config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);
  } else {
    config->setDefaultDeviceType(nvinfer1::DeviceType::kGPU);
  }

  auto engine = std::shared_ptr<nvinfer1::ICudaEngine>(
      builder->buildEngineWithConfig(*network, *config), tensorDeleter());
  if (!engine) {
    return false;
  }

  // Dimensions reported by the network definition for the first output tensor.
  nvinfer1::Dims dimerinos = network->getOutput(0)->getDimensions();
  printf("DIMERINOS %d %d %d %d\n", dimerinos.nbDims, dimerinos.d[0], dimerinos.d[1], dimerinos.d[2]);

  // Serialize the built engine into the caller-provided stream.
  nvinfer1::IHostMemory* net_mem = engine->serialize();
  gie_model_stream.write((const char*)net_mem->data(), net_mem->size());
  net_mem->destroy();
  return true;
}
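
For completeness, the call site looks roughly like this (a simplified sketch; the file names "snet.engine" and "networks/snet.onnx" are placeholders, not my real paths):

#include <fstream>

// Hypothetical call site: build the engine once and keep the serialized blob
// on disk so the device side can deserialize it later.
bool buildAndSaveEngine() {
  std::ofstream engine_file("snet.engine", std::ios::binary);
  return profileOnnx("networks/snet.onnx", engine_file);
}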

The issue is that getDimensions() and getBindingDimensions() return different results on the host and on the device. I know my network’s input and output sizes. On the host (meaning my laptop) everything works fine and the dimensions are correct, but on the Xavier (I cross-compile for Xavier, and all installed package versions such as TensorRT, CUDA and cuDNN are the same) the output dimensions of the network are wrong while the input dimensions are fine.
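
For reference, this is roughly how I compare the two values (a simplified sketch, not my exact code; engine and network are the objects built in profileOnnx() above):

#include <cstdio>
#include <NvInfer.h>

// Print what the network definition and the built engine each report
// for the first output tensor of an implicit-batch network.
void printOutputDims(nvinfer1::INetworkDefinition& network,
                     nvinfer1::ICudaEngine& engine) {
  nvinfer1::Dims net_out = network.getOutput(0)->getDimensions();
  int idx = engine.getBindingIndex(network.getOutput(0)->getName());
  nvinfer1::Dims bind_out = engine.getBindingDimensions(idx);
  // I expect 2x480x720; on the Xavier the output comes back as 2x1x1.
  printf("network output: %d x %d x %d\n", net_out.d[0], net_out.d[1], net_out.d[2]);
  printf("binding output: %d x %d x %d\n", bind_out.d[0], bind_out.d[1], bind_out.d[2]);
}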

So, for example, on the host my input dimensions are 3x480x720 and my output dimensions are 2x480x720 (batch size of 1).
On the Xavier I get input dimensions of 3x480x720 (ok), but the output dimensions are 2x1x1 (which is wrong, although the C channel seems to be ok). This puzzles me a lot, and of course it causes errors in memory allocation and access, since far too little memory is allocated on the device.
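
To show why the wrong dims break the memory handling, this is roughly how the output buffer size is derived from the binding dimensions (again a simplified sketch; output_index is assumed to be the output binding of the deserialized engine):

#include <cuda_runtime_api.h>
#include <NvInfer.h>

// Allocate the device output buffer from whatever the engine reports.
// With 2x480x720 this is 2*480*720 floats; with the bogus 2x1x1 it is only
// 2 floats, so copying the real result later overruns the buffer.
void* allocOutputBuffer(const nvinfer1::ICudaEngine& engine, int output_index, int batch_size) {
  nvinfer1::Dims d = engine.getBindingDimensions(output_index);
  size_t volume = 1;
  for (int i = 0; i < d.nbDims; ++i) {
    volume *= static_cast<size_t>(d.d[i]);
  }
  void* buffer = nullptr;
  cudaMalloc(&buffer, batch_size * volume * sizeof(float));
  return buffer;
}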

I’ll write a short reproducible example with a random network and try to update this post, but I wanted to see if you have any insight first. I tried multiple things, and in other versions (TensorRT 5) this issue didn’t appear.

Environment

TensorRT Version : 6.0.1-1+cuda10.0
GPU Type : GeForce GTX 1060 and Nvidia Jetson AGX Xavier
Nvidia Driver Version : 440.64
CUDA Version : 10.0
CUDNN Version : 7.6.5.32-1+cuda10.0
Operating System + Version : Ubuntu 18.04
Baremetal or Container (if container which image + tag) : Baremetal

Hi,

Is this the desktop environment info?
Could you also tell us which JetPack version you use on the Xavier?

Also, would you mind sharing a simple reproducible source together with the ONNX model so we can reproduce this in our environment?

Thanks.


Hi @AastaLLL , thanks for reaching out. I’m attaching a reproducible example based on the TensorRT samples. I tested this on a Xavier with JetPack 4.3, so TensorRT 6.
The sample is this one: https://drive.google.com/open?id=1rW2vz8TtKLUekwGuMxZpaMR6Q_pukk83

You’ll see that the network output of the ONNX file, as analyzed by Netron, is the following:
[Netron screenshot of the model’s output node]

In the sample I tried to just parse the network and output the sizes:

  assert(network->getNbInputs() == 1);
  mInputDims = network->getInput(0)->getDimensions();
  assert(mInputDims.nbDims == 3);

  assert(network->getNbOutputs() == 1);
  mOutputDims = network->getOutput(0)->getDimensions();
  printf("INPUT DIMS: %d %d %d\n", mInputDims.d[0], mInputDims.d[1],
         mInputDims.d[2]);
  printf("OUTPUT DIMS: %d %d %d\n", mOutputDims.d[0], mOutputDims.d[1],
         mOutputDims.d[2]);

And the output is the following:
[console screenshot showing INPUT DIMS: 3 480 720 and OUTPUT DIMS: 2 1 1]
As you can see, the input dimensions are fine but the output dimensions are wrong, although the channel count is correct. This of course doesn’t throw any error while parsing or building the engine; I just get the warning

WARNING: ONNX model has a newer ir_version (0.0.4) than this parser was built against (0.0.3).

But I always used to get that warning, even on TensorRT 5.
Any insight? This issue is really weird and has been tormenting me for a while.

Hi @AastaLLL,

I’d like to confirm that running the sampleOnnxMNIST sample on a Xavier with JetPack 4.3 fails.

Best regards,
Nico

It turns out I downloaded the wrong MNIST dataset from somewhere else because the download script is broken. This issue is linked to: Slice layer broken on TensorRT 6.0.1-1

Hi,

Just want to confirm that the sampleOnnxMNIST issue is independent of the original dimension issue, is this correct?

Thanks.

Hi @AastaLLL,

That is correct. After installing the extra python packages required to make the download script work and retrieving the correct dataset, the sample runs fine. The dimension issue remains unsolved.

Hi,

Sorry for keeping you waiting.

We tried to run make on your sample on the Xavier platform but found that nothing is compiled.

nvidia@nvidia-desktop:~/topic_118058/smpOnnx/samples$ make
make[1]: Entering directory '/home/nvidia/topic_118058/smpOnnx/samples/sampleOnnx'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/nvidia/topic_118058/smpOnnx/samples/sampleOnnx'

Do we need to set any environment variable first?
Could you help check on this?

Thanks.

Hi @AastaLLL, please do a make clean first.

Also, don’t forget the instructions from the other issue:
Make the project and run it, then change sampleOnnx.cpp line 211 from:
params.onnxFileName = "snet_slice.onnx";
to
params.onnxFileName = "snet_noslice.onnx";
to test both behaviors.

@AastaLLL also, please download the sample from here: https://drive.google.com/file/d/1GUw5pWP73Ej_FmdxjhZVvtwzuIXdtxXh/view
As mentioned in the other issue, the link in this issue works but doesn’t show the difference between using the slice and not using it. I just checked, and this new link doesn’t include the /bin folder, so there is no need to do a make clean first.

Hi,

I checked your ONNX file on both host and device. The outputs are the same on both:

device with TensorRT 6.0.1.

$ ./sample_onnx
WARNING: ONNX model has a newer ir_version (0.0.4) than this parser was built against (0.0.3).
INPUT DIMS: 3 480 720
OUTPUT DIMS: 2 1 1

host with TensorRT 6.0.1

$ ./sample_onnx
WARNING: ONNX model has a newer ir_version (0.0.4) than this parser was built against (0.0.3).
WARNING: TensorRT was linked against cuBLAS 10.2.0 but loaded cuBLAS 10.1.0
WARNING: TensorRT was linked against cuBLAS 10.2.0 but loaded cuBLAS 10.1.0
INPUT DIMS: 3 480 720
OUTPUT DIMS: 2 1 1

The output dimensions are the same in our environment.
Did we miss anything?

We also checked your model with trtexec and the output dims are also [2,1,1].
Thanks.

Yes, that is exactly the issue @AastaLLL. Please check Slice layer broken on TensorRT 6.0.1-1, and download the files from THIS LINK (you are using an old link): https://drive.google.com/file/d/1GUw5pWP73Ej_FmdxjhZVvtwzuIXdtxXh/view
As I mentioned in the other issue, the reported output sizes do not match the correct output sizes; it’s better explained there, but in short the slice layer is changing the dimensions of the output. Check the other issue and the new file so you can test the behavior of the network both with and without the slice.
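
To be concrete about what I expect from a slice: with the TensorRT API, a slice that keeps two of three channels of a 3x480x720 tensor should report 2x480x720 from getDimensions() at build time. This is only a toy illustration with made-up start/size values, not my real model:

#include <cstdio>
#include <NvInfer.h>

// Toy example: slice the first 2 channels out of a 3x480x720 input and
// print the dimensions TensorRT reports for the slice output.
void sliceDimsDemo(nvinfer1::INetworkDefinition& network) {
  nvinfer1::ITensor* input =
      network.addInput("input", nvinfer1::DataType::kFLOAT, nvinfer1::Dims3(3, 480, 720));
  nvinfer1::ISliceLayer* slice = network.addSlice(
      *input,
      nvinfer1::Dims3(0, 0, 0),     // start
      nvinfer1::Dims3(2, 480, 720), // size: this is what the output dims should be
      nvinfer1::Dims3(1, 1, 1));    // stride
  network.markOutput(*slice->getOutput(0));

  nvinfer1::Dims d = slice->getOutput(0)->getDimensions();
  printf("slice output: %d x %d x %d\n", d.d[0], d.d[1], d.d[2]); // should be 2 x 480 x 720
}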

Hi,

Sorry for the late update.

We tested your new link and got the following output:

1x1x98x98 without the slice layer.
1x1x50x50 with the slice layer.

Is this identical to your observation?
If not, would you mind checking the sample with JetPack 4.4 to see if it helps?
Our result is generated with trtexec from TensorRT 7.1 in JetPack 4.4.

Thanks.

Hi @AastaLLL, we found the issue on JetPack 4.3 and TensorRT 6, as I mentioned in the other issue. I just ended up discarding the slice layer; I guess it was fixed in TRT 7, and since we use JetPack 4.3, I think that fix doesn’t apply to us yet. Regards

Hi,

Just want to clarify.
Is the output dimension from TensorRT 7.1 correct?

Thanks.

I didn’t test on TRT 7; as I said, I tested in an NGC container with TRT 6. I will test on TRT 7 later, but on TRT 6, which is what ships with JetPack 4.3, the issue was present.

Hi,

Sorry for the unclear statement.

We tried TensorRT 7.1 on the model you provided and got the following output dimensions:

1x1x98x98 without the slice layer.
1x1x50x50 with the slice layer.

Is this the expected output dimension?
Thanks.

With TensorRT 7.1 it seems to be accurate, but you didn’t test on TensorRT 6, which is where I reported the issue appearing.