How to do Batch inference on the Xavier

Hey all,

I’ve been trying to get my Xavier to perform inference with a batch size > 1. I tried adapting the code from here: GitHub - NVIDIA-AI-IOT/tf_to_trt_image_classification: Image classification with NVIDIA TensorRT from TensorFlow models, but it always fails to classify ImageNet images with vgg_16 when I have a batch size > 1. I have set and enabled a batch size of 2 on both the plan and the engine prior to execution, so I think I am doing the I/O for the CUDA buffers wrong. I have tried using a vector, an array, and an allocated block of memory.

What is the correct/best way to store multiple images in the input buffer for execution?

Hi,

The best way is to extend the input/output tensor from 1xCxHxW to NxCxHxW.

Please note that if your output layer has a 1000x1x1 dimension, inference with batch size = 2 will return a 2x1000x1x1 tensor, and the probabilities for the second image are at indices 1000-1999.
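For example, to pick out the result for the second image you can index into the flat output buffer like this (just a rough sketch; the helper name and the 1000-class size are only assumptions for illustration):

#include <algorithm>

// Sketch: top-1 class for image n, given a flat output buffer laid out as
// N x numClasses (image n occupies indices n*numClasses .. (n+1)*numClasses - 1)
int argmaxForImage(const float* output, int n, int numClasses)
{
    const float* probs = output + n * numClasses;
    return std::max_element(probs, probs + numClasses) - probs;
}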

Thanks.

Hey Aasta,

I understand theoretically how I’m supposed to extend my input/output tensor, but I fail to get it to work in practice. I’m doing this in C++, and I’m a bit new to the language; what is the best data structure to store the tensor in? The example code above uses a float pointer for CxHxW, and my attempts to extend this as an array of float pointers to form the NxCxHxW have largely failed. What am I missing in the formatting?

Thanks,

Hi,

You can do this by allocating an N times larger buffer and putting each image at the k*size position.
Here is a related sample for your reference:
https://github.com/dusty-nv/jetson-inference/blob/master/tensorNet.cpp#L849
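Something like the following (a minimal sketch only, assuming the images are already preprocessed into planar CHW float arrays; the function and variable names are placeholders):

#include <cuda_runtime.h>
#include <cstring>

// Sketch: pack batchSize preprocessed CHW float images into one page-locked,
// device-mapped buffer, with image k starting at offset k * imageSize
float* makeBatchBuffer(float** images, int batchSize, size_t imageSize)
{
    float* batchBuffer = nullptr;
    cudaHostAlloc((void**)&batchBuffer,
                  batchSize * imageSize * sizeof(float),
                  cudaHostAllocMapped);

    for (int k = 0; k < batchSize; k++)
        std::memcpy(batchBuffer + k * imageSize, images[k],
                    imageSize * sizeof(float));

    return batchBuffer;
}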

Thanks.

Hey Aasta,

That actually appears to be what I’m doing in my code below:

const size_t height = image_vect[0].rows;
const size_t width = image_vect[0].cols;
const size_t channels = image_vect[0].channels();
const size_t numel = height * width * channels * batchsize;    // total elements for the whole batch

const size_t stridesCv[3] = { width * channels, channels, 1 }; // HWC strides of the source cv::Mat
const size_t strides[3] = { height * width, width, 1 };        // CHW strides of the target tensor

float * tensor;

// page-locked, device-mapped host buffer large enough for all batchsize images
cudaHostAlloc((void**)&tensor, numel * sizeof(float), cudaHostAllocMapped);

However, when I do this, the output class I get for both images when I run inference is wrong. This leads me to believe that my data is being stored incorrectly.

// copy each HWC uchar image from OpenCV into the (intended) batched CHW float tensor
for (int x = 1; x < batchsize + 1; x++) {
  for (int i = 0; i < height; i++)
  {
    for (int j = 0; j < width; j++)
    {
      for (int k = 0; k < channels; k++)
      {
        const size_t offsetCv = i * stridesCv[0] + j * stridesCv[1] + k * stridesCv[2];
        const size_t offset = x * (k * strides[0] + i * strides[1] + j * strides[2]);
        tensor[offset] = (float) image_vect[x-1].data[offsetCv];
      }
    }
  }
}

This is the code I use to store the images into the allocated data space; do you see anything wrong with this implementation? image_vect is just a vector of cv::Mat structures containing the images I read in as inputs.

Let me know what you think. Thanks for sticking with me.

Hi,

Would you mind checking whether the output is in NHWC format?
If the model includes an NHWC-dependent operation, TensorRT will automatically add a format converter to ensure the output is correct.
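If it helps, you can dump the binding dimensions of the deserialized engine to verify the layout (rough sketch; the function name is mine, and it assumes you already have an ICudaEngine pointer):

#include <NvInfer.h>
#include <iostream>

// Sketch: print every binding's dimensions (e.g. 3x224x224 vs 224x224x3)
// so the expected layout of the input/output tensors can be confirmed
void printBindings(nvinfer1::ICudaEngine* engine)
{
    for (int i = 0; i < engine->getNbBindings(); i++)
    {
        nvinfer1::Dims d = engine->getBindingDimensions(i);
        std::cout << (engine->bindingIsInput(i) ? "input  " : "output ")
                  << engine->getBindingName(i) << ": ";
        for (int j = 0; j < d.nbDims; j++)
            std::cout << d.d[j] << (j + 1 < d.nbDims ? "x" : "\n");
    }
}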

Thanks.