Building TensorRT int8 for batch greater than 1 fails

Description

Building an INT8 engine with a custom calibrator fails during calibration as soon as the batch size is greater than 1.

Environment

TensorRT Version: 5.1.5.0
GPU Type: RTX-2080
Nvidia Driver Version: 450.102.04
CUDA Version: 10.1
CUDNN Version: 7.5
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.6
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.4
Baremetal or Container (if container which image + tag):

Hello, I successfully built and ran an INT8 engine for my custom model. However, when I tried to increase the batch size, the calibration process fails with the following error:

[TensorRT] ERROR: engine.cpp (572) - Cuda Error in commonEmitTensor: 1 (invalid argument)
[TensorRT] ERROR: Failure while trying to emit debug blob.
engine.cpp (572) - Cuda Error in commonEmitTensor: 1 (invalid argument)
[TensorRT] ERROR: cuda/caskConvolutionLayer.cpp (355) - Cuda Error in execute: 1 (invalid argument)
[TensorRT] ERROR: cuda/caskConvolutionLayer.cpp (355) - Cuda Error in execute: 1 (invalid argument)

The same error repeats for every batch fed to the calibrator.

My code below:

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, cfg, seq_list, cache_file):
        # Whenever you specify a custom constructor for a TensorRT class,
        # you MUST explicitly call the constructor of the parent.
        trt.IInt8EntropyCalibrator2.__init__(self)

        self.batch_size = 3
        self.batch_shape = (self.batch_size, IMG_CH, IMG_H, IMG_W)
        self.cache_file = cache_file

        self.cfg = cfg

        self.seq_list = seq_list
        self.frames_per_seq = list()
        self.delution_factor = cfg['delution_factor']
        for seq in seq_list:
            lidar_list = sorted([cfg['dataset_dir'] + seq + '/LIDAR_TOP/data/' + f.strip()
                                 for f in open(cfg['dataset_dir'] + seq + '/LIDAR_TOP/samples.txt', 'r').readlines()])
            self.frames_per_seq.append(len(lidar_list))
        self.current_seq = 0

        self.counter = 0  # for keeping track of how many files we have read

        self.device_input = cuda.mem_alloc(trt.volume(self.batch_shape) * trt.float32.itemsize)

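As a sanity check on the allocation above (a minimal NumPy-only sketch, using hypothetical values for IMG_CH, IMG_H, and IMG_W since they are not shown in the snippet), the device buffer must cover the full batch, and every later host-to-device copy must match this size exactly:

```python
import numpy as np

# Hypothetical image dimensions, for illustration only.
IMG_CH, IMG_H, IMG_W = 3, 480, 640
batch_size = 3
batch_shape = (batch_size, IMG_CH, IMG_H, IMG_W)

# NumPy equivalent of trt.volume(batch_shape) * trt.float32.itemsize:
n_bytes = int(np.prod(batch_shape)) * np.dtype(np.float32).itemsize
print(n_bytes)  # size the host buffer in get_batch must match
```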
Inside the get_batch I use the following code:

    depthnet_input_batch = np.zeros((self.batch_size, IMG_H * IMG_W * IMG_CH), dtype=np.float32)
    for i, cam_data in enumerate(cameras_data):
        img = cam_data['img_data'].data.cpu().numpy()
        img = img.squeeze()
        img = img.transpose((2, 0, 1))
        img = img.ravel()
        img = np.ascontiguousarray(img)
        depthnet_input_batch[i, :] = img

    depthnet_input_batch = np.asarray(depthnet_input_batch[:self.batch_size]).ravel()

    cuda.memcpy_htod(self.device_input, depthnet_input_batch)
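A CUDA "invalid argument" during memcpy_htod is commonly caused by the host buffer's byte size not matching the device allocation. Below is a NumPy-only sketch of the packing step above (with hypothetical dimensions and random arrays standing in for the camera frames, so it runs without CUDA) that checks this invariant before the copy would happen:

```python
import numpy as np

# Hypothetical sizes, for illustration only.
IMG_CH, IMG_H, IMG_W = 3, 480, 640
batch_size = 3

# Simulated per-camera frames as they would come out of PyTorch, in HWC layout.
cameras_data = [np.random.rand(IMG_H, IMG_W, IMG_CH).astype(np.float32)
                for _ in range(batch_size)]

batch = np.zeros((batch_size, IMG_CH * IMG_H * IMG_W), dtype=np.float32)
for i, img in enumerate(cameras_data):
    # HWC -> CHW, then flatten into one row of the batch buffer.
    batch[i, :] = img.transpose((2, 0, 1)).ravel()

host = np.ascontiguousarray(batch.ravel())

# The htod copy only succeeds if this matches the cuda.mem_alloc size.
expected_bytes = batch_size * IMG_CH * IMG_H * IMG_W * np.dtype(np.float32).itemsize
assert host.nbytes == expected_bytes
```

If the loop fills fewer rows than batch_size (e.g. fewer cameras than expected), the buffer is still the full size, so the copy succeeds but calibrates on zero padding; a size mismatch, by contrast, produces exactly the "invalid argument" error shown above.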

Hi @spivakoa,

For your reference, here is the documentation for the sample "Inference In INT8 Using Custom Calibration":
https://docs.nvidia.com/deeplearning/tensorrt/sample-support-guide/index.html#int8_sample

Thank you.