Convert ONNX to INT8 TRT engine

I converted an ONNX model to an FP32 TRT engine and it works, but when I convert the same ONNX model to an INT8 TRT engine, I get the errors below:
[10/12/2023-17:31:46] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation (CUDA C Programming Guide).
[10/12/2023-17:31:46] [TRT] [I] Starting Calibration.
[10/12/2023-17:33:14] [TRT] [E] 1: [executionContext.cpp::executeInternal::1177] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[10/12/2023-17:33:14] [TRT] [E] 1: [resizingAllocator.cpp::deallocate::105] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[10/12/2023-17:33:14] [TRT] [E] 1: [resizingAllocator.cpp::deallocate::105] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[10/12/2023-17:33:14] [TRT] [E] 1: [resizingAllocator.cpp::deallocate::105] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[10/12/2023-17:33:14] [TRT] [E] 1: [resizingAllocator.cpp::deallocate::105] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[10/12/2023-17:33:14] [TRT] [E] 1: [resizingAllocator.cpp::deallocate::105] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[10/12/2023-17:33:14] [TRT] [E] 1: [resizingAllocator.cpp::deallocate::105] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[10/12/2023-17:33:14] [TRT] [E] 1: [resizingAllocator.cpp::deallocate::105] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[10/12/2023-17:33:14] [TRT] [E] 1: [resizingAllocator.cpp::deallocate::105] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[10/12/2023-17:33:14] [TRT] [E] 1: [resizingAllocator.cpp::deallocate::105] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[10/12/2023-17:33:14] [TRT] [E] 1: [resizingAllocator.cpp::deallocate::105] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[10/12/2023-17:33:14] [TRT] [E] 1: [resizingAllocator.cpp::deallocate::105] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[10/12/2023-17:33:14] [TRT] [E] 3: [engine.cpp::~Engine::298] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/engine.cpp::~Engine::298, condition: mExecutionContextCounter.use_count() == 1. Destroying an engine object before destroying the IExecutionContext objects it created leads to undefined behavior.
)
[10/12/2023-17:33:14] [TRT] [E] 1: [cudaDriverHelpers.cpp::operator()::94] Error Code 1: Cuda Driver (an illegal memory access was encountered)
[10/12/2023-17:33:14] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaStream::47] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[10/12/2023-17:33:14] [TRT] [E] 2: [calibrator.cpp::calibrateEngine::1181] Error Code 2: Internal Error (Assertion context->executeV2(&bindings[0]) failed. )

What might be the issue, and how can I fix it?
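For reference, the INT8 engine is built through the standard TensorRT Python API flow, with a calibrator attached to the builder config. A minimal sketch of that flow (the file path and the calibrator class name are placeholders, not my exact script):

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open('model.onnx', 'rb') as f:  # placeholder path
    assert parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = MyCalibrator(...)  # placeholder custom calibrator

# The model has dynamic input shapes, so min/opt/max shapes are set per
# input, and the same profile is also registered for calibration.
profile = builder.create_optimization_profile()
# profile.set_shape(name, min_shape, opt_shape, max_shape) for each input
config.add_optimization_profile(profile)
config.set_calibration_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)

(Separately, the lazy-loading warning at the top of the log is benign; on CUDA 11.7+ it can be silenced by exporting CUDA_MODULE_LOADING=LAZY before running.)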

Hi @Will99,
What version of TRT are you using?
Also, could you please share the ONNX model and repro steps with us?

Thanks

engine_and_onnx.zip (45.7 MB)

Hello @AakankshaS
The attachment contains the ONNX file and the corresponding FP32 engine that I created.
My environment info is below, including the TensorRT version, along with some other details that might be useful.

Thanks.

[11/01/2023-14:02:35] [TRT] [I] ONNX IR version: 0.0.6
[11/01/2023-14:02:35] [TRT] [I] Opset version: 11
[11/01/2023-14:02:35] [TRT] [I] Producer name: pytorch
[11/01/2023-14:02:35] [TRT] [I] Producer version: 2.0.0
[11/01/2023-14:02:35] [TRT] [I] Domain:
[11/01/2023-14:02:35] [TRT] [I] Model version: 0
[11/01/2023-14:02:35] [TRT] [I] Doc string:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P4000        Off  | 00000000:02:00.0  On |                  N/A |
| 55%   58C    P8    10W / 105W |    495MiB /  8111MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2651      G   /usr/lib/xorg/Xorg                216MiB |
|    0   N/A  N/A      2897      G   /usr/bin/gnome-shell               49MiB |
|    0   N/A  N/A     11057      G   /proc/self/exe                     59MiB |
|    0   N/A  N/A   2950504      G   ...205442604626590831,262144       54MiB |
|    0   N/A  N/A   3653051      G   ...RendererForSitePerProcess       87MiB |
+-----------------------------------------------------------------------------+

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

TensorRT-8.6.1.6
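For context, the calibrator is a subclass of trt.IInt8EntropyCalibrator2. A rough skeleton showing where get_batch sits (the constructor fields mirror what get_batch uses; everything else here is an assumption):

import tensorrt as trt
import pycuda.autoinit  # noqa: F401 -- creates the CUDA context used below
import pycuda.driver as cuda

class Calibrator(trt.IInt8EntropyCalibrator2):

    def __init__(self, calib_data, input_shapes):
        super().__init__()
        self.calib_data = calib_data        # per-sample input groups, keyed by str(index)
        self.input_shapes = input_shapes    # {name: {'opt_shape': [...]}}
        self.dataset_length = len(calib_data)
        self.count = 0
        self.buffers = {}                   # input name -> device pointer

    def get_batch_size(self):
        return 1

    # def get_batch(self, names, **kwargs): ...  (shown below)

    def read_calibration_cache(self):
        return None   # no cache: always calibrate from scratch

    def write_calibration_cache(self, cache):
        pass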

Below is the get_batch function defined in the calibrator:

# Imports needed by this snippet (the method lives on the calibrator class).
from typing import Sequence

import numpy as np
import pycuda.driver as cuda


def get_batch(self, names: Sequence[str], **kwargs) -> list:
    """Return device pointers for one calibration batch, or None when done."""
    if self.count >= self.dataset_length:
        # Returning None tells TensorRT there are no more calibration batches.
        return None

    input_group = self.calib_data[str(self.count)]
    ret = []
    for name in names:
        data_np = input_group[name][...].astype(np.float32)

        # Tile the tensor up to the optimization-profile 'opt' shape so the
        # calibration batch keeps the same value distribution as the sample.
        opt_shape = self.input_shapes[name]['opt_shape']
        reps = [
            int(np.ceil(opt_s / data_s))
            for opt_s, data_s in zip(opt_shape, data_np.shape)
        ]
        data_np = np.tile(data_np, reps)

        # Crop back down to exactly the opt shape.
        slice_list = tuple(slice(0, end) for end in opt_shape)
        data_np = np.ascontiguousarray(data_np[slice_list])

        # Allocate a device buffer for this input and copy the batch over.
        # (The original repeated this block verbatim for 'voxels',
        # 'num_points' and 'coors'; one path handles every input and avoids
        # a KeyError for any name outside those three.)
        self.buffers[name] = cuda.mem_alloc(data_np.nbytes)
        cuda.memcpy_htod(self.buffers[name], data_np)
        ret.append(int(self.buffers[name]))

    self.count += 1
    return ret
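One thing worth double-checking in this get_batch: it allocates a fresh device buffer for every input on every call and never frees the previous one, and it casts every input to float32, even though inputs such as num_points and coors are commonly integer-typed in point-cloud models. If a buffer ends up smaller than what the engine binding actually reads during calibration (wrong dtype width or wrong shape), the executeV2 call inside calibrateEngine can read out of bounds, which is consistent with the illegal-memory-access errors above. A sketch of a safer upload helper (the dtype map is an assumption about this particular model and should be verified against the ONNX inputs):

import numpy as np
import pycuda.driver as cuda

# Assumed per-input dtypes; verify against the ONNX model's declared inputs.
INPUT_DTYPES = {
    'voxels': np.float32,
    'num_points': np.int32,
    'coors': np.int32,
}

def _upload(self, name: str, data_np: np.ndarray) -> int:
    """Copy one input into a reusable device buffer and return its pointer."""
    data_np = np.ascontiguousarray(
        data_np.astype(INPUT_DTYPES.get(name, np.float32)))
    if name not in self.buffers:
        # Allocate once at the fixed opt-shape size and reuse it for every
        # batch, instead of leaking a new allocation on each get_batch call.
        self.buffers[name] = cuda.mem_alloc(data_np.nbytes)
    cuda.memcpy_htod(self.buffers[name], data_np)
    return int(self.buffers[name])

With a helper like this, each loop iteration in get_batch reduces to ret.append(self._upload(name, data_np)).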