Int8 Calibration Failed on custom layers

I created a custom ONNX model along with the corresponding TensorRT plugins, and I can successfully convert the ONNX model to a TensorRT engine in FP32 and FP16 mode.

But when I try to run INT8 calibration directly on my model, I get the following error:

[2020-02-28 05:23:22 ERROR] FAILED_ALLOCATION: std::exception
[2020-02-28 05:23:22 ERROR] Requested amount of memory (18446744065119617096 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[2020-02-28 05:23:22 ERROR] /home/jenkins/workspace/TensorRT/helpers/rel-6.0/L1_Nightly/build/source/rtSafe/resources.h (57) - OutOfMemory Error in CpuMemory: 0

[2020-02-28 05:23:22 ERROR] FAILED_ALLOCATION: std::exception
[2020-02-28 05:23:22 ERROR] Requested amount of memory (18446744065119617096 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[2020-02-28 05:23:22 ERROR] /home/jenkins/workspace/TensorRT/helpers/rel-6.0/L1_Nightly/build/source/rtSafe/resources.h (57) - OutOfMemory Error in CpuMemory: 0
[2020-02-28 05:23:22 ERROR] FAILED_ALLOCATION: std::exception
[2020-02-28 05:23:22 ERROR] Requested amount of memory (18446744065119617096 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
terminate called after throwing an instance of 'std::out_of_range'
what(): _Map_base::at

The weird thing is that calibration succeeds if I give the program a fake calibration cache. This cache was generated by calibrating only a subset of the whole model (say, the backbone).

My machine has 64 GB of memory, so I am confused about why this allocation exception happens.

Can anyone help? Thanks.
If you need more information, please let me know.

Hi,

Could you please provide details on the platforms you are using:
o Linux distro and version
o GPU type
o NVIDIA driver version
o CUDA version
o cuDNN version
o Python version [if using Python]
o TensorFlow and PyTorch version
o TensorRT version
If possible, please share the script & model file to reproduce the issue.

Thanks

Hi,

  • OS: Ubuntu 16.04
  • GPU: 2080ti
  • driver: 418.56
  • CUDA version: 10.1
  • CUDNN:
  • Python: 3.7
  • PyTorch: 1.4.0
  • TensorRT Version: 6.0.1.5

Could you please share the script & model file to reproduce the issue?

Thanks

Sorry, but I cannot provide any scripts.

Hi,

Based on the log snippet, it seems to be trying to allocate an unrealistic amount of memory:
“Requested amount of memory (18446744065119617096 bytes) could not be allocated.”
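
For what it is worth, the requested size is just under 2^64, which is what a negative 64-bit value looks like when it is printed as unsigned. The short Python check below is only a sketch (the helper name as_signed_64 is made up for illustration); it reinterprets the number from the log as a signed 64-bit integer:

```python
import struct

# Value copied from the "Requested amount of memory" line in the log.
requested = 18446744065119617096


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as signed (two's complement)."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


signed = as_signed_64(requested)
print(signed)          # -8589934520
print(signed / 2**30)  # size in GiB, roughly -8
```

The result is about -8 GiB, so one possible explanation (an assumption, not something the log confirms) is that a size computed somewhere, for example from a custom plugin's output dimensions or workspace size, went negative before it reached the allocator. In that case the amount of free memory would not matter.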

Without a script to reproduce and debug the issue, I can only suggest the following:

  1. Use cuda-memcheck (https://docs.nvidia.com/cuda/cuda-memcheck/index.html) to find the root cause of any memory issue in the code.
  2. Can you try running the code on a clean system?
  3. Can you try running “sudo systemctl stop lightdm” and try again? That should free up some system memory by stopping the graphical display.

Thanks

It seems the error has nothing to do with GPU memory.

I monitored both host memory and GPU memory usage while the program was running, and neither was close to full.