How much GPU memory does TensorRT need to convert a model (e.g. Llama 7b with fp16)


I want to convert Llama 7b (fp16=True) on an A10 (24 GB), but I always hit an out-of-GPU-memory (OOM) error. I can convert a smaller model (e.g. ~1B parameters, by reducing the number of hidden layers).
It would be useful to know roughly how much GPU memory (and its breakdown) TensorRT needs to convert a model with X parameters to a TensorRT engine at fp16 precision.


TensorRT Version:
GPU Type: A10
Nvidia Driver Version: 535.54.03
CUDA Version: 12.2.1
CUDNN Version: 8.9.4
Operating System + Version: Ubuntu 22.04
Python Version (if applicable): 3.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

I cloned the TensorRT repo and followed its code for converting GPT-2.
The notebook code is here:

Steps To Reproduce

  1. Open an A10 GPU instance.
  2. Clone the 2 repos:
  3. Run the container & install transformers in editable mode:
docker run --gpus all -it -p 8888:8888 --rm -v /home/:/home/ bash
cd <path to repo>/transformers
pip install -e ".[dev]"
pip install jupyterlab
pip install onnx_graphsurgeon --extra-index-url
  4. Run the notebook server:
cd <path to repo>/TensorRT/demo/HuggingFace
jupyter lab --allow-root --ip --port 8888
  5. Convert the model:
    If you have enough GPU memory and RAM, you can run the whole notebook. Right now, for a big model like Llama 7b, I convert the model to an ONNX file on a bigger GPU (A40 or A100) and copy the ONNX file over to the A10 machine. Then, in the notebook, I skip the export-to-ONNX step and convert to TensorRT directly from the ONNX file.
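For reference, the ONNX-to-TensorRT step from the notebook can be sketched with the TensorRT Python API roughly as below. This is a minimal sketch, not the notebook's exact code: the file paths are placeholders, and the demo's own build script adds profiles/options this omits.

```python
import tensorrt as trt  # requires a local TensorRT installation

LOGGER = trt.Logger(trt.Logger.WARNING)

def build_fp16_engine(onnx_path: str, engine_path: str) -> None:
    """Parse an ONNX file and serialize an fp16 TensorRT engine."""
    builder = trt.Builder(LOGGER)
    # Explicit-batch network is required for ONNX models
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("failed to parse ONNX model")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # build with fp16 precision enabled

    # This is the step that allocates the large build-time GPU memory
    engine = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(engine)

if __name__ == "__main__":
    build_fp16_engine("llama_7b.onnx", "llama_7b.plan")  # placeholder paths
```

Equivalently, the stock `trtexec --onnx=llama_7b.onnx --fp16 --saveEngine=llama_7b.plan` CLI does the same build without any Python code.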

LLaMA 7b itself in fp16 is ~14 GB. As a rough (not guaranteed) estimate, you need at least 2 × 14 = 28 GB of GPU memory to convert the ONNX model into a TRT engine (this is documented in the latest branch README). An A10 GPU has only 24 GB, so it is sometimes not enough for the conversion. I would also recommend trying TensorRT-LLM.
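The 2x rule of thumb above works out like this (decimal GB, 2 bytes per fp16 parameter; it is only a heuristic, not a guarantee):

```python
def fp16_weight_gb(n_params: float) -> float:
    """fp16 stores 2 bytes per parameter; report decimal gigabytes."""
    return n_params * 2 / 1e9

def conversion_estimate_gb(n_params: float) -> float:
    """Rule of thumb from this thread: the build needs ~2x the weight size."""
    return 2 * fp16_weight_gb(n_params)

llama_7b = 7e9  # 7 billion parameters
print(f"fp16 weights:       ~{fp16_weight_gb(llama_7b):.0f} GB")       # ~14 GB
print(f"conversion (rough): ~{conversion_estimate_gb(llama_7b):.0f} GB")  # ~28 GB
```

By this estimate a 24 GB A10 falls a few GB short of the ~28 GB the 7B build wants, which matches the OOM you are seeing.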

Thanks luxiaoz,
I was thinking the same: TensorRT needs about 2x the model size in memory to convert a model to a TensorRT engine (similar to converting to ONNX, which I also see take around 2x the model size).

Now that TensorRT-LLM is released, I will try it :)