I want to convert Llama 7B (fp16=True) on an A10 (24 GB), but I always hit an out-of-GPU-memory (OOM) issue. I can convert a smaller model (e.g. ~1B parameters, by reducing the number of hidden layers).
It would be useful to know roughly how much GPU memory (and a breakdown of that memory) TensorRT needs to convert a model with X parameters to a TensorRT engine at fp16 precision.
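For what it's worth, here is the back-of-envelope estimate I've been using. This is not an official TensorRT formula; the 2.5x overhead factor is my assumption (the builder seems to hold the weights, per-layer tactic workspaces, and the serialized engine at the same time), but it is consistent with 7B failing on 24 GB while ~1B succeeds:

```python
def estimate_trt_build_memory_gb(num_params: float,
                                 bytes_per_param: int = 2,       # fp16
                                 overhead_factor: float = 2.5):  # assumed: weights + workspace + engine copy
    """Rough GPU-memory estimate for building a TensorRT engine.

    Heuristic only: during the build, TensorRT roughly keeps the
    weights, tactic workspaces, and the serialized engine resident
    at once, so total usage is a small multiple of the weight size.
    """
    weight_bytes = num_params * bytes_per_param
    return weight_bytes * overhead_factor / 1024**3

print(f"7B fp16: ~{estimate_trt_build_memory_gb(7e9):.0f} GB")  # well above an A10's 24 GB
print(f"1B fp16: ~{estimate_trt_build_memory_gb(1e9):.0f} GB")  # fits comfortably
```

By this estimate the 7B fp16 weights alone are ~14 GB, so even a modest build-time overhead pushes past 24 GB.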
TensorRT Version: 8.6.1 (bundled in the 23.08-py3 container)
GPU Type: A10
Nvidia Driver Version: 535.54.03
CUDA Version: 12.2.1
CUDNN Version: 8.9.4
Operating System + Version: Ubuntu 22.04
Python Version (if applicable): 3.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tensorrt:23.08-py3
I cloned the TensorRT repo into my own fork and followed the GPT-2 demo code to convert Llama.
The notebook code is here: https://github.com/jalola/TensorRT/blob/hungnguyen/llama_trt_demo/demo/HuggingFace/notebooks/llama-pretrained.ipynb
- Start an A10 GPU instance.
- Clone two repos:
  - jalola/transformers (checkout branch hung/llama_trt_convert)
  - jalola/TensorRT (checkout branch hungnguyen/llama_trt_demo)
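The two clones above can each be done in one step by checking out the branch at clone time (destination paths are up to you):

```shell
git clone -b hung/llama_trt_convert https://github.com/jalola/transformers.git
git clone -b hungnguyen/llama_trt_demo https://github.com/jalola/TensorRT.git
```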
- Run the container & install transformers in editable mode
docker run --gpus all -it -p 8888:8888 --rm -v /home/:/home/ nvcr.io/nvidia/tensorrt:23.08-py3 bash
cd <path to repo>/transformers
pip install -e ".[dev]"
pip install jupyterlab
pip install onnx_graphsurgeon --extra-index-url https://pypi.ngc.nvidia.com
- Run the notebook server
cd <path to repo>/TensorRT/demo/HuggingFace
jupyter lab --allow-root --ip 0.0.0.0 --port 8888
- Convert model
If you have enough GPU memory and RAM, you can run the whole notebook end to end. Right now, for a big model like Llama 7B, I convert the model to ONNX on a bigger GPU (A40 or A100) and copy the ONNX file to the A10. Then, in the notebook, I skip the ONNX export step and build the TensorRT engine directly from the ONNX file.
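The ONNX-to-engine step can be sketched with the standard TensorRT Python API (TRT 8.x), outside the notebook's own demo classes. File paths and the 20 GB workspace cap here are placeholders, not values from the notebook:

```python
def build_engine_from_onnx(onnx_path: str, engine_path: str, workspace_gb: int = 20) -> None:
    """Build an fp16 TensorRT engine directly from an ONNX file (minimal sketch)."""
    import tensorrt as trt  # lazy import so the sketch is readable without TensorRT installed

    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)
    # ONNX models require an explicit-batch network
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    # parse_from_file resolves external weight files stored next to the .onnx,
    # which large exports like Llama 7B use
    if not parser.parse_from_file(onnx_path):
        errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
        raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)
    # Cap the tactic workspace so the build leaves room for the weights on a 24 GB GPU
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_gb << 30)

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("Engine build failed (often an OOM during tactic timing)")
    with open(engine_path, "wb") as f:
        f.write(bytes(serialized))
```

Lowering the workspace limit can avoid some build-time OOMs, but the fp16 weights themselves (~14 GB for 7B) still have to fit alongside it, which is why the A10 remains tight even with the ONNX pre-built elsewhere.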