INT8 and INT4 performance on ORIN AGX

My ORIN AGX developer kit has the following specs:

JetPack: 6.0
L4T: 36.3.0
CUDA: 12.2
PyTorch: 2.3.0

While running some LLM inference code locally using the Transformers library, with bitsandbytes quantizing the models to INT8 and INT4, I noticed that the GPU is not being fully utilized (by contrast, it stays at 99% when performing inference in FP16).

Is there something I need to do to get proper INT8 and INT4 quantized inference performance?

Quantizing has reduced the model’s memory footprint, but inference latency has increased and the tokens/sec throughput has dropped considerably.
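For reference, tokens/sec here means newly generated tokens divided by wall-clock generation time; a minimal sketch of that measurement (function and variable names are illustrative, not my exact logging script):

import time

import torch

def measure_tokens_per_sec(model, tokenizer, prompt, max_new_tokens=128):
    # Tokenize the prompt and move it to the GPU
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    torch.cuda.synchronize()
    start = time.time()
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.time() - start

    # Count only the newly generated tokens
    new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
    return new_tokens / elapsed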

I can attach a screenshot of the CSV file where I am logging the stats from tegrastats.

Here is how I am initializing my model:

bnb_config = BitsAndBytesConfig(load_in_8bit=True,)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    cache_dir=cache_dir,
    quantization_config=bnb_config,
    device_map="cuda"
)
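For the INT4 runs, only the quantization config changes. A minimal sketch of what that looks like (the NF4 quant type and float16 compute dtype here are illustrative choices):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config: NF4 weights with a float16 compute dtype
bnb_config_int4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    cache_dir=cache_dir,
    quantization_config=bnb_config_int4,
    device_map="cuda",
)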

Dear @mayankarya,
Could you share the complete code and steps to reproduce the issue?
Note that the tegrastats GPU utilization indicates the percentage of active samples out of the total queried samples.

Hey, I am running a batched LLM inference workload in the following manner:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

cache_dir = "<path to where you'd like the model to be stored>"

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

# Use the 8-bit loading with `bitsandbytes`
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    cache_dir=cache_dir,
    quantization_config=bnb_config,
    device_map="cuda"    # Place the whole model on the GPU
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B", cache_dir=cache_dir)

# Perform inference
inputs = tokenizer("What is artificial intelligence?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))

To perform batched inference, pass a list of input prompts as the first argument to the tokenizer call below (see the sketch after it):

inputs = tokenizer("What is artificial intelligence?", return_tensors="pt").to("cuda")
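Concretely, the batched variant looks roughly like this (a sketch with example prompts; note that the Llama tokenizer defines no pad token by default, so one has to be set for padding to work):

# Example prompts for a batched run
prompts = [
    "What is artificial intelligence?",
    "Explain quantization in one sentence.",
]

# The Llama tokenizer has no pad token, so reuse EOS for padding;
# left padding is recommended for decoder-only generation
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**inputs, max_length=50)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))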

While this runs, I log system info in a separate terminal using sudo tegrastats and extract the GPU utilization values (the ones shown in the screenshot of the CSV file).
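The GR3D_FREQ value from tegrastats can be pulled into a CSV with a small script along these lines (a sketch assuming the default tegrastats output format, where GPU load appears as a GR3D_FREQ <n>% field; file name and interval are placeholders):

import csv
import re
import subprocess
import time

# Start tegrastats and pull the GR3D_FREQ (GPU load) value out of each line
proc = subprocess.Popen(
    ["sudo", "tegrastats", "--interval", "1000"],
    stdout=subprocess.PIPE,
    text=True,
)

with open("gpu_util.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "gpu_util_percent"])
    for line in proc.stdout:
        match = re.search(r"GR3D_FREQ (\d+)%", line)
        if match:
            writer.writerow([time.time(), int(match.group(1))])
            f.flush()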

Apart from this, I also monitor system vitals using the visual jtop command. There as well, the GPU bar never reaches 100% for INT8 and INT4 workloads.

To run the FP16 workload, just replace the model-loading part with this snippet:

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    cache_dir=cache_dir,
    torch_dtype=torch.float16,
    device_map="cuda"    # Automatically map layers to GPU
)

Here you will notice the GPU utilization reach its maximum (it stays consistently at 99%).

Note that I have seen this behaviour on 2 separate LLM models:

  1. meta-llama/Meta-Llama-3.1-8B (used in the given code)
  2. microsoft/phi-2

FP16 and FP32 utilize the GPU fully; INT8, however, does not.

Feel free to ask for other implementation details.

Could you test with PyTorch from the jp6/cu126 index?

Hi,

It looks like the GPU is used but not fully occupied.
To get better GPU utilization, you can try deploying the model with MLC or llama.cpp.
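For example, with llama.cpp you could load a quantized GGUF build of the model through the llama-cpp-python bindings, roughly like this (a sketch; the GGUF path is a placeholder and this requires a CUDA-enabled build of llama-cpp-python):

from llama_cpp import Llama

# Load a quantized GGUF model and offload all layers to the GPU
llm = Llama(
    model_path="/path/to/Meta-Llama-3.1-8B.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # -1 offloads every layer to the GPU
    n_ctx=2048,
)

output = llm("What is artificial intelligence?", max_tokens=50)
print(output["choices"][0]["text"])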

Thanks.