INT8 and INT4 performance on Jetson AGX Orin

My AGX Orin Developer Kit has the following specs:

JetPack: 6.0
L4T: 36.3.0
CUDA: 12.2
PyTorch: 2.3.0

While running some LLM inference code locally with the Transformers library, using bitsandbytes to quantize the models to INT8 and INT4, I noticed that the GPU is not being utilized fully (whereas it stays at 99% when performing inference in FP16).

Is there something I need to do to perform INT8 and INT4 quantized inference efficiently?

Quantization has reduced the model's memory footprint, but inference latency has increased and throughput (tokens/sec) has dropped significantly.

I can attach a screenshot of the CSV file where I am logging the stats from tegrastats.

Here is how I am initializing my model:

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    cache_dir=cache_dir,
    quantization_config=bnb_config,
    device_map="cuda"
)
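
For the INT4 runs, the only change is the quantization config; a minimal sketch of what I use (the nf4 quant type and float16 compute dtype here are assumptions on my side, adjust as needed):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Minimal INT4 (4-bit) sketch; nf4 quant type and float16 compute dtype are assumptions
bnb_config_int4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_int4 = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    cache_dir=cache_dir,
    quantization_config=bnb_config_int4,
    device_map="cuda"
)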

Dear @mayankarya,
Could you share the complete code and steps to reproduce the issue?
Note that tegrastats GPU utilization indicates the percentage of active samples out of the total queried samples.

Hey, I am running a batched LLM inference workload in the following manner:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

cache_dir = "<path to where you'd like the model to be stored>"

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

# Use the 8-bit loading with `bitsandbytes`
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    cache_dir=cache_dir,
    quantization_config=bnb_config,
    device_map="cuda"    # Automatically map layers to GPU
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B", cache_dir=cache_dir)

# Perform inference
inputs = tokenizer("What is artificial intelligence?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
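
For the tokens/sec numbers I mention, the throughput side is measured with a simple timer around generate(); a rough sketch, reusing the model, tokenizer and inputs from the snippet above:

import time

# Rough throughput measurement: time generate() and count only the newly generated tokens
start = time.perf_counter()
outputs = model.generate(**inputs, max_length=50)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f} s -> {new_tokens / elapsed:.2f} tokens/sec")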

You may pass a list of input prompts as the first argument in the line below to perform batched inference (see the batched sketch just after it):

inputs = tokenizer("What is artificial intelligence?", return_tensors="pt").to("cuda")
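
Concretely, the batched version looks roughly like this (a sketch; the second prompt and the pad-token handling are illustrative additions, since the Llama tokenizer ships without a pad token by default):

# Batched inference sketch: pad a list of prompts to a common length
prompts = [
    "What is artificial intelligence?",
    "Explain quantization in one sentence.",   # illustrative second prompt
]

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the pad token
tokenizer.padding_side = "left"                # left-pad so generation continues right after each prompt

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**inputs, max_length=50)
for out in outputs:
    print(tokenizer.decode(out, skip_special_tokens=True))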

While this runs, I log system info in parallel in a separate terminal using sudo tegrastats and retrieve the GPU utilization values (the ones shown in the CSV screenshot mentioned above).
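
For reference, the CSV values come from pulling the GR3D_FREQ field out of each tegrastats line; a minimal illustrative sketch of that parsing (my actual logging script is a bit longer):

import re
import subprocess

# Illustrative parser: extract the GR3D_FREQ (GPU load) percentage from each tegrastats line
proc = subprocess.Popen(["tegrastats"], stdout=subprocess.PIPE, text=True)
for line in proc.stdout:
    match = re.search(r"GR3D_FREQ (\d+)%", line)
    if match:
        print(f"GPU util: {match.group(1)}%")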

Apart from this, I also monitor system vitals using the visual jtop tool. There as well, the GPU bar never reaches 100% for INT8 and INT4 workloads.

To run the FP16 workload, just replace the model-loading part with this snippet:

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    cache_dir=cache_dir,
    torch_dtype=torch.float16,
    device_map="cuda"    # Automatically map layers to GPU
)

Here you will notice the GPU utilization reach its maximum (it stays consistently at 99%).

Note that I have seen this behaviour on 2 separate LLM models:

  1. meta-llama/Meta-Llama-3.1-8B (used in the given code)
  2. microsoft/phi-2

FP16 and FP32 utilize the GPU fully; INT8, however, doesn't.

Feel free to ask for other implementation details.

Could you test with the PyTorch wheel from the jp6/cu126 index?

Hi,

It looks like the GPU is used but not fully occupied.
To get better GPU utilization, you can try deploying the model with MLC or llama.cpp.
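
For example, with the llama-cpp-python bindings a minimal sketch looks roughly like this (the GGUF file name is a placeholder for a pre-quantized export of the model, and the bindings must be built with CUDA support):

from llama_cpp import Llama

# Sketch only: the GGUF path/quantization level below are placeholders
llm = Llama(
    model_path="./Meta-Llama-3.1-8B-Q4_K_M.gguf",  # pre-quantized GGUF export of the model
    n_gpu_layers=-1,                               # offload all layers to the GPU
)

out = llm("What is artificial intelligence?", max_tokens=50)
print(out["choices"][0]["text"])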

Thanks.
