Hey, I am running a batched LLM inference workload in the following manner:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
cache_dir = "<path to where you'd like the model to be stored>"
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)
# Load the model in 8-bit with `bitsandbytes`
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    cache_dir=cache_dir,
    quantization_config=bnb_config,
    device_map="cuda",  # Place the whole model on the GPU
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B", cache_dir=cache_dir)
# Perform inference
inputs = tokenizer("What is artificial intelligence?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
To perform batched inference, you can pass a list of input prompts as the first argument to this tokenizer call:
inputs = tokenizer("What is artificial intelligence?", return_tensors="pt").to("cuda")
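Concretely, a batched call looks roughly like this (the prompts here are placeholders; note that the Llama tokenizer ships without a pad token, so one has to be assigned before padding works):

prompts = [
    "What is artificial intelligence?",
    "Explain quantization in one sentence.",
]
# Llama tokenizers have no pad token by default; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token
# Left padding is the usual choice for decoder-only generation.
tokenizer.padding_side = "left"
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))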
While this runs, I log system info in parallel from a separate terminal using `sudo tegrastats` and extract the GPU utilization values from its output (these are the values I shared with you in the screenshot of the CSV file).
I also monitor system vitals with the visual `jtop` tool. There as well, the GPU utilization bar never reaches 100% for the INT8 and INT4 workloads.
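The GPU utilization values come from the GR3D_FREQ field in the tegrastats output. As a minimal sketch (filenames are placeholders, and the exact field layout can vary a bit between JetPack versions), this is roughly how the values get pulled into a CSV:

import csv
import re

# Assumes a log captured with e.g. `sudo tegrastats --interval 1000 > tegrastats.log`
pattern = re.compile(r"GR3D_FREQ (\d+)%")

with open("tegrastats.log") as log, open("gpu_util.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["sample", "gpu_util_percent"])
    for i, line in enumerate(log):
        match = pattern.search(line)
        if match:
            writer.writerow([i, int(match.group(1))])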
To run the FP16 workload, just replace the model-loading part with this snippet:
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    cache_dir=cache_dir,
    torch_dtype=torch.float16,
    device_map="cuda",  # Place the whole model on the GPU
)
Here you will see GPU utilization go to its maximum (it stays consistently at 99%).
Note that I have seen this behaviour with two separate LLM models:
- meta-llama/Meta-Llama-3.1-8B (used in the given code)
- microsoft/phi-2
FP16 and FP32 utilize the GPU fully; INT8, however, doesn't.
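For completeness, the INT4 workload mentioned above is loaded via the analogous 4-bit `bitsandbytes` config, roughly along these lines (NF4 quant type and FP16 compute dtype are the usual choices; the rest of the script stays the same):

# Sketch of the standard 4-bit bitsandbytes setup (INT4/NF4 weights, FP16 compute)
bnb_config_int4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    cache_dir=cache_dir,
    quantization_config=bnb_config_int4,
    device_map="cuda",
)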
Feel free to ask for other implementation details.