Hey everyone,
I’m new here, but I’ve been diving deep into AI model optimization lately and wanted to share something I’ve been working on. I managed to get inference running on the 405B-parameter Llama 3.1 model with just a 4070 Super (12 GB of VRAM). A single run took about 26 hours to complete, but it worked! I’m pretty excited about this since it’s all consumer-grade hardware, and I’m curious if anyone else has tried something similar.
While working on this, I’ve been playing around with an idea I’m calling a VPool System: basically a framework to dynamically allocate GPU resources so larger models can run without high-end hardware. It’s still a work in progress, but here’s what I’ve tried so far (rough code sketches of each idea follow the list):
- Dynamic VRAM Pooling: Preloading the layers I predict will be needed next, so the working set stays within the GPU’s VRAM limit.
- Async Offloading: Offloading inactive layers to CPU or even NVMe storage when GPU memory gets tight.
- Quantization: Dropping the precision down to 4-bit to save memory while keeping performance reasonable.
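To make the pooling idea concrete, here’s roughly the simplest version of it: keep every decoder layer in CPU RAM and use PyTorch hooks to pull a layer into VRAM right before its forward pass and evict it right after. This is just a sketch of the general pattern, not my actual VPool code, and `stream_layers` / the `layers` argument are names I made up for the example:

```python
import torch.nn as nn

def stream_layers(layers: nn.ModuleList, device: str = "cuda"):
    """Keep each layer in VRAM only for the duration of its own forward pass."""

    def pre_hook(module, args):
        module.to(device)   # pull this layer's weights into VRAM just in time

    def post_hook(module, args, output):
        module.to("cpu")    # evict them again once the layer's output exists

    for layer in layers:
        layer.register_forward_pre_hook(pre_hook)
        layer.register_forward_hook(post_hook)
```

On a Hugging Face Llama checkpoint you’d call something like `stream_layers(model.model.layers)`; the “prediction” part of VPool is then about kicking off the preload earlier than the hook does.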
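For the async part, the pattern I’ve been working from is: pin the CPU copies of the weights and issue the host-to-GPU copy for layer i+1 on a separate CUDA stream while layer i computes on the default stream. Again a hedged sketch under simplifying assumptions (a plain list of layers you can call with a single tensor, not the real Llama decoder signature; `run_offloaded` and `pin_layer` are made-up names):

```python
import torch
import torch.nn as nn

def pin_layer(layer: nn.Module):
    # Host->GPU copies can only overlap compute if the CPU tensors are page-locked.
    for p in layer.parameters():
        p.data = p.data.pin_memory()

def run_offloaded(layers, hidden: torch.Tensor) -> torch.Tensor:
    copy_stream = torch.cuda.Stream()

    def prefetch(layer: nn.Module):
        # Stage this layer's weights on the GPU via the side stream.
        with torch.cuda.stream(copy_stream):
            layer.to("cuda", non_blocking=True)

    for layer in layers:
        pin_layer(layer)

    prefetch(layers[0])
    for i, layer in enumerate(layers):
        # Don't run the layer until its weights have actually landed on the GPU.
        torch.cuda.current_stream().wait_stream(copy_stream)
        if i + 1 < len(layers):
            prefetch(layers[i + 1])   # overlap the next copy with this compute
        hidden = layer(hidden)
        # Evict after use; a real system would keep a pinned CPU master copy
        # and just free the GPU copy instead of copying the weights back.
        layer.to("cpu")
    return hidden
```

If a layer’s compute finishes before the next layer’s copy does, the GPU just sits waiting on PCIe, which is the kind of bottleneck I’m most interested in avoiding.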
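And for the 4-bit piece, this isn’t exactly what my setup does, but the stock-library way to get 4-bit weights plus CPU/NVMe spill is bitsandbytes NF4 through `transformers`, with `accelerate`’s `device_map` handling placement. The repo id and memory budgets below are placeholders, and depending on library versions you may need to adjust how much of a quantized model can actually be offloaded to disk:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Llama-3.1-405B-Instruct"   # placeholder; use your checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                             # let accelerate place the layers
    max_memory={0: "10GiB", "cpu": "48GiB"},       # leave headroom on the 12 GB card
    offload_folder="offload",                      # anything left over goes to NVMe
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

Even at 4 bits, 405B parameters are on the order of 200 GB of weights, so most of the model still lives on disk no matter what; the whole point of the VPool idea is to make that shuffling less painful.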
I’ve got two main questions for you all:
- Has anyone else run inference on models this size with similar (or less) hardware? If so, I’d love to hear how you approached it.
- For those who’ve experimented with asynchronous offloading, what’s the best way to speed it up or avoid bottlenecks?
This project has been a mix of wins and trial-and-error for me, so I’d love any insights or advice from those who’ve tackled these kinds of challenges.
Looking forward to hearing from you and learning more about what’s possible in this space!
Cheers,
Ross