How to Deploy LLMs on RTX PCs

Models

Thousands of models exist in the open-source community, all of which are accelerated on RTX today. Each application has its own use cases, requirements, and performance targets, and these must be weighed alongside the choice of inference backend. Additionally, many application developers choose to customize open-source models to fit their needs.

Sample requirements to consider: context length, precision, use-case quality, languages, modalities, memory usage, and base vs. instruct-tuned model.
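Memory usage in particular can be budgeted before choosing a model. As a rough rule of thumb (the helper below is illustrative, not part of any SDK), the weights alone occupy roughly parameter count times bytes per parameter, with KV cache and runtime overhead on top:

```python
# Back-of-envelope estimate of weight memory (rule of thumb, illustrative only).
# The KV cache and runtime overhead add more on top and grow with context length.
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

print(weight_memory_gb(7, 16))  # 7B model at FP16  -> ~14.0 GB
print(weight_memory_gb(7, 4))   # 7B model at 4-bit -> ~3.5 GB
print(weight_memory_gb(3, 8))   # 3B model at INT8  -> ~3.0 GB
```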

Model Customization on Device

The table below showcases the most popular foundation models we are seeing in the community today. Additionally, thousands of fine-tuned variants and LoRA adapters for these models are available on Hugging Face, which developers can make use of (a minimal loading sketch follows the table below). For most chat use cases, developers choose an instruct-tuned variant of the base model.

| Base Model | Use Case |
| --- | --- |
| Llama 3.2 (1B, 3B) | Text generation (keep in mind it is restricted in the EU) |
| Llama 3.1 (8B) | Text generation |
| Ministral (3B, 8B) | Text generation |
| Mistral 7B | Text generation |
| Gemma-2 (2B, 9B) | Text generation |
| Phi-3.5 (3.8B) | Text generation |
| Phi-3 Mini / Medium (3.8B and 14B) | Text generation |
| Qwen (0.5B, 1.5B, 3B, 7B, 14B) | Text generation |
| Nemotron-Mini (4B, 8B) | Text generation |
| Llama 3.2 11B Vision | Visual understanding |
| LLaVA | Visual understanding |
| CodeLlama | Coding |
| CodeGemma | Coding |
| DeepSeek Coder (6.7B) | Coding |

Developers can use this leaderboard as a reference for the best-performing models on academic benchmarks, and can also filter and explore models on Hugging Face.
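As a concrete example of customization, a fine-tuned variant or LoRA adapter can be pulled from Hugging Face and attached to its base model in a few lines. Below is a minimal sketch using the transformers and peft libraries, assuming a CUDA-capable RTX GPU; the adapter ID is a placeholder:

```python
# Minimal sketch: load an instruct-tuned base model and attach a LoRA adapter.
# Requires: pip install transformers peft accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example instruct-tuned base model
adapter_id = "your-org/your-lora-adapter"       # placeholder: any compatible LoRA on HF

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="cuda", torch_dtype="auto")
model = PeftModel.from_pretrained(model, adapter_id)  # applies the adapter on top of the base weights

inputs = tokenizer("Write a haiku about GPUs.", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

PeftModel keeps the adapter weights separate from the base model; for deployment, merge_and_unload() can bake them in so the result runs like a plain model.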

Inference Backends

Review the following for comparisons, background information, and getting-started resources:

Overview of Inferencing Backends

Windows Native Devs

For developers looking for Windows-native solutions in Python and C++, we generally recommend:

  1. ORT-GenAI SDK (with the DML backend) - performance and broad coverage across PC hardware.
  2. Llama.cpp - large open-source community and ease of use; the only option here with a native Vulkan backend. A minimal sketch follows this list.
  3. TensorRT - maximum performance on NVIDIA GPUs, though vanilla TensorRT alone is not sufficient for LLMs (TensorRT-LLM for Windows is no longer supported).
  4. PyTorch-CUDA - best suited for experimentation.
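
As an example for option 2, here is a minimal sketch using the llama-cpp-python bindings; the GGUF path is a placeholder and the parameters are illustrative:

```python
# Minimal sketch: chat completion with llama.cpp via llama-cpp-python.
# Requires: pip install llama-cpp-python (built with CUDA or Vulkan support)
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder GGUF path
    n_gpu_layers=-1,  # offload all layers to the RTX GPU
    n_ctx=4096,       # context window; tune for your model and VRAM budget
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what an inference backend does."}],
    max_tokens=128,
)
print(result["choices"][0]["message"]["content"])
```

Setting n_gpu_layers=-1 offloads every layer to the GPU; lower it to fit larger models into a limited VRAM budget.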

Getting Started Resources -

Web Devs

Several solutions exist for web developers and cross-platform developers. The following appear to be popular in the community; learn more and get started with them below:

  1. Web: WebNN, ORT-Web (for web integrations), ml5.js, Transformers.js
  2. Cross-platform apps (JS): Transformers.js, Llama.cpp-node, ORT-Node (for Node.js), ORT-React (for React Native apps)

Example Workflows & Application Integration Tools