How to Deploy LLMs on RTX PCs

This document covers our recommendations for deploying large language models (LLMs) on NVIDIA RTX GPUs to get the best performance and features.

Selecting the Right Models for Your Needs

Thousands of open-source models are available in the community, all of which are accelerated on RTX today. Applications differ in their use-cases, requirements, and performance targets, and these requirements must be weighed together with the choice of inference backend. Additionally, many application developers customize open-source models to fit their needs.

Sample requirements to consider: context length, precision, use-case quality, languages, modalities, memory usage, and base model vs. instruct-tuned model.

The table below showcases the most popular foundation models we are seeing in the community today. Hundreds of fine-tuned variants and LoRA adapters for these models are also available on Hugging Face for developers to build on. For most chat use-cases, developers choose an instruct-tuned variant of the base model.

| Base Model | Use-case |
| --- | --- |
| gpt-oss:20B | Reasoning & agentic tasks |
| DeepSeek-R1 family | Reasoning models |
| Gemma 3 family | Text generation models |
| Qwen3 family | Text generation models |
| Llama 3 family | Text generation models |
| Mistral family | Text generation models |

Developers can use this leaderboard as a reference for the best-performing models on academic benchmarks, and can also filter and explore models on Hugging Face.
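For quick programmatic exploration, a minimal sketch using the huggingface_hub library is shown below. The task filter, sort order, and result count are illustrative choices, and the exact keyword arguments can vary slightly across huggingface_hub versions:

```python
# Minimal sketch: list widely downloaded text-generation models on Hugging Face.
# Assumes `pip install huggingface_hub`; filters shown are examples only.
from huggingface_hub import list_models

# Sort by downloads to surface popular checkpoints.
for model in list_models(task="text-generation", sort="downloads", limit=10):
    print(model.id)
```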

Compute Requirements & Considerations

One of the key considerations when deploying LLMs on the PC is the model's VRAM requirement. To get maximum quality within a VRAM budget, developers optimize models with quantization techniques. These techniques shrink the precision format of model weights and activations to lower VRAM usage and accelerate computation during inference. Popular quantization formats include MXFP4, Q4_K_M, and INT4.
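To see how precision affects the budget, here is a back-of-the-envelope sketch. It counts weight memory only; KV cache, activations, and runtime overhead add to the real total, and the bits-per-weight figures are approximate averages:

```python
# Back-of-the-envelope estimate of weight memory for a quantized LLM.
# Weights only: KV cache, activations, and runtime overhead are extra.
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    bytes_per_weight = bits_per_weight / 8
    return params_billions * 1e9 * bytes_per_weight / 1024**3

# Example: a 20B-parameter model at different precisions
# (bits-per-weight values are approximate averages).
for label, bpw in [("FP16", 16.0), ("Q4_K_M", 4.85), ("INT4 / MXFP4", 4.0)]:
    print(f"{label:>12}: ~{weight_vram_gb(20, bpw):.1f} GB")
```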

Inference & Deployment

For developers looking for Python and C++ Windows-native solutions, we generally recommend:

  1. Ollama: For LLM developers who want to experiment quickly with the latest local LLM models. Wide reach with cross-vendor and cross-OS support; employs a server-based approach (see the Python sketch after the comparison table below).
  2. Llama.cpp: For LLM developers who want to experiment quickly with the latest local LLM models. Wide reach with cross-vendor and cross-OS support, with a path to in-app deployment on the PC (see the sketch at the end of this section).
  3. Windows ML with NVIDIA TensorRT for RTX: For application developers building AI features for Windows PC applications.
  4. TensorRT for RTX: For maximum performance and full flexibility on NVIDIA GPUs.
|  | Ollama / Llama.cpp | Windows ML with NVIDIA TensorRT for RTX | TensorRT for RTX |
| --- | --- | --- | --- |
| Best suited for | LLM developers who want wide reach with cross-vendor and cross-OS support | Application developers building AI features for Windows PCs | Windows application developers who want maximum control and flexibility of AI behavior on NVIDIA RTX GPUs |
| Performance | Fast | Fastest | Fastest |
| OS Support | Windows, Linux, and Mac | Windows | Windows and Linux |
| Hardware Support | Any GPU or CPU | Any GPU or CPU | NVIDIA RTX GPUs |
| Model Checkpoint Format | GGUF or GGML | ONNX | ONNX |
| Installation Process | Installation of Python packages required | Pre-installed on Windows | Install SDK and Python bindings |
| LLM Support | ✓ | ✓ | ✓ |
| Model Optimizations | Llama.cpp | Microsoft Olive | TensorRT Model Optimizer |
| Python | ✓ | ✓ | ✓ |
| C/C++ | ✓ | ✓ | ✓ |
| C#/.NET |  | ✓ | - |
| Javascript |  |  | - |
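As an example of how quick experimentation looks with Ollama (option 1 above), here is a minimal sketch using Ollama's official Python client. It assumes the Ollama server is running locally and that the model tag shown has already been pulled; both are illustrative choices:

```python
# Minimal sketch: chat with a local model through the Ollama server.
# Assumes the Ollama server is running and the model has been pulled,
# e.g. `ollama pull gpt-oss:20b`.
import ollama

response = ollama.chat(
    model="gpt-oss:20b",  # example tag; any locally pulled model works
    messages=[{"role": "user", "content": "Summarize what an RTX GPU is in one sentence."}],
)
print(response["message"]["content"])
```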

Review these resources for comparisons, additional information, and getting-started guides.
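To illustrate the in-app deployment path noted for Llama.cpp (option 2 above), here is a minimal sketch using the community llama-cpp-python bindings. The GGUF path is a placeholder, and the parameters shown are starting points, not tuned recommendations:

```python
# Minimal sketch: embed a GGUF model directly in an application
# via llama-cpp-python (no separate server process required).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/your-model.gguf",  # placeholder path
    n_gpu_layers=-1,                        # offload all layers to the RTX GPU
    n_ctx=4096,                             # context window; tune per use-case
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```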

Fine-Tuning

Developers can customize and fine-tune LLMs on their RTX AI PCs to tailor model responses to particular use-cases and tasks, such as a custom personal email-writing assistant. For this, we recommend Unsloth, a popular community framework for quantization, adapter tuning, fine-tuning, and reinforcement learning on NVIDIA RTX GPUs.
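As a minimal sketch of what adapter tuning with Unsloth can look like, the snippet below loads a 4-bit base model and attaches LoRA adapters. The checkpoint name and hyperparameters are illustrative only; training then proceeds with a standard trainer (for example, TRL's SFTTrainer) as described in Unsloth's documentation:

```python
# Minimal sketch: load a 4-bit base model with Unsloth and attach LoRA adapters.
# Checkpoint name and hyperparameters are illustrative only.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",  # example 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,  # quantized loading keeps VRAM usage low on RTX GPUs
)

# Attach LoRA adapters so only a small set of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# From here, fine-tune with a standard trainer (e.g. trl's SFTTrainer)
# on your own dataset, such as past emails for a writing assistant.
```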

Refer to Unsloth's getting-started guide for a full walkthrough.

Example Workflows & Application Integration Tools