This document covers our recommendations for deploying large language models (LLMs) on NVIDIA RTX GPUs for the best performance and feature support.
Selecting the Right Models for Your Needs
Thousands of models exist in the open-source community, all of which are accelerated on RTX today. Applications differ in their use cases, requirements, and performance targets, and these needs have to be weighed together with the choice of inference backend. Additionally, many application developers choose to customize open-source models to fit their needs.
Sample requirements to consider: context length, precision, use-case quality, languages, modalities, memory usage, and base model vs. instruct-tuned model.
The table below showcases the most popular foundation models we see in the community today. In addition, hundreds of fine-tuned variants and LoRA adapters for these models are available on Hugging Face for developers to build on. For most chat use cases, developers choose an instruct-tuned variant of the base model. The most popular foundation models are listed below:
| Base Model | Use Case |
|---|---|
| gpt-oss:20B | Reasoning & agentic tasks |
| DeepSeek-R1 family | Reasoning models |
| Gemma 3 family | Text generation |
| Qwen3 family | Text generation |
| Llama 3 family | Text generation |
| Mistral family | Text generation |
Developers can use this leaderboard as a reference for the best-performing models on academic benchmarks, and can also filter and explore models on Hugging Face.
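As a quick illustration of exploring models programmatically, the sketch below uses the `huggingface_hub` Python package to list popular text-generation models. The task and sort values shown are examples for browsing, not recommendations.

```python
# Minimal sketch: browse popular text-generation models on Hugging Face.
# Assumes the `huggingface_hub` package is installed (pip install huggingface_hub).
from huggingface_hub import HfApi

api = HfApi()

# List the ten most-downloaded text-generation models.
for model in api.list_models(task="text-generation", sort="downloads", limit=10):
    print(model.id)
```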
Compute Requirements & Considerations
One of the key considerations when deploying LLMs on the PC is the model's VRAM requirement. To get maximum quality within a VRAM budget, developers optimize models with quantization techniques. These techniques shrink the precision format of the model's weights and activations, lowering VRAM usage and accelerating computation during inference. Popular quantization formats include MXFP4, Q4_K_M, and INT4.
In addition to VRAM, the requirements listed earlier, such as context length, supported languages and modalities, and use-case quality, factor into which model and quantization format fit a given deployment.
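To make the VRAM trade-off concrete, here is a rough back-of-the-envelope sketch: weight memory is approximately the parameter count times the bytes per parameter. This is an approximation only; it ignores activations, the KV cache, and runtime overhead.

```python
# Rough sketch: approximate weight memory for a model at different precisions.
# This ignores activations, KV cache, and runtime overhead, so treat the
# numbers as lower bounds rather than real VRAM requirements.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "INT8": 1.0,
    "INT4 / MXFP4 (~4-bit)": 0.5,  # 4-bit formats, ignoring scales/zero-points
}

def approx_weight_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB for a given parameter count."""
    return num_params * bytes_per_param / (1024 ** 3)

params = 20e9  # e.g., a 20B-parameter model
for fmt, bpp in BYTES_PER_PARAM.items():
    print(f"{fmt}: ~{approx_weight_gib(params, bpp):.1f} GiB of weights")
```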
Inference & Deployment
For developers looking for Python and C++ Windows-native solutions, we generally recommend:
- Ollama: For LLM developers who want to experiment quickly with the latest local LLM models. Wide reach with cross-vendor and cross-OS support, using a server-based approach (a minimal client sketch follows this list).
- Llama.cpp: For LLM developers who want to experiment quickly with the latest local LLM models. Wide reach with cross-vendor and cross-OS support, with a path to in-app deployment on the PC.
- Windows ML with NVIDIA TensorRT for RTX: For application developers building AI features for Windows PC applications.
- TensorRT for RTX: Maximum performance and full flexibility on NVIDIA GPUs.
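As a minimal sketch of the server-based approach, the snippet below queries a locally running Ollama server over its REST API. It assumes Ollama is installed, running on the default port (11434), and that the example model has already been pulled; substitute any model you have available.

```python
# Minimal sketch: query a local Ollama server over its REST API.
# Assumes Ollama is running on the default port and the example model below
# has already been pulled (e.g., `ollama pull gpt-oss:20b`).
import json
import urllib.request

payload = {
    "model": "gpt-oss:20b",          # example model name; substitute your own
    "prompt": "Explain the KV cache in one sentence.",
    "stream": False,                 # return a single JSON response
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```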
| | Ollama / Llama.cpp | Windows ML with NVIDIA TensorRT for RTX | TensorRT for RTX |
|---|---|---|---|
| For | LLM developers who want wide reach with cross-vendor and cross-OS support | Application developers building AI features for Windows PC | Windows application developers who want maximum control and flexibility of AI behavior on NVIDIA RTX GPUs |
| Performance | Fast | Fastest | Fastest |
| OS Support | Windows, Linux, and Mac | Windows | Windows and Linux |
| Hardware Support | Any GPU or CPU | Any GPU or CPU | NVIDIA RTX GPUs |
| Model Checkpoint Format | GGUF or GGML | ONNX | ONNX |
| Installation Process | Installation of Python packages required | Pre-installed on Windows | Install SDK and Python bindings |
| LLM Support | ✓ | ✓ | ✓ |
| Model Optimizations | Llama.cpp | Microsoft Olive | TensorRT Model Optimizer |
| Python | ✓ | ✓ | ✓ |
| C/C++ | ✓ | ✓ | ✓ |
| C#/.NET | - | ✓ | |
| Javascript | - | | |
Review the following for comparisons, additional information, and getting started guides:
- Overview of Inferencing Backends
- Getting Started Resources:
Fine-Tuning
Developers can customize and fine-tune LLMs on their RTX AI PCs to tailor model responses to particular use cases and tasks, for example a custom personal email-writing assistant. To do this, we recommend Unsloth, a popular community framework for quantization, adapter tuning, fine-tuning, and reinforcement learning on NVIDIA RTX GPUs.
Refer to the following getting started guide for Unsloth.
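As a hedged illustration of the workflow, the sketch below follows Unsloth's typical pattern of loading a 4-bit base model, attaching LoRA adapters, and running a short supervised fine-tuning pass with TRL. The model name, toy dataset, and hyperparameters are placeholder examples only, and the exact `SFTTrainer` arguments vary slightly across TRL versions.

```python
# Minimal LoRA fine-tuning sketch with Unsloth + TRL.
# Assumes `unsloth`, `trl`, `transformers`, and `datasets` are installed and an
# RTX GPU is available. Model name, data, and hyperparameters are examples only.
from unsloth import FastLanguageModel
from datasets import Dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load a 4-bit quantized base model and its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",  # example checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small set of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Toy dataset: in practice, use your own task-specific training examples.
dataset = Dataset.from_list([
    {"text": "Write a polite follow-up email about the quarterly report."},
    {"text": "Draft a short thank-you note to a customer after a demo."},
])

# Note: newer TRL releases move some of these arguments into SFTConfig.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        max_steps=30,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```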
Example Workflows & Application Integration Tools
- Function calling chat bot (Electron app using node-llama-cpp): GitHub Link
- Agentic Frameworks:
  - Langchain (Python & JS)
  - LlamaIndex (Python)
  - SemanticKernel (C#, Python, Java)
- VectorDBs:
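As one illustration of wiring a local model into an agentic framework, the sketch below points LangChain at a locally running Ollama model. It assumes the `langchain-ollama` integration package is installed, the Ollama server is running, and the example model has already been pulled.

```python
# Minimal sketch: use a local Ollama model from LangChain.
# Assumes `pip install langchain-ollama`, a running Ollama server, and that the
# example model below has already been pulled with `ollama pull`.
from langchain_ollama import ChatOllama

llm = ChatOllama(model="gpt-oss:20b")  # example model name; substitute your own

# Single-turn invocation; chains, tools, and agents build on the same object.
reply = llm.invoke("Summarize why quantization reduces VRAM usage.")
print(reply.content)
```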