How to Deploy LLMs on RTX PCs

Models

Thousands of models exist in the open-source community, all of which are accelerated on RTX today. Each application has its own use cases, requirements, and performance targets, and these must be weighed alongside the choice of inference backend. Additionally, many application developers choose to customize open-source models to fit their needs.

Sample requirements to consider: context length, precision, use-case quality, languages, modalities, memory usage, and base vs. instruct-tuned model.
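Memory usage in particular can be budgeted before choosing a model. As a rough rule of thumb (the helper below is illustrative, not part of any SDK), the weights alone occupy roughly parameter count times bytes per parameter, with KV cache and runtime overhead on top:

```python
# Back-of-envelope estimate of weight memory (rule of thumb, illustrative only).
# The KV cache and runtime overhead add more on top and grow with context length.
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

print(weight_memory_gb(7, 16))  # 7B model at FP16  -> ~14.0 GB
print(weight_memory_gb(7, 4))   # 7B model at 4-bit -> ~3.5 GB
print(weight_memory_gb(3, 8))   # 3B model at INT8  -> ~3.0 GB
```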

Model Customization on Device

The table below showcases the most popular foundation models we are seeing in the community today. Additionally, thousands of fine-tuned variants and LoRA adapters for these models are available on Hugging Face, which developers can make use of (a minimal loading sketch follows the table below). For most chat use cases, developers choose an instruct-tuned variant of the base model.

| Base Model | Use Case |
| --- | --- |
| Llama 3.2 (1B, 3B) | Text generation (keep in mind it is restricted in the EU) |
| Llama 3.1 (8B) | Text generation |
| Ministral (3B, 8B) | Text generation |
| Mistral 7B | Text generation |
| Gemma-2 (2B, 9B) | Text generation |
| Phi-3.5 (3.8B) | Text generation |
| Phi-3 Mini / Medium (3.8B and 14B) | Text generation |
| Qwen (0.5B, 1.5B, 3B, 7B, 14B) | Text generation |
| Nemotron-Mini (4B, 8B) | Text generation |
| Llama 3.2 11B Vision | Visual understanding |
| LLaVA | Visual understanding |
| CodeLlama | Coding |
| CodeGemma | Coding |
| DeepSeek Coder (6.7B) | Coding |

Developers can use this leaderboard as a reference for the best-performing models on academic benchmarks, and can also filter and explore models on Hugging Face.
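As a concrete example of customization, a fine-tuned variant or LoRA adapter can be pulled from Hugging Face and attached to its base model in a few lines. Below is a minimal sketch using the transformers and peft libraries, assuming a CUDA-capable RTX GPU; the adapter ID is a placeholder:

```python
# Minimal sketch: load an instruct-tuned base model and attach a LoRA adapter.
# Requires: pip install transformers peft accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example instruct-tuned base model
adapter_id = "your-org/your-lora-adapter"       # placeholder: any compatible LoRA on HF

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="cuda", torch_dtype="auto")
model = PeftModel.from_pretrained(model, adapter_id)  # applies the adapter on top of the base weights

inputs = tokenizer("Write a haiku about GPUs.", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

PeftModel keeps the adapter weights separate from the base model; for deployment, merge_and_unload() can bake them in so the result runs like a plain model.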

Inference Backends

Review the following for comparisons, background information, and getting-started resources:

Overview of Inferencing Backends

Windows Native Devs

For developers looking for Windows-native solutions in Python and C++, we generally recommend:

  1. ORT-GenAI SDK (with the DML backend) - performance and broad coverage across PC hardware.
  2. Llama.cpp - large open-source community and ease of use; the only option here with a native Vulkan backend. A minimal sketch follows this list.
  3. TensorRT - maximum performance on NVIDIA GPUs, though vanilla TensorRT alone is not sufficient for LLMs (TensorRT-LLM for Windows is no longer supported).
  4. PyTorch-CUDA - best suited for experimentation.
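
As an example for option 2, here is a minimal sketch using the llama-cpp-python bindings; the GGUF path is a placeholder and the parameters are illustrative:

```python
# Minimal sketch: chat completion with llama.cpp via llama-cpp-python.
# Requires: pip install llama-cpp-python (built with CUDA or Vulkan support)
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder GGUF path
    n_gpu_layers=-1,  # offload all layers to the RTX GPU
    n_ctx=4096,       # context window; tune for your model and VRAM budget
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what an inference backend does."}],
    max_tokens=128,
)
print(result["choices"][0]["message"]["content"])
```

Setting n_gpu_layers=-1 offloads every layer to the GPU; lower it to fit larger models into a limited VRAM budget.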

Getting Started Resources -

Web Devs

Several solutions exist for web developers and cross-platform developers. The following appear to be popular in the community; learn more and get started with them below:

  1. Web: WebNN, ORT-Web (for web integrations), ml5.js, Transformers.js
  2. Cross-platform apps (JS): Transformers.js, Llama.cpp-node, ORT-Node (for Node.js), ORT-React (for React Native apps)

Example Workflows & Application Integration Tools