Models
Thousands of models exist in the open-source community, and all of them are accelerated on RTX today. Applications differ in their use cases, requirements, and performance targets, and these requirements have to be weighed together with the choice of inference backend. Additionally, many application developers choose to customize open-source models to fit their needs.
Sample requirements to consider: context length, precision, quality for the target use case, supported languages, modalities, memory usage, and whether to use a base model or an instruct-tuned model.
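As a rough rule of thumb for the memory-usage requirement, weight memory scales with parameter count times bytes per weight. The helper below is a minimal sketch; the 20% overhead factor for KV cache and activations is an assumed heuristic, not a measured value:

```python
def estimate_weight_vram_gb(params_billions: float, bits_per_weight: int,
                            overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight memory only, padded by an assumed
    ~20% overhead for KV cache and activations."""
    return params_billions * bits_per_weight / 8 * overhead

# e.g., an 8B model: roughly 19 GB at FP16, roughly 5 GB at 4-bit
print(estimate_weight_vram_gb(8, 16))  # ~19.2
print(estimate_weight_vram_gb(8, 4))   # ~4.8
```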
Model Customization on device:
- Guide for model customization on RTX with the NVIDIA RTX AI Toolkit (via WSL, using Llama-Factory or Unsloth)
- Guide for Unsloth, a popular community fine-tuning tool (via WSL), and for Llama-Factory
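To give a sense of what on-device LoRA fine-tuning looks like, here is a minimal sketch using Unsloth's Python API with a trl `SFTTrainer`. The model name, dataset file, and hyperparameters are placeholder assumptions, and the `SFTTrainer` signature has shifted across trl releases, so check the guides above for supported configurations:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a 4-bit quantized base model (placeholder name; pick one that fits your VRAM)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; rank and alpha here are illustrative defaults
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Placeholder dataset: each JSONL record is assumed to have a "text" field
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        max_steps=60,
        output_dir="outputs",
    ),
)
trainer.train()
```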
The table below showcases the most popular foundation models we are seeing in the community today. Additionally, thousands of fine-tuned variants and LoRA adapters for these models are available on Hugging Face for developers to make use of. For most chat use cases, developers choose an instruct-tuned variant of the base model.
| Base Model | Use Case |
|---|---|
| Llama 3.2 (1B, 3B) | Text Generation (keep in mind it's restricted in the EU) |
| Llama 3.1 (8B) | Text Generation |
| Ministral (3B, 8B) | Text Generation |
| Mistral 7B | Text Generation |
| Gemma 2 (2B, 9B) | Text Generation |
| Phi-3.5 (3.8B) | Text Generation |
| Phi-3 Mini / Medium (3.8B and 14B) | Text Generation |
| Qwen (0.5B, 1.5B, 3B, 7B, 14B) | Text Generation |
| Nemotron-Mini (4B, 8B) | Text Generation |
| Llama 3.2 Vision (11B) | Visual Understanding Model |
| LLaVA | Visual Understanding Model |
| CodeLlama | Coding |
| CodeGemma | Coding |
| DeepSeek Coder (6.7B) | Coding |
Developers can use this leaderboard for reference on the best-performing models on academic benchmarks, and can also filter and explore models on Hugging Face.
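The same filtering can also be done programmatically. The snippet below is a small sketch using the `huggingface_hub` client; the `task` parameter follows recent versions of the library and may differ in older releases:

```python
from huggingface_hub import HfApi

api = HfApi()
# List the most-downloaded text-generation models on the Hub
for model in api.list_models(task="text-generation",
                             sort="downloads", limit=10):
    print(model.id)
```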
Inference Backends
Review these resources for comparisons, background, and getting-started material:
Overview of Inferencing Backends
Windows Native Devs
For developers looking for Python and C++ Windows-native solutions, we generally recommend:
- ORT-GenAI SDK (with DML backend) - Performance & coverage across PCs
- Llama.cpp - Large open-source community and ease of use; the only option with a native Vulkan backend (a minimal sketch follows this list)
- TensorRT - Maximum performance on NVIDIA GPUs, though vanilla TRT is not sufficient for LLMs (TRT-LLM for Windows is no longer supported)
- PyTorch-CUDA - Experimentation
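For example, loading a GGUF model with full GPU offload through the llama-cpp-python bindings looks roughly like this (the model path is a placeholder; `n_gpu_layers=-1` offloads all layers to the GPU):

```python
from llama_cpp import Llama

# Placeholder path to a GGUF checkpoint downloaded from Hugging Face
llm = Llama(model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",
            n_gpu_layers=-1,  # offload all layers to the RTX GPU
            n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what a LoRA adapter is."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```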
Getting Started Resources:
- Get started with ORT GenAI-DML (Python & C API)
- Get started with Llama.cpp (Python & C++)
- Get started with TensorRT-LLM LoRA Deployment (Python)
- Get started with TensorRT-Model Optimizer & TensorRT-LLM Deployment (Python)
- Get Started with ORT-DML
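To give a flavor of the ORT GenAI path, here is a minimal generation-loop sketch with the `onnxruntime-genai` Python package. The API has shifted across releases, so treat the exact calls as illustrative (the model folder path is a placeholder) and check the getting-started guide above:

```python
import onnxruntime_genai as og

# Placeholder path to a DML-targeted ONNX model folder
model = og.Model("models/phi-3-mini-dml")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is DirectML?"))

# Token-by-token decode loop
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```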
Web Devs
Several solutions exist for web developers and cross-platform devs. The following appear to be popular in the community; learn more and get started with them:
- Web: Web-NN, ORT-Web (for Web integrations), ml5.js, Transformers.js
- Cross platform apps (JS): Transformers.js, Llama.cpp-node, ORT-Node (for JS), ORT-React (for React Native apps)
Example Workflows & Application Integration Tools
- Function-calling chatbot (Electron app using node-llama-cpp): NVIDIA/RTX-AI-Toolkit, examples/node-llama-cpp-app on GitHub
- Agentic Frameworks:
  - LangChain (Python & JS)
  - LlamaIndex (Python)
  - Semantic Kernel (C#, Python, Java)
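As one example of wiring an agentic framework to a local backend, the sketch below points LangChain at a llama.cpp model. The module paths follow the langchain-community package layout, which has moved between releases, and the GGUF path is a placeholder:

```python
from langchain_community.llms import LlamaCpp
from langchain_core.prompts import PromptTemplate

# Placeholder GGUF path; reuse a model downloaded for llama.cpp above
llm = LlamaCpp(model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",
               n_gpu_layers=-1, n_ctx=4096)

prompt = PromptTemplate.from_template("Answer briefly: {question}")
chain = prompt | llm  # LCEL pipe syntax: prompt feeds into the local model

print(chain.invoke({"question": "What is an agentic framework?"}))
```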
- VectorDBs:
  - FAISS (Python)
  - DuckDB (C++)
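To illustrate the vector-database piece, here is a minimal FAISS sketch; the 384-dimension random vectors are stand-ins for whatever embedding model an application actually uses:

```python
import faiss
import numpy as np

d = 384  # embedding dimension; placeholder for a real embedding model's output
index = faiss.IndexFlatL2(d)  # exact L2 search, no training required

# Stand-in corpus embeddings; in practice these come from an embedding model
corpus = np.random.random((1000, d)).astype("float32")
index.add(corpus)

query = corpus[:1]  # query with one of the stored vectors
distances, ids = index.search(query, k=5)
print(ids[0])  # indices of the 5 nearest neighbors (the vector itself first)
```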