vLLM vs NVIDIA NIM

We are experimenting with running vLLM as an open-source solution and are also evaluating NVIDIA NIM services. Interestingly, our initial observations show vLLM performing better than NVIDIA NIM in our setup. Is this expected behavior, or are we possibly missing some configuration, optimization, or tuning on the NIM side? We have an internal debate going on: how can we show that NVIDIA NIM is actually the better option?

Hi Prasana, can you elaborate on which model and GPU were tested, which evaluation harness was used (e.g., GenAI-Perf, `vllm bench`), the concurrency level, and which primary metric you are comparing? It would also help if you could share reproduction code, so we can look into it further.
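For what it's worth, before comparing stacks it helps to pin down exactly how the two headline metrics are computed, since harnesses differ. A minimal sketch of one common convention (the `ttft_and_tps` helper and the synthetic timings below are mine, not from any particular harness):

```python
def ttft_and_tps(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """TTFT in ms and per-request decode throughput in tokens/s.

    token_times: wall-clock arrival time of each output token, in seconds.
    """
    ttft_ms = (token_times[0] - request_start) * 1000.0
    # Decode throughput: tokens after the first, over the time spent decoding them,
    # so prefill cost is not mixed into the tokens/s figure.
    decode_s = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_s if decode_s > 0 else float("nan")
    return ttft_ms, tps

# Synthetic run: first token at +50 ms, then one token every 20 ms, 128 tokens total.
token_times = [0.050 + 0.020 * i for i in range(128)]
ttft, tps = ttft_and_tps(0.0, token_times)
print(f"TTFT {ttft:.1f} ms, TPS {tps:.1f} tok/s")  # TTFT 50.0 ms, TPS 50.0 tok/s
```

If one side's harness includes the first token in its tokens/s while the other excludes it, the numbers won't be comparable, which is one common source of "X beats Y" surprises.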

TensorRT-LLM is supposed to be faster than vLLM, but our results show the opposite. Here are the details; please suggest how we can improve performance.

# NVIDIA NIM Performance Investigation Report

**Date:** 2026-01-12

**Issue:** Lower performance than comparable serving stacks (vLLM, in our tests)

**Pod:** mistral-nim-trtllm-5d644f97df-xdm4d

**Namespace:** nim

---

## 1. GPU Configuration

| Parameter | Value |
|-----------|-------|
| GPU Model | NVIDIA H100 NVL |
| Device ID | 2321:10de |
| Total Memory | 99,949,805,568 bytes (~93 GB) |
| Compute Capability | 9.0 |
| GPUs Used | 1 (device index 0) |

---

## 2. Model Configuration

| Parameter | Value |
|-----------|-------|
| Model Name | `mistral/mistral-small-24b-instruct-2501` |
| Architecture | MistralForCausalLM |
| Model Size | 24B parameters |
| Hidden Size | 5120 |
| Num Attention Heads | 32 |
| Num Key-Value Heads | 8 (Grouped Query Attention) |
| Num Hidden Layers | 40 |
| Intermediate Size | 32768 |
| Head Dimension | 128 |
| Vocab Size | 131072 |
| Max Position Embeddings | 32768 |
| Activation Function | SiLU |
| RoPE Theta | 100,000,000.0 |
| Torch Dtype | bfloat16 |

---

## 3. TensorRT-LLM Engine Configuration

| Parameter | Value |
|-----------|-------|
| TensorRT-LLM Version | 0.17.1 |
| NIM LLM API Version | 1.8.4 |
| PyTorch Version | 2.6.0a0+ecf3bae40a.nv25.1 |
| Profile Selected | `tensorrt_llm-trtllm_buildable-bf16-tp1-pp1` |
| Precision | bfloat16 |
| Tensor Parallel Size | 1 |
| Pipeline Parallel Size | 1 |
| Max Batch Size | 512 |
| Max Sequence Length | 32768 |
| Max Num Tokens | 8192 |
| Block Size | 16 |
| GPU Memory Utilization | 0.9 (90%) |
| Engine Size | 74,122,556,964 bytes (~69 GB) |
| Engine Build Time | 304.65 seconds |
| Available Device Memory | 99,949,805,568 bytes |
| KV Cache Dtype | auto |
| Quantization | None (full-precision bf16) |
| LoRA Enabled | False |
| Prefix Caching | Disabled |
| Chunked Prefill | Not enabled |
| Speculative Decoding | None |

---

## 4. Benchmark Results Summary

### Performance Metrics by Input/Output Token Configuration

| Benchmark ID | Input Tokens | Output Tokens | TTFT (ms) | TPS (tokens/s) | Status |
|--------------|--------------|---------------|-----------|----------------|--------|
| i128-o128-r1 | 128 | 128 | 41.88 | 45.56 | Success |
| i128-o128-r2 | 128 | 128 | 58.41 | 42.73 | Success |
| i128-o256-r1 | 128 | 256 | 43.57 | 45.06 | Success |
| i128-o256-r2 | 128 | 256 | 59.71 | 43.56 | Success |
| i256-o128-r1 | 256 | 128 | 43.63 | 44.43 | Success |
| i256-o128-r2 | 256 | 128 | 59.04 | 43.21 | Success |
| i256-o256-r1 | 256 | 256 | 43.26 | 44.51 | Success |
| i256-o256-r2 | 256 | 256 | 59.86 | 43.59 | Success |
| i512-o128-r1 | 512 | 128 | 45.26 | 49.74 | Success |
| i512-o128-r2 | 512 | 128 | 70.06 | 49.16 | Success |
| i512-o256-r1 | 512 | 256 | 67.06 | 48.81 | Success |
| i512-o256-r2 | 512 | 256 | 67.20 | 48.45 | Success |
| i1024-o128-r1 | 1024 | 128 | 47.09 | 45.41 | Success |
| i1024-o128-r2 | 1024 | 128 | 66.29 | 45.03 | Success |
| i1024-o256-r1 | 1024 | 256 | 50.23 | 45.43 | Success |
| i1024-o256-r2 | 1024 | 256 | 68.55 | 45.57 | Success |
| i2048-o128-r1 | 2048 | 128 | 47.44 | 45.45 | Success |
| i2048-o128-r2 | 2048 | 128 | 66.01 | 44.81 | Success |
| i2048-o256-r1 | 2048 | 256 | 51.43 | 45.61 | Success |
| i2048-o256-r2 | 2048 | 256 | 69.95 | 45.15 | Success |
| i4096-o128-r1 | 4096 | 128 | 65.60 | 46.50 | Success |
| i4096-o128-r2 | 4096 | 128 | 96.61 | 45.89 | Success |
| i4096-o256-r1 | 4096 | 256 | 85.82 | 45.42 | Success |
| i4096-o256-r2 | 4096 | 256 | 95.56 | 45.26 | Success |
| i8192-o128-r1 | 8192 | 128 | 65.87 | 46.50 | Success |
| i8192-o128-r2 | 8192 | 128 | 88.60 | 45.66 | Success |
| i8192-o256-r1 | 8192 | 256 | 72.61 | 45.47 | Success |
| i8192-o256-r2 | 8192 | 256 | 97.16 | 45.25 | Success |
| i4096-o1024-r8 | 4096 | 1024 | 267.43 | 44.15 | Success |

### Performance Summary Statistics

| Metric | Min | Max | Average |
|--------|-----|-----|---------|
| TTFT (ms) | 41.88 | 267.43 | ~65 |
| TPS (tokens/s) | 42.73 | 49.74 | ~45.5 |

---

## 5. Potential Performance Concerns

### Observations:

1. **Single GPU Usage**: Only using 1x H100 NVL with TP=1, PP=1. No multi-GPU parallelism.

2. **No Quantization**: Running full bf16 precision (~69GB engine). Could benefit from FP8/INT8 quantization for better throughput.

3. **Prefix Caching Disabled**: `enable_prefix_caching: false` - could improve TTFT for repeated prompts.

4. **Chunked Prefill Not Enabled**: Could help with long context performance.

5. **No Speculative Decoding**: Not using any speculative decoding which could improve TPS.

6. **TPS Range**: 42-50 tokens/s output throughput appears low for H100 NVL with a 24B model.

7. **TTFT Scaling**: TTFT increases significantly with input length (41.9 ms at 128 input tokens, up to 267.4 ms for the 4096-input / 1024-output run).
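A quick back-of-envelope on points 1 and 2 above, using the model and GPU numbers from sections 1 and 2. This is illustrative only; the real engine reserves additional workspace, as the ~69 GB engine size shows:

```python
GB = 1024 ** 3
bf16_bytes = 2

# Weights: 24B parameters at bf16, before any quantization.
weights_bytes = int(24e9) * bf16_bytes  # 48 GB of weights

# Per-token KV cache under GQA (section 2): K and V for every layer and KV head.
num_layers, num_kv_heads, head_dim = 40, 8, 128
kv_per_token = 2 * num_layers * num_kv_heads * head_dim * bf16_bytes
print(kv_per_token)  # 163840 bytes, i.e. ~160 KiB of KV cache per token

# Rough KV-token budget in the 90% memory fraction, after weights, on a ~93 GB card.
usable = 0.9 * 93 * GB - weights_bytes
kv_budget_tokens = int(usable / kv_per_token)
print(kv_budget_tokens)  # roughly 255k tokens of KV cache across all concurrent requests
```

FP8 quantization would roughly halve `weights_bytes`, freeing on the order of 24 GB for KV cache, which mainly helps throughput at high concurrency rather than single-request speed.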

### Warnings in Logs:

```
WARNING: You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead.
ERROR: init 250 result=11
```

---

## 6. Configuration Files Reference

### Engine Args JSON (key parameters):

```json
{
  "dtype": "bfloat16",
  "tensor_parallel_size": 1,
  "pipeline_parallel_size": 1,
  "max_num_seqs": 256,
  "gpu_memory_utilization": 0.9,
  "block_size": 16,
  "enable_prefix_caching": false,
  "use_v2_block_manager": true,
  "swap_space": 4,
  "max_seq_len_to_capture": 8192,
  "scheduling_policy": "fcfs"
}
```
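The engine args above can be sanity-checked mechanically. A small sketch over a subset of the keys (the checks are my own heuristics based on common vLLM-style engine-arg semantics, not an official NIM or TensorRT-LLM lint):

```python
import json

# Subset of the engine-args JSON above, limited to the perf-relevant keys checked below.
ENGINE_ARGS = """
{
  "enable_prefix_caching": false,
  "max_seq_len_to_capture": 8192,
  "scheduling_policy": "fcfs"
}
"""

args = json.loads(ENGINE_ARGS)

findings = []
if not args["enable_prefix_caching"]:
    # Shared prompt prefixes (e.g. a fixed system prompt) are recomputed per request.
    findings.append("prefix caching off: repeated prompt prefixes are re-prefilled every time")
if args["max_seq_len_to_capture"] < 32768:
    # In vLLM-style engines, sequences longer than this fall back to eager execution.
    findings.append("CUDA-graph capture capped below max seq len: long sequences run eager")

for f in findings:
    print("-", f)
```

Both findings line up with the observations in section 5; the question for support is which of them NIM exposes a knob for in this profile.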

---

## 7. Log Files Available

- `tensorrt-llm.log` - Full TensorRT-LLM pod logs (404 KB, 3675 lines)

- `benchmark-service.log` - Benchmark service logs (7.7 MB)

- `backend.log` - Backend service logs (5.4 KB)

- `frontend.log` - Frontend logs (1.8 KB)

---

## 8. Questions for Support

1. What is the expected TPS for Mistral Small 24B on single H100 NVL with TensorRT-LLM 0.17.1?

2. Would enabling FP8 quantization improve throughput without significant quality loss?

3. Is the `ERROR: init 250 result=11` message during startup impacting performance?

4. Should prefix caching or chunked prefill be enabled for this workload?

5. Are there recommended TensorRT-LLM build parameters for optimal H100 NVL performance?

---

**Report Generated:** 2026-01-12