vLLM vs NVIDIA NIM

We are experimenting with running vLLM as an open-source solution and are also evaluating NVIDIA NIM services. Interestingly, our initial observations show vLLM performing better than NVIDIA NIM in our setup. Is this expected behavior, or are we possibly missing some configuration, optimization, or tuning on the NIM side? We have an internal debate going on: how can we show that NVIDIA NIM is actually the better option?

Hi Prasana, can you elaborate on which model and GPU were tested, which evaluation harness was used (e.g., GenAI-Perf, `vllm bench`), the concurrency level, and which primary metric you are comparing? It would also help if you could share reproduction code, so we can look into it further.
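For what it's worth, before comparing stacks it helps to pin down exactly how the two headline metrics are computed, since harnesses differ. A minimal sketch of one common convention (the `ttft_and_tps` helper and the synthetic timings below are mine, not from any particular harness):

```python
def ttft_and_tps(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """TTFT in ms and per-request decode throughput in tokens/s.

    token_times: wall-clock arrival time of each output token, in seconds.
    """
    ttft_ms = (token_times[0] - request_start) * 1000.0
    # Decode throughput: tokens after the first, over the time spent decoding them,
    # so prefill cost is not mixed into the tokens/s figure.
    decode_s = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_s if decode_s > 0 else float("nan")
    return ttft_ms, tps

# Synthetic run: first token at +50 ms, then one token every 20 ms, 128 tokens total.
token_times = [0.050 + 0.020 * i for i in range(128)]
ttft, tps = ttft_and_tps(0.0, token_times)
print(f"TTFT {ttft:.1f} ms, TPS {tps:.1f} tok/s")  # TTFT 50.0 ms, TPS 50.0 tok/s
```

If one side's harness includes the first token in its tokens/s while the other excludes it, the numbers won't be comparable, which is one common source of "X beats Y" surprises.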

TensorRT-LLM is supposed to be faster than vLLM, but our results show the opposite. Here are the details; please suggest how we can improve performance.

# NVIDIA NIM Performance Investigation Report

**Date:** 2026-01-12

**Issue:** Lower performance than comparable serving stacks (vLLM, in our tests)

**Pod:** mistral-nim-trtllm-5d644f97df-xdm4d

**Namespace:** nim

---

## 1. GPU Configuration

| Parameter | Value |
|-----------|-------|
| GPU Model | NVIDIA H100 NVL |
| Device ID | 2321:10de |
| Total Memory | 99,949,805,568 bytes (~93 GB) |
| Compute Capability | 9.0 |
| GPUs Used | 1 (device index 0) |

---

## 2. Model Configuration

| Parameter | Value |
|-----------|-------|
| Model Name | `mistral/mistral-small-24b-instruct-2501` |
| Architecture | MistralForCausalLM |
| Model Size | 24B parameters |
| Hidden Size | 5120 |
| Num Attention Heads | 32 |
| Num Key-Value Heads | 8 (Grouped Query Attention) |
| Num Hidden Layers | 40 |
| Intermediate Size | 32768 |
| Head Dimension | 128 |
| Vocab Size | 131072 |
| Max Position Embeddings | 32768 |
| Activation Function | SiLU |
| RoPE Theta | 100,000,000.0 |
| Torch Dtype | bfloat16 |

---

## 3. TensorRT-LLM Engine Configuration

| Parameter | Value |
|-----------|-------|
| TensorRT-LLM Version | 0.17.1 |
| NIM LLM API Version | 1.8.4 |
| PyTorch Version | 2.6.0a0+ecf3bae40a.nv25.1 |
| Profile Selected | `tensorrt_llm-trtllm_buildable-bf16-tp1-pp1` |
| Precision | bfloat16 |
| Tensor Parallel Size | 1 |
| Pipeline Parallel Size | 1 |
| Max Batch Size | 512 |
| Max Sequence Length | 32768 |
| Max Num Tokens | 8192 |
| Block Size | 16 |
| GPU Memory Utilization | 0.9 (90%) |
| Engine Size | 74,122,556,964 bytes (~69 GB) |
| Engine Build Time | 304.65 seconds |
| Available Device Memory | 99,949,805,568 bytes |
| KV Cache Dtype | auto |
| Quantization | None (full-precision bf16) |
| LoRA Enabled | False |
| Prefix Caching | Disabled |
| Chunked Prefill | Not enabled |
| Speculative Decoding | None |

---

## 4. Benchmark Results Summary

### Performance Metrics by Input/Output Token Configuration

| Benchmark ID | Input Tokens | Output Tokens | TTFT (ms) | TPS (tokens/s) | Status |
|--------------|--------------|---------------|-----------|----------------|--------|
| i128-o128-r1 | 128 | 128 | 41.88 | 45.56 | Success |
| i128-o128-r2 | 128 | 128 | 58.41 | 42.73 | Success |
| i128-o256-r1 | 128 | 256 | 43.57 | 45.06 | Success |
| i128-o256-r2 | 128 | 256 | 59.71 | 43.56 | Success |
| i256-o128-r1 | 256 | 128 | 43.63 | 44.43 | Success |
| i256-o128-r2 | 256 | 128 | 59.04 | 43.21 | Success |
| i256-o256-r1 | 256 | 256 | 43.26 | 44.51 | Success |
| i256-o256-r2 | 256 | 256 | 59.86 | 43.59 | Success |
| i512-o128-r1 | 512 | 128 | 45.26 | 49.74 | Success |
| i512-o128-r2 | 512 | 128 | 70.06 | 49.16 | Success |
| i512-o256-r1 | 512 | 256 | 67.06 | 48.81 | Success |
| i512-o256-r2 | 512 | 256 | 67.20 | 48.45 | Success |
| i1024-o128-r1 | 1024 | 128 | 47.09 | 45.41 | Success |
| i1024-o128-r2 | 1024 | 128 | 66.29 | 45.03 | Success |
| i1024-o256-r1 | 1024 | 256 | 50.23 | 45.43 | Success |
| i1024-o256-r2 | 1024 | 256 | 68.55 | 45.57 | Success |
| i2048-o128-r1 | 2048 | 128 | 47.44 | 45.45 | Success |
| i2048-o128-r2 | 2048 | 128 | 66.01 | 44.81 | Success |
| i2048-o256-r1 | 2048 | 256 | 51.43 | 45.61 | Success |
| i2048-o256-r2 | 2048 | 256 | 69.95 | 45.15 | Success |
| i4096-o128-r1 | 4096 | 128 | 65.60 | 46.50 | Success |
| i4096-o128-r2 | 4096 | 128 | 96.61 | 45.89 | Success |
| i4096-o256-r1 | 4096 | 256 | 85.82 | 45.42 | Success |
| i4096-o256-r2 | 4096 | 256 | 95.56 | 45.26 | Success |
| i8192-o128-r1 | 8192 | 128 | 65.87 | 46.50 | Success |
| i8192-o128-r2 | 8192 | 128 | 88.60 | 45.66 | Success |
| i8192-o256-r1 | 8192 | 256 | 72.61 | 45.47 | Success |
| i8192-o256-r2 | 8192 | 256 | 97.16 | 45.25 | Success |
| i4096-o1024-r8 | 4096 | 1024 | 267.43 | 44.15 | Success |

### Performance Summary Statistics

| Metric | Min | Max | Average |
|--------|-----|-----|---------|
| TTFT (ms) | 41.88 | 267.43 | ~65 |
| TPS (tokens/s) | 42.73 | 49.74 | ~45.5 |

---

## 5. Potential Performance Concerns

### Observations:

1. **Single GPU Usage**: Only using 1x H100 NVL with TP=1, PP=1. No multi-GPU parallelism.

2. **No Quantization**: Running full bf16 precision (~69GB engine). Could benefit from FP8/INT8 quantization for better throughput.

3. **Prefix Caching Disabled**: `enable_prefix_caching: false` - could improve TTFT for repeated prompts.

4. **Chunked Prefill Not Enabled**: Could help with long context performance.

5. **No Speculative Decoding**: Not using any speculative decoding which could improve TPS.

6. **TPS Range**: 42-50 tokens/s output throughput appears low for H100 NVL with a 24B model.

7. **TTFT Scaling**: TTFT increases significantly with input length (41.9 ms at 128 input tokens, up to 267.4 ms for the 4096-input / 1024-output run).
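A quick back-of-envelope on points 1 and 2 above, using the model and GPU numbers from sections 1 and 2. This is illustrative only; the real engine reserves additional workspace, as the ~69 GB engine size shows:

```python
GB = 1024 ** 3
bf16_bytes = 2

# Weights: 24B parameters at bf16, before any quantization.
weights_bytes = int(24e9) * bf16_bytes  # 48 GB of weights

# Per-token KV cache under GQA (section 2): K and V for every layer and KV head.
num_layers, num_kv_heads, head_dim = 40, 8, 128
kv_per_token = 2 * num_layers * num_kv_heads * head_dim * bf16_bytes
print(kv_per_token)  # 163840 bytes, i.e. ~160 KiB of KV cache per token

# Rough KV-token budget in the 90% memory fraction, after weights, on a ~93 GB card.
usable = 0.9 * 93 * GB - weights_bytes
kv_budget_tokens = int(usable / kv_per_token)
print(kv_budget_tokens)  # roughly 255k tokens of KV cache across all concurrent requests
```

FP8 quantization would roughly halve `weights_bytes`, freeing on the order of 24 GB for KV cache, which mainly helps throughput at high concurrency rather than single-request speed.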

### Warnings in Logs:

```
WARNING: You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead.
ERROR: init 250 result=11
```

---

## 6. Configuration Files Reference

### Engine Args JSON (key parameters):

```json
{
  "dtype": "bfloat16",
  "tensor_parallel_size": 1,
  "pipeline_parallel_size": 1,
  "max_num_seqs": 256,
  "gpu_memory_utilization": 0.9,
  "block_size": 16,
  "enable_prefix_caching": false,
  "use_v2_block_manager": true,
  "swap_space": 4,
  "max_seq_len_to_capture": 8192,
  "scheduling_policy": "fcfs"
}
```
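The engine args above can be sanity-checked mechanically. A small sketch over a subset of the keys (the checks are my own heuristics based on common vLLM-style engine-arg semantics, not an official NIM or TensorRT-LLM lint):

```python
import json

# Subset of the engine-args JSON above, limited to the perf-relevant keys checked below.
ENGINE_ARGS = """
{
  "enable_prefix_caching": false,
  "max_seq_len_to_capture": 8192,
  "scheduling_policy": "fcfs"
}
"""

args = json.loads(ENGINE_ARGS)

findings = []
if not args["enable_prefix_caching"]:
    # Shared prompt prefixes (e.g. a fixed system prompt) are recomputed per request.
    findings.append("prefix caching off: repeated prompt prefixes are re-prefilled every time")
if args["max_seq_len_to_capture"] < 32768:
    # In vLLM-style engines, sequences longer than this fall back to eager execution.
    findings.append("CUDA-graph capture capped below max seq len: long sequences run eager")

for f in findings:
    print("-", f)
```

Both findings line up with the observations in section 5; the question for support is which of them NIM exposes a knob for in this profile.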

---

## 7. Log Files Available

- `tensorrt-llm.log` - Full TensorRT-LLM pod logs (404 KB, 3675 lines)

- `benchmark-service.log` - Benchmark service logs (7.7 MB)

- `backend.log` - Backend service logs (5.4 KB)

- `frontend.log` - Frontend logs (1.8 KB)

---

## 8. Questions for Support

1. What is the expected TPS for Mistral Small 24B on single H100 NVL with TensorRT-LLM 0.17.1?

2. Would enabling FP8 quantization improve throughput without significant quality loss?

3. Is the `ERROR: init 250 result=11` message during startup impacting performance?

4. Should prefix caching or chunked prefill be enabled for this workload?

5. Are there recommended TensorRT-LLM build parameters for optimal H100 NVL performance?

---

**Report Generated:** 2026-01-12