When serving the GPT-OSS-20B model using TensorRT-LLM’s trtllm-serve command, the PyTorch memory fraction is automatically limited to 0.14 (14%) despite explicitly configuring KV cache memory through both YAML configuration files and CLI arguments.
## Environment

- Platform: NVIDIA DGX Spark / GB10
- GPU: RTX 6000 ADA (120GB VRAM)
- Container Image: nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
- TensorRT-LLM Version: 1.1.0rc3
- PyTorch Version: 2.8.0a0+5228986c39.nv25.6
- Model: openai/gpt-oss-20b (FP8 quantized)
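A quick way to confirm the device total that later shows up in the logs (~119.7 GiB) is to query it from inside the container; this sketch only assumes a CUDA device is visible to PyTorch:

```
# Query free/total device memory as PyTorch sees it inside the container.
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()  # returns (free, total) in bytes
gib = 1024 ** 3
print(f"device total: {total_bytes / gib:.2f} GiB")   # expected ~119.70 GiB per the logs
print(f"currently free: {free_bytes / gib:.2f} GiB")
```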
## Steps to Reproduce

### 1. Create YAML Configuration File
```
cat > /tmp/extra-llm-api-config.yml <<'EOF'
print_iter_log: false
max_batch_size: 64
max_seq_len: 16384
max_input_len: 16384
max_num_tokens: 4096
enable_chunked_prefill: true
kv_cache_config:
  dtype: "auto"
  max_tokens: 200000
  enable_block_reuse: true
  onboard_blocks: true
cuda_graph_config:
  enable_padding: true
disable_overlap_scheduler: true
enable_iter_perf_stats: true
EOF
```
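Since the nesting of these keys determines which options TensorRT-LLM actually receives, it can help to parse the file before launching. This is only a sketch and assumes PyYAML is importable in the container:

```
# Parse the extra-LLM-API config and print the top-level keys plus the
# kv_cache_config mapping, so mis-indented options are caught early.
import yaml  # assumption: PyYAML is available in the container

with open("/tmp/extra-llm-api-config.yml") as f:
    cfg = yaml.safe_load(f)

print(list(cfg))               # top-level option names
print(cfg["kv_cache_config"])  # should show max_tokens / free_gpu_memory_fraction
```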
### 2. Launch Server

```
trtllm-serve "/models/gpt-oss-20b" \
  --host 0.0.0.0 \
  --port 8355 \
  --max_batch_size 64 \
  --max_seq_len 16384 \
  --max_input_len 16384 \
  --max_num_tokens 4096 \
  --trust_remote_code \
  --extra_llm_api_options /tmp/extra-llm-api-config.yml
```
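Once the server is up, a minimal request confirms it is reachable. This sketch assumes the `requests` package is available and that trtllm-serve exposes its OpenAI-compatible `/health` and `/v1/chat/completions` routes; the `model` value may need adjusting to match how the served model is named in your setup:

```
# Probe the trtllm-serve endpoint started above (port 8355).
import requests  # assumption: requests is installed

base = "http://localhost:8355"
print(requests.get(f"{base}/health").status_code)  # expect 200 when ready

resp = requests.post(
    f"{base}/v1/chat/completions",
    json={
        "model": "/models/gpt-oss-20b",  # assumption: model is addressed by this name
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    },
)
print(resp.json())
```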
### 3. Alternative Configurations Tested

Attempt A: Using `free_gpu_memory_fraction`

```
kv_cache_config:
  dtype: "auto"
  free_gpu_memory_fraction: 0.8
```

Attempt B: Using `max_gpu_total_bytes`

```
kv_cache_config:
  dtype: "auto"
  max_gpu_total_bytes: 85899345920  # 80 GiB
```
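For reference, that byte count is exactly 80 GiB:

```
# max_gpu_total_bytes above, expressed as 80 GiB.
print(80 * 1024**3)  # 85899345920
```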
Attempt C: CLI arguments only (minimal YAML)

```
trtllm-serve "/models/gpt-oss-20b" \
  --host 0.0.0.0 --port 8355 \
  --max_batch_size 64 \
  --max_seq_len 16384 \
  --max_input_len 16384 \
  --max_num_tokens 4096 \
  --trust_remote_code
```
---
## Actual Behavior
### Initial Memory Profiling Stage
```
[12/16/2025-06:47:10] [TRT-LLM] [I] max_tokens is set by kv_cache_config.max_tokens: 4128
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.19 GiB for max tokens in paged KV cache (4128).
[12/16/2025-06:47:10] [TRT-LLM] [I] max_seq_len=4128, max_num_requests=64, max_num_tokens=4096, max_batch_size=64
```
Configured `max_tokens: 200000` is ignored and reduced to 4128.
### After Memory Profiling
```
[12/16/2025-06:47:17] [TRT-LLM] [I] Peak memory during memory usage profiling (torch + non-torch): 107.45 GiB
[12/16/2025-06:47:17] [TRT-LLM] [I] available KV cache memory when calculating max tokens: 11.20 GiB
[12/16/2025-06:47:17] [TRT-LLM] [I] fraction is set 0.9
[12/16/2025-06:47:17] [TRT-LLM] [I] device total memory 119.70 GiB
[12/16/2025-06:47:17] [TRT-LLM] [I] max_tokens is set by kv_cache_config.max_tokens: 244571
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 11.20 GiB for max tokens in paged KV cache (244576).
[12/16/2025-06:47:17] [TRT-LLM] [I] max_seq_len=16384, max_num_requests=64, max_num_tokens=4096, max_batch_size=64
```
After memory profiling, `max_tokens` is auto-adjusted to 244571 and the configured `max_seq_len` of 16384 is restored, but the configuration is still only partially applied: the KV cache receives just 11.20 GiB.
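The 11.20 GiB figure appears to follow from the profiling numbers in the log. Assuming the available KV cache memory is computed as (device total minus profiling peak, plus the preliminary 0.19 GiB profiling cache) times the 0.9 fraction, the logged values reproduce it:

```
# Reconstructing the logged "available KV cache memory" of 11.20 GiB.
# All values are taken from the log above; the formula itself is an assumption.
device_total   = 119.70  # GiB
profiling_peak = 107.45  # GiB (torch + non-torch)
preliminary_kv = 0.19    # GiB allocated for the 4128-token profiling cache
fraction       = 0.9

available = (device_total - profiling_peak + preliminary_kv) * fraction
print(f"{available:.2f} GiB")  # ~11.20 GiB, matching the log
```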
### Final PyTorch Memory Allocation
```
[12/16/2025-06:47:20] [TRT-LLM] [I] Setting PyTorch memory fraction to 0.14019781316160024 (16.781333923339844 GiB)
```
**Issue**: PyTorch is limited to approximately 14% of total GPU memory.
---
## Expected Behavior
1. **KV Cache Memory Allocation**: Configured `max_tokens: 200000` or `free_gpu_memory_fraction: 0.8` should be accurately applied from the initial stage.
2. **PyTorch Memory**: The configured values should be applied as given rather than overridden by memory profiling, or the fraction should at least remain in the 0.8-0.9 range.
3. **Consistency**: CLI arguments and YAML settings should be applied without conflicts.
---
## Memory Analysis
### Current Memory Distribution
```
Total GPU Memory: 119.70 GB
Model Weights: 13.34 GB (inside torch)
CUDA Graphs/NCCL: 93.67 GB (outside torch)
KV Cache: 11.20 GB
Peak Memory: 107.45 GB
```
### Issues Identified
- **CUDA Graphs + NCCL**: 93.67 GB consumed (78% of total)
- **PyTorch Limitation**: restricted to 16.78 GB (14% of total)
- **Available KV Cache**: 11.20 GB, significantly less than the configured target (percentages checked in the sketch below)
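These shares follow from the 119.70 GiB device total reported in the logs:

```
# Percentage shares of the 119.70 GiB device total quoted in the list above.
total = 119.70
for name, gib in [("CUDA graphs + NCCL (non-torch)", 93.67),
                  ("PyTorch limit", 16.78),
                  ("KV cache", 11.20)]:
    print(f"{name}: {gib:.2f} GiB = {gib / total:.1%}")
# -> roughly 78%, 14%, and 9% respectively
```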
### Ideal Memory Distribution (80% KV cache target)
```
Model Weights: ~15 GB
KV Cache: ~96 GB (80% of 120GB)
Other Overhead: ~9 GB
```

---
## Impact
- Concurrent User Limitation (see the estimate after this list):
  - Current: ~15 users (with 16K context)
  - Expected: ~47 users (with 80GB KV cache)
- Maximum Sequence Length:
  - Configured: 16384 tokens
  - Actually Possible: 244576 tokens (after memory profiling)
  - However, may be unstable due to the PyTorch limitation
- Production RAG System: performance degradation expected in multi-user concurrent processing scenarios
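The "~15 users" figure above is consistent with the profiled KV cache capacity divided by the configured context length:

```
# Rough concurrency estimate from the post-profiling KV cache capacity.
kv_cache_tokens  = 244_576  # tokens, from the log after memory profiling
context_per_user = 16_384   # configured max_seq_len
print(kv_cache_tokens / context_per_user)  # ~14.9 concurrent 16K-token contexts
```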
---
## Container Launch Command

```
docker run \
  --name trtllm_server --rm -it \
  --gpus all --ipc=host --network host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /home/coga/trtllm/models:/models:ro \
  -v /home/coga/trtllm/encodings:/tiktoken:ro \
  -v /home/coga/trtllm/engines:/engines \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  bash
```