When serving the GPT-OSS-20B model using TensorRT-LLM’s trtllm-serve command, the PyTorch memory fraction is automatically limited to 0.14 (14%) despite explicitly configuring KV cache memory through both YAML configuration files and CLI arguments.
## Environment

- Platform: NVIDIA DGX Spark / GB10
- GPU: RTX 6000 ADA (120GB VRAM)
- Container Image: nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
- TensorRT-LLM Version: 1.1.0rc3
- PyTorch Version: 2.8.0a0+5228986c39.nv25.6
- Model: openai/gpt-oss-20b (FP8 quantized)
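A quick way to confirm the device total that later shows up in the logs (~119.7 GiB) is to query it from inside the container; this sketch only assumes a CUDA device is visible to PyTorch:

```
# Query free/total device memory as PyTorch sees it inside the container.
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()  # returns (free, total) in bytes
gib = 1024 ** 3
print(f"device total: {total_bytes / gib:.2f} GiB")   # expected ~119.70 GiB per the logs
print(f"currently free: {free_bytes / gib:.2f} GiB")
```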
## Steps to Reproduce

### 1. Create YAML Configuration File
```
cat > /tmp/extra-llm-api-config.yml <<'EOF'
print_iter_log: false
max_batch_size: 64
max_seq_len: 16384
max_input_len: 16384
max_num_tokens: 4096
enable_chunked_prefill: true
kv_cache_config:
  dtype: "auto"
  max_tokens: 200000
  enable_block_reuse: true
  onboard_blocks: true
cuda_graph_config:
  enable_padding: true
disable_overlap_scheduler: true
enable_iter_perf_stats: true
EOF
```
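Since the nesting of these keys determines which options TensorRT-LLM actually receives, it can help to parse the file before launching. This is only a sketch and assumes PyYAML is importable in the container:

```
# Parse the extra-LLM-API config and print the top-level keys plus the
# kv_cache_config mapping, so mis-indented options are caught early.
import yaml  # assumption: PyYAML is available in the container

with open("/tmp/extra-llm-api-config.yml") as f:
    cfg = yaml.safe_load(f)

print(list(cfg))               # top-level option names
print(cfg["kv_cache_config"])  # should show max_tokens / free_gpu_memory_fraction
```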
### 2. Launch Server

```
trtllm-serve "/models/gpt-oss-20b" \
  --host 0.0.0.0 \
  --port 8355 \
  --max_batch_size 64 \
  --max_seq_len 16384 \
  --max_input_len 16384 \
  --max_num_tokens 4096 \
  --trust_remote_code \
  --extra_llm_api_options /tmp/extra-llm-api-config.yml
```
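Once the server is up, a minimal request confirms it is reachable. This sketch assumes the `requests` package is available and that trtllm-serve exposes its OpenAI-compatible `/health` and `/v1/chat/completions` routes; the `model` value may need adjusting to match how the served model is named in your setup:

```
# Probe the trtllm-serve endpoint started above (port 8355).
import requests  # assumption: requests is installed

base = "http://localhost:8355"
print(requests.get(f"{base}/health").status_code)  # expect 200 when ready

resp = requests.post(
    f"{base}/v1/chat/completions",
    json={
        "model": "/models/gpt-oss-20b",  # assumption: model is addressed by this name
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    },
)
print(resp.json())
```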
### 3. Alternative Configurations Tested

Attempt A: Using `free_gpu_memory_fraction`

```
kv_cache_config:
  dtype: "auto"
  free_gpu_memory_fraction: 0.8
```

Attempt B: Using `max_gpu_total_bytes`

```
kv_cache_config:
  dtype: "auto"
  max_gpu_total_bytes: 85899345920  # 80 GiB
```
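For reference, that byte count is exactly 80 GiB:

```
# max_gpu_total_bytes above, expressed as 80 GiB.
print(80 * 1024**3)  # 85899345920
```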
Attempt C: CLI arguments only (minimal YAML)

```
trtllm-serve "/models/gpt-oss-20b" \
  --host 0.0.0.0 --port 8355 \
  --max_batch_size 64 \
  --max_seq_len 16384 \
  --max_input_len 16384 \
  --max_num_tokens 4096 \
  --trust_remote_code
```
---
## Actual Behavior
### Initial Memory Profiling Stage
```
[12/16/2025-06:47:10] [TRT-LLM] [I] max_tokens is set by kv_cache_config.max_tokens: 4128
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.19 GiB for max tokens in paged KV cache (4128).
[12/16/2025-06:47:10] [TRT-LLM] [I] max_seq_len=4128, max_num_requests=64, max_num_tokens=4096, max_batch_size=64
```
Configured `max_tokens: 200000` is ignored and reduced to 4128.
### After Memory Profiling
```
[12/16/2025-06:47:17] [TRT-LLM] [I] Peak memory during memory usage profiling (torch + non-torch): 107.45 GiB
[12/16/2025-06:47:17] [TRT-LLM] [I] available KV cache memory when calculating max tokens: 11.20 GiB
[12/16/2025-06:47:17] [TRT-LLM] [I] fraction is set 0.9
[12/16/2025-06:47:17] [TRT-LLM] [I] device total memory 119.70 GiB
[12/16/2025-06:47:17] [TRT-LLM] [I] max_tokens is set by kv_cache_config.max_tokens: 244571
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 11.20 GiB for max tokens in paged KV cache (244576).
[12/16/2025-06:47:17] [TRT-LLM] [I] max_seq_len=16384, max_num_requests=64, max_num_tokens=4096, max_batch_size=64
```
After memory profiling, `max_tokens` is auto-adjusted to 244571 and the configured `max_seq_len` of 16384 is restored, but the configuration is still only partially applied: the KV cache receives just 11.20 GiB.
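The 11.20 GiB figure appears to follow from the profiling numbers in the log. Assuming the available KV cache memory is computed as (device total minus profiling peak, plus the preliminary 0.19 GiB profiling cache) times the 0.9 fraction, the logged values reproduce it:

```
# Reconstructing the logged "available KV cache memory" of 11.20 GiB.
# All values are taken from the log above; the formula itself is an assumption.
device_total   = 119.70  # GiB
profiling_peak = 107.45  # GiB (torch + non-torch)
preliminary_kv = 0.19    # GiB allocated for the 4128-token profiling cache
fraction       = 0.9

available = (device_total - profiling_peak + preliminary_kv) * fraction
print(f"{available:.2f} GiB")  # ~11.20 GiB, matching the log
```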
### Final PyTorch Memory Allocation
```
[12/16/2025-06:47:20] [TRT-LLM] [I] Setting PyTorch memory fraction to 0.14019781316160024 (16.781333923339844 GiB)
```
**Issue**: PyTorch is limited to approximately 14% of total GPU memory.
---
## Expected Behavior
1. **KV Cache Memory Allocation**: Configured `max_tokens: 200000` or `free_gpu_memory_fraction: 0.8` should be accurately applied from the initial stage.
2. **PyTorch Memory**: The configured values should be applied as given rather than overridden by memory profiling, or the fraction should at least remain in the 0.8-0.9 range.
3. **Consistency**: CLI arguments and YAML settings should be applied without conflicts.
---
## Memory Analysis
### Current Memory Distribution
```
Total GPU Memory: 119.70 GB
Model Weights: 13.34 GB (inside torch)
CUDA Graphs/NCCL: 93.67 GB (outside torch)
KV Cache: 11.20 GB
Peak Memory: 107.45 GB
```
### Issues Identified
- **CUDA Graphs + NCCL**: 93.67 GB consumed (78% of total)
- **PyTorch Limitation**: restricted to 16.78 GB (14% of total)
- **Available KV Cache**: 11.20 GB, significantly less than the configured target (percentages checked in the sketch below)
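These shares follow from the 119.70 GiB device total reported in the logs:

```
# Percentage shares of the 119.70 GiB device total quoted in the list above.
total = 119.70
for name, gib in [("CUDA graphs + NCCL (non-torch)", 93.67),
                  ("PyTorch limit", 16.78),
                  ("KV cache", 11.20)]:
    print(f"{name}: {gib:.2f} GiB = {gib / total:.1%}")
# -> roughly 78%, 14%, and 9% respectively
```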
### Ideal Memory Distribution (80% KV cache target)
```
Model Weights: ~15 GB
KV Cache: ~96 GB (80% of 120GB)
Other Overhead: ~9 GB
```

---
## Impact
- Concurrent User Limitation (see the estimate after this list):
  - Current: ~15 users (with 16K context)
  - Expected: ~47 users (with 80GB KV cache)
- Maximum Sequence Length:
  - Configured: 16384 tokens
  - Actually Possible: 244576 tokens (after memory profiling)
  - However, may be unstable due to the PyTorch limitation
- Production RAG System: performance degradation expected in multi-user concurrent processing scenarios
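The "~15 users" figure above is consistent with the profiled KV cache capacity divided by the configured context length:

```
# Rough concurrency estimate from the post-profiling KV cache capacity.
kv_cache_tokens  = 244_576  # tokens, from the log after memory profiling
context_per_user = 16_384   # configured max_seq_len
print(kv_cache_tokens / context_per_user)  # ~14.9 concurrent 16K-token contexts
```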
---
## Container Launch Command

```
docker run \
  --name trtllm_server --rm -it \
  --gpus all --ipc=host --network host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /home/coga/trtllm/models:/models:ro \
  -v /home/coga/trtllm/encodings:/tiktoken:ro \
  -v /home/coga/trtllm/engines:/engines \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  bash
```