# TensorRT-LLM PyTorch Memory Fraction Automatically Limited to 0.14 Despite KV Cache Configuration

When serving the GPT-OSS-20B model with TensorRT-LLM's `trtllm-serve` command, the PyTorch memory fraction is automatically limited to 0.14 (14%), even though KV cache memory is explicitly configured through both the YAML configuration file and CLI arguments.

## Environment

- **Platform**: NVIDIA DGX Spark / GB10
- **GPU**: RTX 6000 Ada (120 GB VRAM)
- **Container Image**: nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
- **TensorRT-LLM Version**: 1.1.0rc3
- **PyTorch Version**: 2.8.0a0+5228986c39.nv25.6
- **Model**: openai/gpt-oss-20b (FP8 quantized)

## Steps to Reproduce

### 1. Create YAML Configuration File

```
cat > /tmp/extra-llm-api-config.yml <<'EOF'
print_iter_log: false
max_batch_size: 64
max_seq_len: 16384
max_input_len: 16384
max_num_tokens: 4096
enable_chunked_prefill: true

kv_cache_config:
  dtype: "auto"
  max_tokens: 200000
  enable_block_reuse: true
  onboard_blocks: true

cuda_graph_config:
  enable_padding: true

disable_overlap_scheduler: true
enable_iter_perf_stats: true
EOF
```
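
For reference, a minimal sketch of how the same KV cache settings could be expressed through the TensorRT-LLM Python LLM API instead of YAML. Field names mirror the YAML above; the exact keyword arguments accepted by `LLM` may vary between versions, so treat this as an illustration rather than a verified configuration:

```
# Sketch: KV cache settings from the YAML above, expressed via the Python LLM API.
# The exact set of LLM() kwargs is an assumption and may differ by TRT-LLM version.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    dtype="auto",
    max_tokens=200000,
    enable_block_reuse=True,
)

llm = LLM(
    model="/models/gpt-oss-20b",
    max_batch_size=64,
    max_seq_len=16384,
    max_num_tokens=4096,
    kv_cache_config=kv_cache_config,
)
```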

### 2. Launch Server

trtllm-serve "/models/gpt-oss-20b" \
  --host 0.0.0.0 \
  --port 8355 \
  --max_batch_size 64 \
  --max_seq_len 16384 \
  --max_input_len 16384 \
  --max_num_tokens 4096 \
  --trust_remote_code \
  --extra_llm_api_options /tmp/extra-llm-api-config.yml
```
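
For completeness, a minimal client sketch against the OpenAI-compatible route exposed by `trtllm-serve`; the model name and prompt are placeholders, not values taken from the actual test run:

```
# Minimal smoke test against the OpenAI-compatible endpoint served on port 8355.
import requests

resp = requests.post(
    "http://localhost:8355/v1/chat/completions",
    json={
        "model": "gpt-oss-20b",  # placeholder; use the served model name
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```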

### 3. Alternative Configurations Tested

#### Attempt A: Using `free_gpu_memory_fraction`

```
kv_cache_config:
  dtype: "auto"
  free_gpu_memory_fraction: 0.8
```

#### Attempt B: Using `max_gpu_total_bytes`

```
kv_cache_config:
  dtype: "auto"
  max_gpu_total_bytes: 85899345920  # 80GB
```
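
A quick arithmetic check of the byte value used above:

```
# 80 GiB expressed in bytes; matches the max_gpu_total_bytes value above.
print(80 * 1024**3)  # 85899345920
```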

#### Attempt C: CLI arguments only (minimal YAML)

trtllm-serve "/models/gpt-oss-20b" \
  --host 0.0.0.0 --port 8355 \
  --max_batch_size 64 \
  --max_seq_len 16384 \
  --max_input_len 16384 \
  --max_num_tokens 4096 \
  --trust_remote_code
```

---

## Actual Behavior

### Initial Memory Profiling Stage
```
[12/16/2025-06:47:10] [TRT-LLM] [I] max_tokens is set by kv_cache_config.max_tokens: 4128
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.19 GiB for max tokens in paged KV cache (4128).
[12/16/2025-06:47:10] [TRT-LLM] [I] max_seq_len=4128, max_num_requests=64, max_num_tokens=4096, max_batch_size=64
```

The configured `max_tokens: 200000` is ignored at this stage; `max_tokens` is instead capped at 4128.

### After Memory Profiling
```
[12/16/2025-06:47:17] [TRT-LLM] [I] Peak memory during memory usage profiling (torch + non-torch): 107.45 GiB
[12/16/2025-06:47:17] [TRT-LLM] [I] available KV cache memory when calculating max tokens: 11.20 GiB
[12/16/2025-06:47:17] [TRT-LLM] [I] fraction is set 0.9
[12/16/2025-06:47:17] [TRT-LLM] [I] device total memory 119.70 GiB
[12/16/2025-06:47:17] [TRT-LLM] [I] max_tokens is set by kv_cache_config.max_tokens: 244571
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 11.20 GiB for max tokens in paged KV cache (244576).
[12/16/2025-06:47:17] [TRT-LLM] [I] max_seq_len=16384, max_num_requests=64, max_num_tokens=4096, max_batch_size=64
```

After memory profiling, `max_tokens` is auto-adjusted to 244571, so the configuration is only partially applied.
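
The per-token KV cache footprint implied by the two profiling passes is roughly the same, which suggests profiling only changes the pool size, not the per-token cost (values are read off the log lines above; the first is based on a rounded 0.19 GiB, so it is approximate):

```
# Per-token KV cache size implied by the log lines above (bytes per token).
GiB = 1024**3
first_pass = 0.19 * GiB / 4128       # ~49,400 bytes/token (0.19 GiB is rounded in the log)
second_pass = 11.20 * GiB / 244576   # ~49,200 bytes/token
print(first_pass, second_pass)
```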

### Final PyTorch Memory Allocation
```
[12/16/2025-06:47:20] [TRT-LLM] [I] Setting PyTorch memory fraction to 0.14019781316160024 (16.781333923339844 GiB)
```

**Issue**: PyTorch is limited to approximately 14% of total GPU memory.
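
The fraction and the absolute size in that log line are at least internally consistent with the logged device total:

```
# Cross-check of the logged PyTorch memory fraction against the device total.
total_gib = 119.70
fraction = 0.14019781316160024
print(total_gib * fraction)  # ~16.78 GiB, matching the log line above
```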

---

## Expected Behavior

1. **KV Cache Memory Allocation**: The configured `max_tokens: 200000` or `free_gpu_memory_fraction: 0.8` should be applied accurately from the initial profiling stage onward.

2. **PyTorch Memory**: The configured values should be applied directly, without being overridden by memory profiling, or the resulting fraction should at minimum stay in the 0.8-0.9 range.

3. **Consistency**: CLI arguments and YAML settings should be applied without conflicting with each other.

---

## Memory Analysis

### Current Memory Distribution
```
Total GPU Memory:   119.70 GB
Model Weights:        13.34 GB (inside torch)
CUDA Graphs/NCCL:     93.67 GB (outside torch)
KV Cache:             11.20 GB
Peak Memory:         107.45 GB
```
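
A rough reconciliation of these numbers, under the assumption that the KV cache pool is sized from whatever memory remains below the profiling peak (this is an assumption about the sizing logic, not a confirmed formula):

```
# Rough reconciliation of the logged memory numbers (all values in GiB).
total = 119.70
peak = 107.45               # torch + non-torch peak during profiling
fraction = 0.9              # "fraction is set 0.9" in the log
headroom = total - peak     # ~12.25 GiB left at the profiling peak
print(headroom * fraction)  # ~11.0 GiB, close to the reported 11.20 GiB KV cache
print(13.34 + 93.67)        # ~107.0 GiB, close to the 107.45 GiB profiling peak
```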

### Issues Identified
- **CUDA Graphs + NCCL**: 93.67 GB (78% of total GPU memory) consumed outside of PyTorch
- **PyTorch Limitation**: the PyTorch allocator is restricted to 16.78 GB (14%)
- **Available KV Cache**: only 11.20 GB, far below the configured target

### Ideal Memory Distribution (80% KV cache target)
```
Model Weights:     ~15 GB
KV Cache:          ~96 GB (80% of 120GB)
Other Overhead:     ~9 GB
```

## Impact

1. **Concurrent User Limitation** (rough check below):
   - Current: ~15 users (with 16K context)
   - Expected: ~47 users (with 80 GB KV cache)

2. **Maximum Sequence Length**:
   - Configured: 16384 tokens
   - Actually possible: 244576 tokens (after memory profiling)
   - However, this may be unstable due to the PyTorch memory limitation

3. **Production RAG System**: performance degradation is expected in multi-user concurrent processing scenarios
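
A rough check of the current concurrency figure, using only numbers from the logs and the configuration above:

```
# Back-of-the-envelope check of the "~15 users" figure.
kv_capacity_tokens = 244576   # KV cache capacity reported after profiling
context_per_user = 16384      # configured max_seq_len
print(kv_capacity_tokens / context_per_user)  # ~14.9 concurrent 16K-token contexts
```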

## Container Launch Command

```
docker run \
  --name trtllm_server --rm -it \
  --gpus all --ipc=host --network host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /home/coga/trtllm/models:/models:ro \
  -v /home/coga/trtllm/encodings:/tiktoken:ro \
  -v /home/coga/trtllm/engines:/engines \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  bash
```