Environment
- TensorRT-LLM Version: 1.3.0rc1
- Hardware: NVIDIA DGX Spark (128GB Unified Memory)
- Model: google/translategemma-27b-it (Gemma 3 based translation model)
Issue Description
I’m trying to serve the TranslateGemma-27B-IT model using trtllm-serve, but it fails with a RoPE configuration error. I tested with two different model versions:
- Original BF16 model - downloaded directly from HuggingFace
- Quantized model - PTQ quantized using NVIDIA Model Optimizer (weights: NVFP4, KV cache: FP8)
Both versions produce the same error.
Command Used
trtllm-serve "/app/models/translategemma-27b-it_nvfp4_kv_fp8" \
  --host 0.0.0.0 --port 8355 \
  --extra_llm_api_options /config/extra-translategemma-27b-it-config.yml
Configuration File (extra-translategemma-27b-it-config.yml)
# TranslateGemma-27B-IT TensorRT-LLM Serving Configuration
# Target Hardware: NVIDIA DGX Spark (128GB Unified Memory)
# Logging
print_iter_log: false
trust_remote_code: true
# Batch settings
max_batch_size: 64
# Input/Output length (TranslateGemma spec: 2K input context)
max_input_len: 2048
max_seq_len: 4096
max_num_tokens: 8192
# KV Cache config
kv_cache_config:
  max_tokens: 131072
  use_uvm: true
  tokens_per_block: 64
  enable_block_reuse: true
# CUDA Graph
cuda_graph_config:
  enable_padding: true
# Scheduler
disable_overlap_scheduler: true
enable_chunked_prefill: true
Error Log
[TensorRT-LLM] TensorRT LLM version: 1.3.0rc1
[02/02/2026-03:12:19] [TRT-LLM] [W] Overriding kv_cache_config
[02/02/2026-03:12:19] [TRT-LLM] [I] Overriding max_batch_size from build_config to 64
[02/02/2026-03:12:19] [TRT-LLM] [I] Overriding max_num_tokens from build_config to 8192
[02/02/2026-03:12:19] [TRT-LLM] [I] Overriding max_seq_len from build_config to 4096
[02/02/2026-03:12:19] [TRT-LLM] [I] Using LLM with PyTorch backend
[02/02/2026-03:12:19] [TRT-LLM] [I] Found /app/models/translategemma-27b-it_nvfp4_kv_fp8/hf_quant_config.json, pre-quantized checkpoint is used.
[02/02/2026-03:12:19] [TRT-LLM] [I] Setting quant_algo=NVFP4 form HF quant config.
[02/02/2026-03:12:19] [TRT-LLM] [I] Setting kv_cache_quant_algo=FP8 form HF quant config.
[02/02/2026-03:12:39] [TRT-LLM] [I] Fallback to regular model init: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_loader.py", line 240, in load
    model = AutoModelForCausalLM.from_config(config_copy)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 57, in from_config
    model = cls(config)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_gemma3vl.py", line 208, in __init__
    self.llm = Gemma3ForCausalLM(llm_model_config)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_gemma3.py", line 265, in __init__
    super().__init__(Gemma3TextModel(model_config),
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_gemma3.py", line 212, in __init__
    Gemma3DecoderLayer(
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_gemma3.py", line 134, in __init__
    self.self_attn = Gemma3Attention(
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_gemma3.py", line 64, in __init__
    rope_params = RopeParams.from_config(config)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/attention_backend/interface.py", line 461, in from_config
    assert not set(hf_rope_parameters.keys()).issubset(
AssertionError: Per-layer-type RoPE configuration is not supported yet.
[02/02/2026-03:12:39] [TRT-LLM] [E] Failed to initialize executor on rank 0: Per-layer-type RoPE configuration is not supported yet.
RuntimeError: Executor worker returned error
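For anyone trying to reproduce this, here is a minimal sketch of how a checkpoint's config.json could be checked for the condition the assertion appears to reject. Several things here are assumptions, not taken from TensorRT-LLM: the traceback truncates the argument to issubset(), so the layer-type names ("full_attention", "sliding_attention") are guessed from the Hugging Face Gemma 3 config convention. The "rope_parameters" key is inferred from the variable name hf_rope_parameters, and the "text_config" nesting is inferred from the traceback going through modeling_gemma3vl.py. The helper names are hypothetical.

```python
import json
from pathlib import Path

# ASSUMPTION: the traceback cuts off the set passed to issubset(); these names
# follow the Hugging Face Gemma 3 layer-type convention, not TensorRT-LLM source.
ASSUMED_LAYER_TYPES = {"full_attention", "sliding_attention"}


def has_per_layer_type_rope(config: dict) -> bool:
    """Return True if the RoPE settings look keyed by layer type."""
    # ASSUMPTION: multimodal-style configs nest the LLM settings under "text_config".
    text_config = config.get("text_config", config)
    # ASSUMPTION: the key name is inferred from the `hf_rope_parameters` variable.
    rope = text_config.get("rope_parameters")
    if not isinstance(rope, dict) or not rope:
        return False
    # Mirrors the visible part of the assertion: every key is a layer-type name.
    return set(rope.keys()).issubset(ASSUMED_LAYER_TYPES)


def load_config(model_dir: str) -> dict:
    """Read config.json from a local checkpoint directory."""
    return json.loads((Path(model_dir) / "config.json").read_text())


# Example usage (path taken from the command above):
# cfg = load_config("/app/models/translategemma-27b-it_nvfp4_kv_fp8")
# print(has_per_layer_type_rope(cfg))
```

If this returns True for the checkpoint, that would be consistent with the failure coming from the config layout rather than from the NVFP4/FP8 quantization, which matches both model versions failing the same way.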
Questions
- Is TranslateGemma-27B-IT (Gemma 3 based) currently supported in TensorRT-LLM 1.3.0rc1?
- Gemma 3 uses per-layer-type RoPE parameters (global attention vs sliding window attention). Is there a timeline for when this feature will be supported?
- Are there any workarounds to serve this model with TensorRT-LLM?
Thank you for your help!