[TensorRT-LLM 1.3.0rc1] TranslateGemma-27B-IT model fails with "Per-layer-type RoPE configuration is not supported yet"

Environment

  • TensorRT-LLM Version: 1.3.0rc1
  • Hardware: NVIDIA DGX Spark (128GB Unified Memory)
  • Model: google/translategemma-27b-it (Gemma 3 based translation model)

Issue Description

I’m trying to serve the TranslateGemma-27B-IT model using trtllm-serve, but it fails with a RoPE configuration error. I tested with two different model versions:

  1. Original BF16 model - downloaded directly from HuggingFace
  2. Quantized model - post-training quantized (PTQ) with NVIDIA Model Optimizer (weights: NVFP4, KV cache: FP8)

Both versions produce the same error.

Command Used

trtllm-serve "/app/models/translategemma-27b-it_nvfp4_kv_fp8" \
  --host 0.0.0.0 --port 8355 \
  --extra_llm_api_options /config/extra-translategemma-27b-it-config.yml

Configuration File (extra-translategemma-27b-it-config.yml)

# TranslateGemma-27B-IT TensorRT-LLM Serving Configuration
# Target Hardware: NVIDIA DGX Spark (128GB Unified Memory)

# Logging
print_iter_log: false

# Model loading
trust_remote_code: true

# Batch settings
max_batch_size: 64

# Input/Output length (TranslateGemma spec: 2K input context)
max_input_len: 2048
max_seq_len: 4096
max_num_tokens: 8192

# KV Cache config
kv_cache_config:
  max_tokens: 131072
  use_uvm: true
  tokens_per_block: 64
  enable_block_reuse: true

# CUDA Graph
cuda_graph_config:
  enable_padding: true

# Scheduler
disable_overlap_scheduler: true
enable_chunked_prefill: true

Error Log

[TensorRT-LLM] TensorRT LLM version: 1.3.0rc1
[02/02/2026-03:12:19] [TRT-LLM] [W] Overriding kv_cache_config
[02/02/2026-03:12:19] [TRT-LLM] [I] Overriding max_batch_size from build_config to 64
[02/02/2026-03:12:19] [TRT-LLM] [I] Overriding max_num_tokens from build_config to 8192
[02/02/2026-03:12:19] [TRT-LLM] [I] Overriding max_seq_len from build_config to 4096
[02/02/2026-03:12:19] [TRT-LLM] [I] Using LLM with PyTorch backend
[02/02/2026-03:12:19] [TRT-LLM] [I] Found /app/models/translategemma-27b-it_nvfp4_kv_fp8/hf_quant_config.json, pre-quantized checkpoint is used.
[02/02/2026-03:12:19] [TRT-LLM] [I] Setting quant_algo=NVFP4 form HF quant config.
[02/02/2026-03:12:19] [TRT-LLM] [I] Setting kv_cache_quant_algo=FP8 form HF quant config.

[02/02/2026-03:12:39] [TRT-LLM] [I] Fallback to regular model init: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_loader.py", line 240, in load
    model = AutoModelForCausalLM.from_config(config_copy)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 57, in from_config
    model = cls(config)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_gemma3vl.py", line 208, in __init__
    self.llm = Gemma3ForCausalLM(llm_model_config)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_gemma3.py", line 265, in __init__
    super().__init__(Gemma3TextModel(model_config),
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_gemma3.py", line 212, in __init__
    Gemma3DecoderLayer(
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_gemma3.py", line 134, in __init__
    self.self_attn = Gemma3Attention(
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_gemma3.py", line 64, in __init__
    rope_params = RopeParams.from_config(config)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/attention_backend/interface.py", line 461, in from_config
    assert not set(hf_rope_parameters.keys()).issubset(
AssertionError: Per-layer-type RoPE configuration is not supported yet.

[02/02/2026-03:12:39] [TRT-LLM] [E] Failed to initialize executor on rank 0: Per-layer-type RoPE configuration is not supported yet.

RuntimeError: Executor worker returned error
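
For anyone who wants to confirm that a checkpoint hits this code path before launching the server, here is a minimal sketch that inspects config.json. It is not the TensorRT-LLM implementation: the traceback only shows that the assertion compares the keys of the RoPE parameters against some set, so the layer-type names in ASSUMED_LAYER_TYPES, the "rope_parameters" and "text_config" key names, and both helper functions are assumptions made for illustration.

```python
import json
from pathlib import Path

# Assumption: layer-type names as they appear in recent Hugging Face configs.
# The actual set used by the TensorRT-LLM assertion is not visible in the log.
ASSUMED_LAYER_TYPES = {"full_attention", "sliding_attention"}


def rope_is_per_layer_type(config: dict) -> bool:
    """Return True if the RoPE parameters look keyed by attention layer type."""
    # Multimodal Gemma 3 checkpoints nest the language model settings
    # under "text_config"; fall back to the top level otherwise.
    text_config = config.get("text_config", config)
    rope = text_config.get("rope_parameters")
    if not isinstance(rope, dict) or not rope:
        return False
    # A flat config has keys such as "rope_type" or "rope_theta";
    # a per-layer-type config has only layer-type names as keys.
    return set(rope.keys()).issubset(ASSUMED_LAYER_TYPES)


def check_checkpoint(model_dir: str) -> bool:
    """Load config.json from a local checkpoint directory and run the check."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return rope_is_per_layer_type(config)
```

Calling check_checkpoint("/app/models/translategemma-27b-it_nvfp4_kv_fp8") would return True if the config has the per-layer-type shape, which would be consistent with the assertion above failing for both the BF16 and the quantized checkpoint.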

Questions

  1. Is TranslateGemma-27B-IT (Gemma 3 based) currently supported in TensorRT-LLM 1.3.0rc1?
  2. Gemma 3 uses per-layer-type RoPE parameters (global attention vs sliding window attention). Is there a timeline for when this feature will be supported?
  3. Are there any workarounds to serve this model with TensorRT-LLM?

Thank you for your help!

This looks like a TensorRT-LLM issue. Can you open a GitHub issue on the NVIDIA/TensorRT-LLM project?