Description
We are engineering a sovereign compute node (running an RTX 6000 Ada) that executes local LLMs (via Ollama and custom wrappers). The challenge is that a probabilistic model cannot be trusted to actuate physical or financial systems directly.
We are routing the LLM’s output (an “Aspiration”) directly into a custom deterministic C++ compiler (the Resin DSL). This compiler checks the intent against strict finite state machine rules before allowing execution.
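To make the idea of checking an intent against finite state machine rules concrete, here is a minimal host-side sketch. It is not the Resin DSL itself: the three states, the three symbols, and the `accepts` helper are hypothetical names invented for illustration, and the rule that a sequence must end back in `Idle` is an assumption.

```cpp
#include <cassert>
#include <vector>

// Hypothetical grammar: Idle --Arm--> Armed --Fire--> Actuating --Reset--> Idle.
enum State  { Idle = 0, Armed = 1, Actuating = 2, kStateCount = 3 };
enum Symbol { Arm = 0, Fire = 1, Reset = 2, kSymbolCount = 3 };

constexpr int kReject = -1;

// Transition table indexed by [state][symbol]; kReject marks a forbidden move.
constexpr int kNext[kStateCount][kSymbolCount] = {
    /* Idle      */ { Armed,   kReject,   Idle },
    /* Armed     */ { kReject, Actuating, Idle },
    /* Actuating */ { kReject, kReject,   Idle },
};

// Returns true only if every symbol is a legal transition and the
// sequence finishes in the Idle state (an assumed acceptance rule).
inline bool accepts(const std::vector<Symbol>& aspiration)
{
    int state = Idle;
    for (Symbol s : aspiration) {
        assert(s < kSymbolCount);
        state = kNext[state][s];
        if (state == kReject) {
            return false;  // illegal transition: refuse execution
        }
    }
    return state == Idle;
}
```

With this table, `accepts({Arm, Fire, Reset})` is true, while `accepts({Fire})` is false because firing from `Idle` is a forbidden transition.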
Environment
TensorRT Version: TensorRT-LLM v0.15.0 (C++ API)
GPU Type: RTX 6000 Ada Generation (48GB VRAM)
Nvidia Driver Version: 560.35
CUDA Version: 12.6
CUDNN Version: 9.3.0
Operating System + Version: Ubuntu 24.04 LTS (Custom PREEMPT_RT patched kernel)
Python Version (if applicable): 3.12 (Build environment only, executing purely in C++)
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): 2.5.1 (Used only for initial ONNX/Engine export)
Baremetal or Container (if container which image + tag): Baremetal (We are avoiding containers to reduce IPC overhead for hardware interrupts).
Relevant Files
We are implementing a custom ILogitsProcessor in C++ to act as a deterministic Finite State Machine (our “Resin DSL” compiler). The bottleneck is in amber_zone_logits_mask.cpp, shown here only conceptually: it modifies the logits in place through the device pointer, before the sampler step, to enforce strict compiler syntax.
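As an illustration of how an FSM can be turned into something cheap to apply per token, the sketch below precomputes one packed allow-mask per state on the host. It is not the contents of amber_zone_logits_mask.cpp. The sizes, the `AllowedToken` struct, and the `buildStateMasks` and `isAllowed` helpers are all hypothetical, and the assumption is that the set of legal tokens depends only on the current FSM state.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sizes, chosen only to keep the example small.
constexpr std::size_t kNumStates    = 3;
constexpr std::size_t kVocabSize    = 70;  // real vocabularies are far larger
constexpr std::size_t kWordsPerMask = (kVocabSize + 31) / 32;

using StateMask = std::array<std::uint32_t, kWordsPerMask>;

// One (state, tokenId) pair that the grammar permits.
struct AllowedToken {
    std::size_t state;
    std::size_t tokenId;
};

// Builds one packed bitmask per FSM state. Bit i set means token i is legal.
// Intended to run once at start-up, so the per-token step becomes a lookup.
inline std::array<StateMask, kNumStates>
buildStateMasks(const std::vector<AllowedToken>& allowed)
{
    std::array<StateMask, kNumStates> masks{};  // all zero: everything banned
    for (const AllowedToken& a : allowed) {
        assert(a.state < kNumStates && a.tokenId < kVocabSize);
        masks[a.state][a.tokenId / 32] |= (1u << (a.tokenId % 32));
    }
    return masks;
}

// Tests a single bit of a packed mask.
inline bool isAllowed(const StateMask& mask, std::size_t tokenId)
{
    assert(tokenId < kVocabSize);
    return ((mask[tokenId / 32] >> (tokenId % 32)) & 1u) != 0;
}
```

In a layout like this, the masks could be uploaded to device memory once, so that selecting the mask for the current state is a pointer offset rather than a grammar evaluation.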
Steps To Reproduce
- Export a local LLaMA-3 (8B) model to a TensorRT engine using trtllm-build with --use_custom_all_reduce disable.
- Initialize the TensorRT-LLM C++ runtime (tensorrt_llm::runtime::GptSession).
- Inject a custom ILogitsProcessor callback designed to mask out any tokens that violate our deterministic hardware compiler syntax (Resin DSL).
- Pass a prompt requesting a physical hardware actuation command.
- The ILogitsProcessor intercepts the logits tensor on the device (*logits_ptr).
- We apply a zero-copy bitmask to penalize/ban non-compliant syntax tokens before the sampling phase.
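For clarity, the masking step in the last bullet can be written as a host-side reference. This is a sketch of the technique, not our device code: `applyTokenBitmask` is a hypothetical name, the packed 32-bit layout is an assumption, and on the GPU the loop body would correspond to one thread per token in a kernel launched on the same CUDA stream as the sampler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Host-side reference for the masking step. Bit i of the packed mask is 1
// when token i is allowed. Banned tokens get a logit of negative infinity,
// so they receive zero probability after softmax.
inline void applyTokenBitmask(float* logits,
                              std::size_t vocabSize,
                              const std::uint32_t* allowedBits,
                              std::size_t numWords)
{
    assert(numWords * 32 >= vocabSize);  // mask must cover the vocabulary
    constexpr float kNegInf = -std::numeric_limits<float>::infinity();

    for (std::size_t tok = 0; tok < vocabSize; ++tok) {
        const std::uint32_t word = allowedBits[tok / 32];
        const bool allowed = ((word >> (tok % 32)) & 1u) != 0;
        if (!allowed) {
            logits[tok] = kNegInf;  // in-place write, no extra copy of logits
        }
    }
}
```

Allowed logits are left untouched, so the relative ranking among compliant tokens is unchanged and only non-compliant tokens are removed from sampling.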
Full traceback of errors encountered / The Core Issue: No crash traceback occurs. The issue is latency and pipeline starvation.
When we evaluate the logits against our C++ Finite State Machine directly in device memory (to avoid the latency of copying the logits back to the host CPU for Python-level evaluation), we see a 15-20 ms penalty per token. In a real-time robotics environment awaiting a physical hardware interrupt, this per-token latency accumulates over the length of the generated command and causes buffer overruns in our concurrent sensor-fusion pipelines.
The Ask: What is the most optimized, zero-copy method within the TensorRT-LLM C++ API to apply a strict deterministic mask to the output logits without halting the CUDA stream and stalling the RTX 6000 Ada?