Bypassing Python: Piping local LLM inference directly into a deterministic C++ compiler pipeline

Description

I am engineering a sovereign compute node (running an RTX 6000 Ada) that executes local LLMs (via Ollama/custom wrappers). The challenge is that probabilistic AI cannot be trusted to actuate physical or financial systems directly.

We are routing the LLM’s output (an “Aspiration”) directly into a custom deterministic C++ compiler (the Resin DSL). This compiler checks the intent against strict finite state machine rules before allowing execution.

Environment

TensorRT Version: TensorRT-LLM v0.15.0 (C++ API)
GPU Type: RTX 6000 Ada Generation (48GB VRAM)
Nvidia Driver Version: 560.35
CUDA Version: 12.6
CUDNN Version: 9.3.0
Operating System + Version: Ubuntu 24.04 LTS (Custom PREEMPT_RT patched kernel)
Python Version (if applicable): 3.12 (Build environment only, executing purely in C++)
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): 2.5.1 (Used only for initial ONNX/Engine export)
Baremetal or Container (if container which image + tag): Baremetal (We are avoiding containers to reduce IPC overhead for hardware interrupts).

Relevant Files

We are implementing a custom ILogitsProcessor in C++ that acts as a deterministic finite state machine (our “Resin DSL” compiler). The conceptual bottleneck lives in amber_zone_logits_mask.cpp, which modifies the logits pointer directly in device memory, before the sampler step, to enforce strict compiler syntax.

Steps To Reproduce

1. Export a local LLaMA-3 (8B) model to a TensorRT engine using trtllm-build with --use_custom_all_reduce disable.

2. Initialize the TensorRT-LLM C++ runtime (tensorrt_llm::runtime::GptSession).

3. Inject a custom ILogitsProcessor callback that masks out any tokens violating our deterministic hardware compiler syntax (Resin DSL).

4. Pass a prompt requesting a physical hardware actuation command.

5. The ILogitsProcessor intercepts the logits tensor on the device (*logits_ptr).

6. We apply a zero-copy bitmask to penalize/ban non-compliant syntax tokens before the sampling phase.

Full traceback of errors encountered / The Core Issue: no crash traceback occurs; the problem is latency and pipeline starvation.

When we attempt to evaluate the logits against our C++ Finite State Machine directly in device memory (to avoid the latency of copying the logits back to the host CPU for Python-level evaluation), we are seeing a 15-20ms penalty per token. In a real-time robotics environment awaiting a physical hardware interrupt, this latency stacks up and causes buffer overruns in our concurrent sensor-fusion pipelines.

The Ask: What is the most optimized, zero-copy method within the TensorRT-LLM C++ API to apply a strict deterministic mask to the output logits without halting the CUDA stream and stalling the RTX 6000 Ada?


Hey @johngreetme ,

This is an amazing use case! Thanks for posting.

I am leaving this thread open for TensorRT Community experts to chime in.

Best Regards,
Atharva

Thank you athkumar. I think that to maintain the 100 Hz heartbeat of the State-Locked Protocol without starvation, I may need to bypass the standard TensorRT-LLM callback entirely. Instead, I'll write a custom CUDA kernel that applies my DSL syntax bitmask to the logits tensor asynchronously, fusing it directly into the customized top-K/top-P sampling kernel before TensorRT hands control back.

Why this could be dramatically faster: the mask is embarrassingly parallel. We launch roughly 128,000 threads, one per vocabulary token, and the RTX 6000 Ada's 18,176 CUDA cores should execute the whole pass in well under 0.05 ms.
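As a sanity check on the thread-count arithmetic above, assuming one thread per token of LLaMA-3's 128,256-entry vocabulary and 256-thread blocks (my assumptions for illustration, not measured figures), the launch geometry works out to:

```cpp
#include <cstdint>

// Back-of-the-envelope launch geometry for the proposed masking kernel.
// Assumptions: one thread per vocabulary token, LLaMA-3 vocab of
// 128,256 tokens, 256 threads per CUDA block.
constexpr int64_t kVocab = 128256;
constexpr int64_t kThreadsPerBlock = 256;

// Ceiling division: how many blocks cover n elements at tpb threads each.
constexpr int64_t numBlocks(int64_t n, int64_t tpb) {
    return (n + tpb - 1) / tpb;
}
// numBlocks(kVocab, kThreadsPerBlock) == 501, i.e. a <<<501, 256>>> launch.
```

A 501-block grid touching ~500 KB of logits (128,256 floats) is a trivially memory-bound pass, which is why a sub-0.05 ms kernel time is plausible on this GPU, provided the launch is enqueued on the same CUDA stream as the sampler so no synchronization stall is introduced.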