Bypassing Python: Piping local LLM inference directly into a deterministic C++ compiler pipeline

johngreetme · February 27, 2026, 11:34pm

Description

I am engineering a sovereign compute node (running an RTX 6000 Ada) that executes local LLMs (via Ollama/custom wrappers). The challenge is that probabilistic AI cannot be trusted to actuate physical or financial systems directly.

We are routing the LLM’s output (an “Aspiration”) directly into a custom deterministic C++ compiler (the Resin DSL). This compiler checks the intent against strict finite state machine rules before allowing execution.
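
A minimal sketch of what "checking the intent against finite state machine rules" can look like, assuming a toy four-symbol grammar. The grammar, the `Symbol` and `State` enums, and the `step` and `validate` functions are hypothetical stand-ins; the real Resin DSL rules are not shown in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy grammar standing in for the Resin DSL:
//   COMMAND := VERB TARGET VALUE END
enum class Symbol : std::uint8_t { Verb, Target, Value, End };
enum class State : std::uint8_t {
    ExpectVerb, ExpectTarget, ExpectValue, ExpectEnd, Accepted, Rejected
};

// Pure transition function: the same (state, symbol) pair always yields
// the same next state, which is what makes the check deterministic.
inline State step(State s, Symbol sym) {
    switch (s) {
        case State::ExpectVerb:
            return sym == Symbol::Verb ? State::ExpectTarget : State::Rejected;
        case State::ExpectTarget:
            return sym == Symbol::Target ? State::ExpectValue : State::Rejected;
        case State::ExpectValue:
            return sym == Symbol::Value ? State::ExpectEnd : State::Rejected;
        case State::ExpectEnd:
            return sym == Symbol::End ? State::Accepted : State::Rejected;
        case State::Accepted:
        case State::Rejected:
            return State::Rejected;  // nothing may follow a finished command
    }
    assert(false && "invalid State value");
    return State::Rejected;  // fail closed in release builds
}

// An Aspiration may execute only if the whole sequence ends in Accepted.
inline bool validate(const std::vector<Symbol>& aspiration) {
    State s = State::ExpectVerb;
    for (Symbol sym : aspiration) {
        s = step(s, sym);
        if (s == State::Rejected) return false;  // stop at the first violation
    }
    return s == State::Accepted;
}
```

For example, `validate({Symbol::Verb, Symbol::Target, Symbol::Value, Symbol::End})` is accepted, while a sequence that skips the target or keeps going after `End` is rejected.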

Environment

TensorRT Version: TensorRT-LLM v0.15.0 (C++ API)
GPU Type: RTX 6000 Ada Generation (48GB VRAM)
Nvidia Driver Version: 560.35
CUDA Version: 12.6
CUDNN Version: 9.3.0
Operating System + Version: Ubuntu 24.04 LTS (Custom PREEMPT_RT patched kernel)
Python Version (if applicable): 3.12 (Build environment only, executing purely in C++)
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): 2.5.1 (Used only for initial ONNX/Engine export)
Baremetal or Container (if container which image + tag): Baremetal (We are avoiding containers to reduce IPC overhead for hardware interrupts).

Relevant Files

We are implementing a custom ILogitsProcessor in C++ to act as a deterministic Finite State Machine (our “Resin DSL” compiler). Conceptually, our bottleneck is amber_zone_logits_mask.cpp, which modifies the logits through a device-memory pointer before the sampler step to enforce strict compiler syntax.
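
One way to keep the FSM out of the per-token hot path is to precompute a packed allow-mask for every FSM state once, so that the decode loop only has to select a row. The sketch below is host-side C++ under that assumption. `tokenClass`, `allowed`, `buildStateMasks`, and `isAllowed` are hypothetical names, and nothing here is part of the TensorRT-LLM API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packed allow-mask: bit (id % 32) of word (id / 32) is 1 when token `id`
// is legal in the given FSM state.
using MaskWord = std::uint32_t;
constexpr std::size_t kBitsPerWord = 32;

inline std::size_t wordsForVocab(std::size_t vocabSize) {
    return (vocabSize + kBitsPerWord - 1) / kBitsPerWord;  // round up
}

// Builds one mask row per FSM state.
//   tokenClass[id]      : hypothetical grammar class of each token id
//   allowed[state][cls] : whether that class is legal in that state
inline std::vector<std::vector<MaskWord>> buildStateMasks(
    const std::vector<std::uint8_t>& tokenClass,
    const std::vector<std::vector<bool>>& allowed) {
    const std::size_t words = wordsForVocab(tokenClass.size());
    std::vector<std::vector<MaskWord>> masks(
        allowed.size(), std::vector<MaskWord>(words, 0u));
    for (std::size_t state = 0; state < allowed.size(); ++state) {
        for (std::size_t id = 0; id < tokenClass.size(); ++id) {
            const std::uint8_t cls = tokenClass[id];
            assert(cls < allowed[state].size());
            if (allowed[state][cls]) {
                masks[state][id / kBitsPerWord] |=
                    MaskWord{1} << (id % kBitsPerWord);
            }
        }
    }
    return masks;
}

// Constant-time membership test against one precomputed row.
inline bool isAllowed(const std::vector<MaskWord>& mask, std::size_t id) {
    return ((mask[id / kBitsPerWord] >> (id % kBitsPerWord)) & 1u) != 0u;
}
```

With the LLaMA-3 vocabulary of 128,256 tokens, each row is 4,008 words (about 16 KB), so a table for a small FSM could be uploaded to device memory once at startup instead of being recomputed per token.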

Steps To Reproduce

Export a local LLaMA-3 (8B) model to a TensorRT engine using trtllm-build with --use_custom_all_reduce disable.

Initialize the TensorRT-LLM C++ runtime (tensorrt_llm::runtime::GptSession).
Inject a custom ILogitsProcessor callback designed to physically mask out any tokens that violate our deterministic hardware compiler syntax (Resin DSL).

Pass a prompt requesting a physical hardware actuation command.
The ILogitsProcessor intercepts the logits tensor on the device (*logits_ptr).
We apply a zero-copy bitmask to penalize/ban non-compliant syntax tokens before the sampling phase.
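
The masking step above can be sketched as a host-side reference of the per-token rule a device kernel would apply, with one loop iteration standing in for one GPU thread. This is an illustrative sketch, not the contents of amber_zone_logits_mask.cpp; `applyAllowMask` and its signature are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Bans every token whose bit is 0 in the packed allow-mask by setting its
// logit to -infinity, so softmax assigns it probability 0.
// Returns the number of tokens that remain selectable.
inline std::size_t applyAllowMask(std::vector<float>& logits,
                                  const std::vector<std::uint32_t>& allowMask) {
    assert(allowMask.size() * 32 >= logits.size());
    constexpr float kBanned = -std::numeric_limits<float>::infinity();
    std::size_t survivors = 0;
    for (std::size_t id = 0; id < logits.size(); ++id) {
        const bool allowed = ((allowMask[id / 32] >> (id % 32)) & 1u) != 0u;
        if (allowed) {
            ++survivors;
        } else {
            logits[id] = kBanned;
        }
    }
    return survivors;
}
```

The return value matters for safety: if no token survives, a softmax over a vector of only -infinity produces NaN, so a caller would want to treat a zero count as a hard rejection rather than sample from it.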

Full traceback of errors encountered / The Core Issue: No crash traceback occurs. The issue is latency and pipeline starvation.

When we attempt to evaluate the logits against our C++ Finite State Machine directly in device memory (to avoid the latency of copying the logits back to the host CPU for Python-level evaluation), we are seeing a 15-20ms penalty per token. In a real-time robotics environment awaiting a physical hardware interrupt, this latency stacks up and causes buffer overruns in our concurrent sensor-fusion pipelines.

The Ask: What is the most optimized, zero-copy method within the TensorRT-LLM C++ API to apply a strict deterministic mask to the output logits without halting the CUDA stream and stalling the RTX 6000 Ada?

athkumar · February 28, 2026, 12:22pm

Hey @johngreetme,

This is an amazing use case! Thanks for posting.

I am leaving this thread open for TensorRT Community experts to chime in.

Best Regards,
Atharva

johngreetme · March 1, 2026, 8:52pm

Thank you, athkumar. I think that to maintain the 100 Hz heartbeat of the State-Locked Protocol without starvation, I may need to bypass the standard TensorRT-LLM callback entirely. Instead, I’ll write a custom CUDA kernel that applies my DSL syntax bitmask to the logits tensor asynchronously, fusing it directly into a customized top-K/top-P sampling kernel before TensorRT hands control back.
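
To make the timing argument concrete, here is a small worked calculation, assuming the 100 Hz heartbeat and the 15-20 ms per-token penalty quoted in this thread. It counts the masking cost only and ignores the model's own forward-pass time; `periodMs` and `budgetFraction` are hypothetical helper names.

```cpp
#include <cassert>

// Heartbeat period in milliseconds for a given rate in Hz.
inline double periodMs(double hz) {
    assert(hz > 0.0);
    return 1000.0 / hz;
}

// Fraction of one heartbeat period consumed by a per-token masking cost.
// Values above 1.0 mean a single token already overruns the period.
inline double budgetFraction(double costMs, double hz) {
    return costMs / periodMs(hz);
}
```

At 100 Hz the period is 10 ms, so `budgetFraction(15.0, 100.0)` is 1.5 and `budgetFraction(20.0, 100.0)` is 2.0: every masked token overruns at least one full heartbeat. A masking cost of 0.05 ms would consume about 0.5% of the period.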

Why this could be far faster: with one thread per vocabulary entry, a single launch covers roughly 128,000 threads (LLaMA-3 has a 128,256-token vocabulary). I estimate the RTX 6000 Ada should take less than 0.05 ms to execute this across its 18,176 CUDA cores.
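
The launch geometry for such a kernel can be worked out on the host. The sketch below assumes one thread per vocabulary entry, 256 threads per block, and a 32-bit packed bitmask; `LaunchPlan` and `planMaskLaunch` are hypothetical names, and the block size is a common default rather than a tuned value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct LaunchPlan {
    std::size_t blocks;           // grid size, one thread per token
    std::size_t threadsPerBlock;  // block size
    std::size_t maskWords;        // 32-bit words in one packed allow-mask
    std::size_t maskBytes;        // bytes the kernel reads for the mask
    std::size_t logitsBytes;      // bytes of fp32 logits it may overwrite
};

inline LaunchPlan planMaskLaunch(std::size_t vocabSize,
                                 std::size_t threadsPerBlock = 256) {
    // Block sizes are kept to whole warps of 32 threads.
    assert(threadsPerBlock > 0 && threadsPerBlock % 32 == 0);
    const std::size_t blocks =
        (vocabSize + threadsPerBlock - 1) / threadsPerBlock;  // round up
    const std::size_t maskWords = (vocabSize + 31) / 32;      // round up
    return {blocks, threadsPerBlock, maskWords,
            maskWords * sizeof(std::uint32_t), vocabSize * sizeof(float)};
}
```

For a 128,256-token vocabulary this gives 501 blocks of 256 threads, a 16,032-byte mask, and 513,024 bytes of fp32 logits per sequence. The working set is small, which suggests launch overhead and synchronization are more likely to dominate than the arithmetic itself.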
