Hi everyone,
I’m an independent developer with a background in algorithms, HPC, and robotics infrastructure. Recently I’ve been working on a lightweight inference engine built around hand-written CUDA kernels, focusing on small-batch and real-time performance (especially for VLA and robotics workloads).
Here are some recent results on Thor and Blackwell:
- Pi0.5 — Jetson AGX Thor (SM110): 44 ms (23 Hz)
- Pi0 — Jetson AGX Thor (SM110): 46 ms (22 Hz)
- Pi0.5 — RTX 5090 (SM120): 17.58 ms (57 Hz)
- Pi0 — RTX 5090 (SM120): 18.43 / 21.16 / 24.48 ms (54 / 47 / 41 Hz)
- GROOT N1.6 — Jetson AGX Thor: 45 ms (T=50) / 41 ms (T=16) → 22 / 24 Hz
- GROOT N1.6 — RTX 5090: 13.08 ms (T=50) / 12.53 ms (T=16) → 76 / 80 Hz
- Pi0-FAST (per-token):
  - Thor: 8.1 ms/token (123 tok/s)
  - RTX 5090: 2.39 ms/token (418 tok/s)
The focus is on true real-time inference at small batch sizes — a regime that typical large-batch-optimized serving stacks tend to underserve.
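For anyone curious how latency numbers like the ones above are typically collected, a common approach is to time the full forward pass with CUDA events, with warmup iterations to exclude one-time startup costs. A minimal sketch — the `dummy_forward` kernel and sizes below are placeholders, not the engine's actual code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for one inference forward pass.
__global__ void dummy_forward(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm up: exclude context creation, module loading, and cache effects.
    for (int i = 0; i < 10; ++i)
        dummy_forward<<<(n + 255) / 256, 256>>>(d_x, n);

    // Time many iterations and report the average per-pass latency.
    const int iters = 100;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        dummy_forward<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg latency: %.3f ms (%.1f Hz)\n", ms / iters, 1000.0f * iters / ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```

CUDA events measure GPU-side elapsed time, so host-side launch overhead (a real factor at small batch sizes) is best checked separately with wall-clock timing around the same loop.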
Still early, but happy to share more details or discuss if anyone is working on similar workloads 🙂