Hi everyone,
I’m an independent developer with a background in algorithms, HPC, and robotics infrastructure. Recently I’ve been working on a lightweight inference engine built around hand-written CUDA kernels, focusing on small-batch and real-time performance (especially for VLA and robotics workloads).
Here are some recent results on Thor and Blackwell:
- Pi0.5 — Jetson AGX Thor (SM110): 44 ms (23 Hz)
- Pi0 — Jetson AGX Thor (SM110): 46 ms (22 Hz)
- Pi0.5 — RTX 5090 (SM120): 17.58 ms (57 Hz)
- Pi0 — RTX 5090 (SM120): 18.43 / 21.16 / 24.48 ms (54 / 47 / 41 Hz)
- GROOT N1.6 — Jetson AGX Thor: 45 ms (T=50) / 41 ms (T=16) → 22 / 24 Hz
- GROOT N1.6 — RTX 5090: 13.08 ms (T=50) / 12.53 ms (T=16) → 76 / 80 Hz
- Pi0-FAST (per-token):
  - Thor: 8.1 ms/token (123 tok/s)
  - RTX 5090: 2.39 ms/token (418 tok/s)
The focus is on true real-time inference at small batch sizes — a regime that typical large-batch-optimized serving stacks tend to underserve.
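For anyone curious how latency numbers like the ones above are typically collected, a common approach is to time the full forward pass with CUDA events, with warmup iterations to exclude one-time startup costs. A minimal sketch — the `dummy_forward` kernel and sizes below are placeholders, not the engine's actual code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for one inference forward pass.
__global__ void dummy_forward(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm up: exclude context creation, module loading, and cache effects.
    for (int i = 0; i < 10; ++i)
        dummy_forward<<<(n + 255) / 256, 256>>>(d_x, n);

    // Time many iterations and report the average per-pass latency.
    const int iters = 100;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        dummy_forward<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg latency: %.3f ms (%.1f Hz)\n", ms / iters, 1000.0f * iters / ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```

CUDA events measure GPU-side elapsed time, so host-side launch overhead (a real factor at small batch sizes) is best checked separately with wall-clock timing around the same loop.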
Still early, but happy to share more details or discuss if anyone is working on similar workloads 🙂