Hello,
I am an independent researcher working on inference efficiency mechanisms for transformer models.
I developed a lightweight runtime gating framework called the Relational Time Engine (RTE). The idea is to stop executing further layers dynamically, at inference time, once representational drift becomes small.
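To make the gating idea concrete, here is a minimal sketch of drift-based early exit. It is not the RTE implementation. It assumes that "representational drift" can be measured as the relative L2 change between consecutive hidden states. The names `relative_drift`, `gated_forward`, `make_toy_layers`, `threshold`, and `min_layers` are hypothetical, and the toy layers stand in for real transformer blocks.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def relative_drift(prev: Sequence[float], curr: Sequence[float],
                   eps: float = 1e-12) -> float:
    """Relative L2 change between two hidden states (an assumed drift metric)."""
    diff = math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))
    norm = math.sqrt(sum(p * p for p in prev))
    return diff / (norm + eps)


def gated_forward(layers: Sequence[Layer], x: Sequence[float],
                  threshold: float = 1e-2,
                  min_layers: int = 1) -> Tuple[Vector, int]:
    """Run layers in order and stop once the drift falls below `threshold`.

    Returns the last hidden state and the number of layers actually executed.
    """
    h = list(x)
    executed = 0
    for layer in layers:
        new_h = layer(h)
        executed += 1
        drift = relative_drift(h, new_h)
        h = new_h
        # Exit early only after a minimum depth has been reached.
        if executed >= min_layers and drift < threshold:
            break
    return h, executed


def make_toy_layers(n: int) -> List[Layer]:
    """Toy residual layers whose update shrinks geometrically with depth."""
    def make(k: int) -> Layer:
        scale = 0.5 ** (k + 1)
        return lambda h: [v + scale * v for v in h]
    return [make(k) for k in range(n)]


if __name__ == "__main__":
    layers = make_toy_layers(8)
    out, executed = gated_forward(layers, [1.0, -2.0, 0.5], threshold=0.1)
    print(f"executed {executed} of {len(layers)} layers")
```

In this toy setup the per-layer drift halves at each step (0.5, 0.25, 0.125, 0.0625), so a threshold of 0.1 stops after the fourth layer. The gate only wraps the existing layers and changes no weights, so a sketch like this involves no retraining.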
CPU benchmarks (8-layer transformer):
• up to 75% fewer layers executed
• ~40% latency reduction
• higher throughput
• bounded output drift
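For readers who want to reproduce figures of this kind, the sketch below shows one way the layer reduction and output drift could be computed from a gated run and a full-depth reference run. The definitions are assumptions rather than the ones used for the benchmarks above: layer reduction is taken as the fraction of layers skipped, and output drift as the relative L2 distance from the full-depth output. Both function names are hypothetical.

```python
import math
from typing import Sequence


def layer_reduction(executed: int, total: int) -> float:
    """Fraction of layers skipped, e.g. 2 of 8 executed gives 0.75."""
    return 1.0 - executed / total


def relative_output_drift(full_out: Sequence[float],
                          gated_out: Sequence[float],
                          eps: float = 1e-12) -> float:
    """Relative L2 distance between the gated and the full-depth outputs."""
    diff = math.sqrt(sum((f - g) ** 2 for f, g in zip(full_out, gated_out)))
    norm = math.sqrt(sum(f * f for f in full_out))
    return diff / (norm + eps)


if __name__ == "__main__":
    print(layer_reduction(executed=2, total=8))
    print(relative_output_drift([3.0, 4.0], [3.0, 3.0]))
```

Under these definitions, a 75% layer reduction on an 8-layer model corresponds to exiting after two layers, and "bounded" drift would mean that `relative_output_drift` stays below a chosen tolerance across the evaluation inputs.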
The system operates purely at runtime and requires no model retraining.
Repository
Whitepaper
Zenodo DOI
I would be interested in feedback from researchers working on inference optimization and GPU runtime scheduling.