Runtime gating for transformer inference (up to 75% layer reduction benchmark)

Hello,

I am an independent researcher working on inference efficiency mechanisms for transformer models.

I developed a lightweight runtime gating framework called the Relational Time Engine (RTE). The idea is to stop executing further layers dynamically once representational drift becomes small.
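To make the idea concrete, here is a minimal sketch of a drift-based gate. It is an illustration only, not the RTE implementation: it assumes drift is measured as the relative L2 change of the hidden state between consecutive layers, and the names `relative_drift`, `run_with_drift_gate`, `threshold`, `min_layers`, and the toy layers are all hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def relative_drift(prev, curr, eps=1e-12):
    """Assumed drift metric: ||curr - prev|| / (||prev|| + eps)."""
    diff = [c - p for p, c in zip(prev, curr)]
    return l2_norm(diff) / (l2_norm(prev) + eps)


def run_with_drift_gate(layers, hidden, threshold=1e-2, min_layers=1):
    """Apply layers in order and stop once the drift falls below threshold.

    Returns the final hidden state and the number of layers executed.
    The layer that produced the small drift is still executed; only the
    layers after it are skipped.
    """
    executed = 0
    for layer in layers:
        new_hidden = layer(hidden)
        executed += 1
        drift = relative_drift(hidden, new_hidden)
        hidden = new_hidden
        if executed >= min_layers and drift < threshold:
            break
    return hidden, executed


def make_toy_layer(scale):
    """Hypothetical residual-style layer: h + scale * tanh(h)."""
    return lambda h: [x + scale * math.tanh(x) for x in h]


# Eight toy layers whose updates shrink quickly with depth (assumed behaviour,
# chosen only so that the gate has something to trigger on).
toy_layers = [make_toy_layer(0.5 ** (3 * i)) for i in range(8)]

if __name__ == "__main__":
    state, used = run_with_drift_gate(toy_layers, [1.0, -2.0, 0.5])
    print(f"executed {used} of {len(toy_layers)} layers")
```

The gate in this sketch needs no retraining because it only inspects activations at inference time. How much it saves depends entirely on the threshold and on how quickly the per-layer updates shrink for a given model and input.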

CPU benchmarks (8-layer transformer):

• up to 75% layer reduction
• ~40% latency reduction
• higher throughput
• bounded output drift

The system operates purely at runtime and requires no model retraining.

Repository

Whitepaper

Zenodo DOI

I would be interested in feedback from researchers working on inference optimization and GPU runtime scheduling.

Hi @maestro.salah,

Thank you for reaching out. I am afraid I may not be able to help with this topic because of my limited understanding of RTE.
However, I will keep this discussion open so that community experts can respond.

Thank you