Runtime gating for transformer inference (75% layer reduction benchmark)

maestro.salah · March 11, 2026, 8:45pm

Hello,

I am an independent researcher working on inference efficiency mechanisms for transformer models.

I developed a lightweight runtime gating framework called Relational Time Engine (RTE). The idea is to dynamically stop layer execution when representational drift becomes small.
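To make the idea concrete for readers, here is a minimal sketch of drift-based early exit. It is not the RTE implementation. The names `relative_drift`, `gated_forward`, `make_toy_layer`, the relative L2 drift metric, and the `threshold` value are all assumptions chosen for illustration, and the toy layers stand in for real transformer blocks.

```python
import math
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[Sequence[float]], List[float]]


def relative_drift(prev: Sequence[float], curr: Sequence[float]) -> float:
    """Relative L2 change between consecutive hidden states (assumed metric)."""
    diff = math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))
    norm = math.sqrt(sum(p * p for p in prev))
    return diff / (norm + 1e-12)


def gated_forward(
    x: Sequence[float],
    layers: Sequence[Layer],
    threshold: float = 1e-2,
    min_layers: int = 1,
) -> Tuple[List[float], int]:
    """Run layers in order and stop once the drift drops below `threshold`.

    Returns the last hidden state and the number of layers actually executed.
    """
    h = list(x)
    executed = 0
    for layer in layers:
        new_h = layer(h)
        executed += 1
        drift = relative_drift(h, new_h)
        h = new_h
        if executed >= min_layers and drift < threshold:
            break  # remaining layers are skipped at runtime, no retraining
    return h, executed


def make_toy_layer(scale: float) -> Layer:
    """Hypothetical stand-in for a transformer block: a small residual update."""
    return lambda h: [v + scale * math.tanh(v) for v in h]


# Eight toy layers whose updates shrink with depth, so the drift decays.
layers = [make_toy_layer(0.5 ** (3 * i)) for i in range(8)]

if __name__ == "__main__":
    hidden, executed = gated_forward([1.0, -2.0, 0.5], layers)
    print(f"executed {executed} of {len(layers)} layers")
```

In a sketch like this, the threshold trades skipped layers against output drift: setting it to 0 runs every layer, while larger values exit earlier.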

CPU benchmarks (8-layer transformer):

• up to 75% layer reduction
• ~40% latency reduction
• higher throughput
• bounded output drift

The system operates purely at runtime and requires no model retraining.

Repository

Whitepaper

Zenodo DOI

I would be interested in feedback from researchers working on inference optimization and GPU runtime scheduling.

AakankshaS · March 31, 2026, 11:25am

Hi @maestro.salah,

Thank you for reaching out. Unfortunately, I may not be able to help with this topic because of my limited understanding of RTE.
However, I will keep this discussion open so that community experts can respond.

Thank you
