How do L2 persistent slices combine with the shared activation memory buffer?

Hi! I am reading a paper published by NVIDIA: AUTOSCRATCH: ML-OPTIMIZED CACHE MANAGEMENT FOR INFERENCE-ORIENTED GPUS

Like here: [figure from the paper]

I am wondering how this is implemented. In particular, does anyone know how the shared activation memory buffer is implemented in TensorRT? Thank you!
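For context on what I have found so far: CUDA (11+, on Ampere-class GPUs) exposes L2 persistence through the access policy window API, so I assume the paper's L2 persistent slices are configured with something like the sketch below. The function name `pin_activations_to_l2` and the buffer/size arguments are just my placeholders, not names from the paper or from TensorRT:

```cpp
#include <algorithm>
#include <cuda_runtime.h>

// Sketch (my assumption, not from the paper): pin a region of a shared
// activation buffer into the persisting portion of L2 cache.
// "activation_buf" and "window_bytes" are hypothetical names.
void pin_activations_to_l2(cudaStream_t stream, void* activation_buf,
                           size_t window_bytes) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Set aside a portion of L2 for persisting accesses.
    size_t set_aside = std::min<size_t>(prop.l2CacheSize,
                                        prop.persistingL2CacheMaxSize);
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, set_aside);

    // Mark accesses to [activation_buf, activation_buf + window_bytes)
    // on this stream as persisting; other accesses stream through L2.
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = activation_buf;
    attr.accessPolicyWindow.num_bytes =
        std::min<size_t>(window_bytes, prop.accessPolicyMaxWindowSize);
    attr.accessPolicyWindow.hitRatio  = 1.0f;  // treat the whole window as persisting
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```

My guess is that TensorRT would point such an access policy window at (part of) its shared activation buffer, but I have not found where or how that happens, which is why I am asking.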
