Local memory overhead: 149%

How is it possible that local_memory_overhead from nvprof is 149%?

It is a complex persistent kernel that I am profiling on a Jetson TX2. I previously profiled it on a GeForce 1050 Ti and got a local_memory_overhead of about 30%. Could it be because there is a long loop in this kernel that doesn’t fit in the instruction pipeline, so it is re-read from local memory on every iteration?

If it is reasonable on a 1050 Ti but not on a TX2, then I believe the guys at the TX2 forum will be able to provide better help:
https://devtalk.nvidia.com/default/board/188/jetson-tx2/