How is it possible that local_memory_overhead from nvprof is 149%?
I am profiling a complex persistent kernel on a Jetson TX2. When I previously profiled the same kernel on a GeForce GTX 1050 Ti, local_memory_overhead was around 30%. Could the difference be because the kernel contains a long loop that doesn't fit in the instruction pipeline, so it is re-read from local memory on every iteration?
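To cross-check the metric independently of nvprof, the per-thread local memory that the compiler reserved for the kernel can be queried with `cudaFuncGetAttributes`. The following is only a minimal sketch: `persistentKernel` and its body are hypothetical stand-ins for the real kernel, and the `scratch` array is there only because a dynamically indexed per-thread array is the kind of construct that usually ends up in local memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real persistent kernel (assumption, not the
// actual code being profiled).
__global__ void persistentKernel(float *out, int iterations)
{
    float scratch[64];  // per-thread array; likely placed in local memory
    for (int i = 0; i < 64; ++i)
        scratch[i] = static_cast<float>(i);

    float acc = 0.0f;
    for (int it = 0; it < iterations; ++it)
        acc += scratch[(it * 7 + threadIdx.x) & 63];  // index not known at compile time

    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

int main()
{
    // Ask the runtime what the compiler reserved for this kernel.
    cudaFuncAttributes attr{};
    cudaError_t err = cudaFuncGetAttributes(&attr, persistentKernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    std::printf("registers per thread:    %d\n", attr.numRegs);
    std::printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}
```

Building this once with `-arch=sm_62` (TX2) and once with `-arch=sm_61` (1050 Ti), and adding `-Xptxas -v` to print register and spill statistics, should show whether the compiler itself allocates different amounts of local memory for the two targets.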