According to @Greg Smith's answer, the instruction replay overhead can be counted with the following events and metrics:
events: `global_ld_mem_divergence_replays`, `global_st_mem_divergence_replays`, `shared_load_replay`, `shared_store_replay`

metrics: `atomic_replay_overhead`, `global_cache_replay_overhead`, `global_replay_overhead`, `local_replay_overhead`, `shared_replay_overhead`
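For reference, this is roughly how I collect them (a sketch of the invocation; `./vecadd` is a placeholder for the actual benchmark binary):

```shell
# Collect the replay-related events and metrics listed above in one run.
nvprof \
  --events global_ld_mem_divergence_replays,global_st_mem_divergence_replays,shared_load_replay,shared_store_replay \
  --metrics atomic_replay_overhead,global_cache_replay_overhead,global_replay_overhead,local_replay_overhead,shared_replay_overhead \
  ./vecadd
```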
To my surprise, all of those events and metrics are zero in my nvprof results, while the instruction replay overhead is quite high.
Below are two simple vector addition benchmarks: the first uses global memory, while the second uses constant memory.
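The kernels look essentially like this (a minimal sketch; the kernel names, array size, and launch configuration are my assumptions, not the exact benchmark code):

```cuda
#include <cuda_runtime.h>

#define N 1024

// Operands for the constant-memory version.
__constant__ float c_a[N];
__constant__ float c_b[N];

// Version 1: operands read from global memory.
__global__ void vecAddGlobal(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Version 2: operands read from constant memory. All threads in a warp
// read different addresses, so constant-cache accesses are serialized.
__global__ void vecAddConstant(float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = c_a[i] + c_b[i];
}
```

Note that in the constant-memory version each thread of a warp reads a different address, which the constant cache serves serially; that is one source of replays I would expect the counters above to capture.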
The nvprof results can be found here.
References from Greg Smith