Executed IPC

I run my program with 4 and with 8 threads, and the 8-thread run takes longer to execute. I mean wall-clock time.

However, when I profile both runs with nvprof, most of the kernels show a higher executed IPC in the 8-thread scenario than in the 4-thread scenario, for example 3.42 vs. 3.36.

So the slower scenario has the better IPC! How can that be explained? Both runs are on the same system.
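
To make the puzzle concrete, here is a minimal Python sketch of the relation I am assuming: cycles = instructions / IPC. The two IPC values are the ones nvprof reported; the instruction counts are made up for illustration (I have not measured them), and the helper name cycles_needed is hypothetical. It only shows that a higher IPC can coexist with a longer run if the 8-thread version executes enough extra instructions.

```python
def cycles_needed(instructions: float, ipc: float) -> float:
    """Cycles required to retire `instructions` at a given IPC."""
    return instructions / ipc

# IPC values as reported by nvprof for the two runs.
ipc_4 = 3.36
ipc_8 = 3.42

# Hypothetical instruction counts (assumption, not measured):
# the 8-thread run is assumed to execute 10% more instructions.
instr_4 = 1_000_000_000
instr_8 = 1_100_000_000

cycles_4 = cycles_needed(instr_4, ipc_4)
cycles_8 = cycles_needed(instr_8, ipc_8)

# Instruction-count ratio above which the 8-thread run needs more cycles
# despite its higher IPC.
break_even = ipc_8 / ipc_4

print(f"4 threads: {cycles_4:.0f} cycles")
print(f"8 threads: {cycles_8:.0f} cycles")
print(f"break-even instruction ratio: {break_even:.4f}")
```

Under these assumed counts, the 8-thread run needs more cycles even though its IPC is higher; any increase in executed instructions above roughly 1.8% would be enough. Whether that is what actually happens in my program is exactly what I am unsure about.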