Kernel is bound by instruction and memory latency despite high achieved occupancy


I am looking into a simple filter kernel that reads a 7x7 region from an image and computes its average. An if statement before each read checks the memory boundary. I am confused by some of the analysis I saw in NVVP.
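To make the per-thread work concrete, here is a minimal host-side C++ sketch of what each thread computes. It is not the actual kernel: the function name `average7x7`, the float pixel type, the row-major layout, and the choice to divide by the number of in-bounds taps (rather than always by 49) are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: average of the 7x7 window centred on pixel (x, y).
// Assumptions (not from the post): float pixels, row-major storage, and
// out-of-bounds taps are skipped, so the sum is divided by the valid tap count.
float average7x7(const std::vector<float>& img, int width, int height, int x, int y) {
    const int radius = 3;  // a 7x7 window extends 3 pixels on each side of the centre
    float sum = 0.0f;
    int count = 0;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            const int sx = x + dx;
            const int sy = y + dy;
            // Boundary check before each read, as described above.
            if (sx >= 0 && sx < width && sy >= 0 && sy < height) {
                sum += img[static_cast<std::size_t>(sy) * width + sx];
                ++count;
            }
        }
    }
    return count > 0 ? sum / static_cast<float>(count) : 0.0f;
}
```

In the real kernel, (x, y) would come from the block and thread indices, and each of the up to 49 reads sits behind its own branch.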

I ran it on a GTX 680 with a block size of 128x1. When the input image is 4096 by 4096, NVVP says the kernel performance is bound by instruction and memory latency. The first thing I looked into was occupancy: since the kernel uses 32 registers per thread and no shared memory, the theoretical occupancy is 100%. The achieved occupancy is also quite good, at 93%. My understanding of the occupancy result is that there are enough active warps. So the second thing I looked into was how many eligible warps can be issued from those active warps. nvprof tells me eligible_warps_per_cycle is 7.27. As far as I understand, this suggests there are enough eligible warps to issue.
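For reference, the 100% theoretical figure can be reproduced with a little arithmetic. The sketch below is a simplified calculation, not the official occupancy calculator: the names `SmLimits` and `theoreticalOccupancy` are hypothetical, the per-SM limits are the ones I believe apply to compute capability 3.0 (the GTX 680), and register allocation granularity and shared memory are ignored.

```cpp
#include <algorithm>

// Per-SM limits assumed for compute capability 3.0 (GTX 680).
struct SmLimits {
    int registers = 65536;  // 32-bit registers per SM
    int maxWarps = 64;      // resident warps per SM
    int maxBlocks = 16;     // resident blocks per SM
    int warpSize = 32;      // threads per warp
};

// Simplified theoretical occupancy: resident warps divided by the warp limit.
// Ignores register allocation granularity and shared memory (the kernel uses none).
double theoreticalOccupancy(int threadsPerBlock, int regsPerThread,
                            const SmLimits& sm = SmLimits{}) {
    const int warpsPerBlock = (threadsPerBlock + sm.warpSize - 1) / sm.warpSize;
    const int blocksByWarps = sm.maxWarps / warpsPerBlock;
    const int regsPerBlock = regsPerThread * warpsPerBlock * sm.warpSize;
    const int blocksByRegs = sm.registers / regsPerBlock;
    // The tightest of the three limits decides how many blocks are resident.
    const int blocks = std::min({blocksByWarps, blocksByRegs, sm.maxBlocks});
    return static_cast<double>(blocks * warpsPerBlock) / sm.maxWarps;
}
```

With 128 threads per block and 32 registers per thread, every limit allows 16 resident blocks of 4 warps each, so `theoreticalOccupancy(128, 32)` returns 1.0, matching the 100% reported by the profiler.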

Here are some other metrics I have:
issue_slot_utilization 35.3% (nvprof)
ldst_fu_utilization Mid 6
alu_fu_utilization Mid 6
cf_fu_utilization Low 1

What I am still confused about is why the kernel is bound by latency when my indicators suggest there are enough warps to hide it. Please let me know which concept I misunderstood or which part I missed.

I have read this topic but am still confused about the issue.

Thanks in advance,