Kernel is bound by instruction and memory latency but achieved high occupancy

boq · August 1, 2020, 3:28pm

Hi,

I am looking into a simple filter kernel that reads a 7x7 region from an image and computes its average. An if statement before each read checks the image boundary. I am confused by some of the analysis I see in NVVP.
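
To make the computation concrete, here is a minimal CPU-side sketch in Python of the work each thread does for one output pixel. It is not the actual kernel. The function name `box_filter_7x7`, the row-major flat layout, and the choice to skip out-of-bounds taps and divide by the number of in-bounds taps are all assumptions made for illustration; the real kernel may handle borders differently (for example by dividing by 49).

```python
def box_filter_7x7(image, width, height):
    """Reference 7x7 average over a row-major flat image (illustrative only)."""
    radius = 3  # a 7x7 window extends 3 pixels in every direction
    out = [0.0] * (width * height)
    for y in range(height):
        for x in range(width):
            total = 0.0
            count = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    sx, sy = x + dx, y + dy
                    # Boundary check before each read, mirroring the
                    # if statements described in the post.
                    if 0 <= sx < width and 0 <= sy < height:
                        total += image[sy * width + sx]
                        count += 1
            # Assumption: average over the in-bounds taps only.
            out[y * width + x] = total / count
    return out
```

A reference like this is mainly useful for checking the GPU output on a small image; it says nothing about performance.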

I ran on a GTX 680 with a block size of 128x1. When the input image is 4096 by 4096, NVVP says the kernel performance is bound by instruction and memory latency. The first thing I looked into was occupancy: since the kernel uses 32 registers per thread and no shared memory, the theoretical occupancy is 100%. The achieved occupancy is also quite good, at 93%. My understanding from the occupancy result is that there are enough active warps. So the second thing I looked into was how many of those active warps are eligible to issue. nvprof reports eligible_warps_per_cycle as 7.27. As far as I understand, this suggests there are enough eligible warps to issue.
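
The 100% theoretical occupancy figure can be reproduced with a short calculation. The sketch below is a simplified model, not the profiler's exact method: the per-SM limits in `SmLimits` are the values assumed here for a compute capability 3.0 device such as the GTX 680 (64 resident warps, 16 resident blocks, 65536 registers), shared memory is ignored because the kernel uses none, and register allocation granularity is not modelled. The names `SmLimits` and `theoretical_occupancy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmLimits:
    # Assumed per-SM limits for compute capability 3.0 (GTX 680).
    max_warps: int = 64
    max_blocks: int = 16
    registers: int = 65536
    warp_size: int = 32


def theoretical_occupancy(threads_per_block, regs_per_thread, sm=SmLimits()):
    """Fraction of the SM's warp slots that can be resident at once."""
    # Round the block size up to whole warps.
    warps_per_block = -(-threads_per_block // sm.warp_size)
    # Registers consumed by one block (granularity ignored in this sketch).
    regs_per_block = regs_per_thread * sm.warp_size * warps_per_block
    blocks_by_regs = sm.registers // regs_per_block
    blocks_by_warps = sm.max_warps // warps_per_block
    # The tightest of the three limits decides how many blocks fit.
    blocks = min(blocks_by_regs, blocks_by_warps, sm.max_blocks)
    return blocks * warps_per_block / sm.max_warps


if __name__ == "__main__":
    # 128 threads per block and 32 registers per thread, as in the post.
    print(f"{theoretical_occupancy(128, 32):.0%}")
```

With 128 threads per block and 32 registers per thread, all three limits allow 16 blocks of 4 warps each, so every one of the 64 warp slots can be filled. Under the same assumptions, 40 registers per thread would drop the result to 75%.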

Here are some other metrics I have:
issue_slot_utilization 35.3% (nvprof)
ldst_fu_utilization Mid 6
alu_fu_utilization Mid 6
cf_fu_utilization Low 1

What I am still confused about is why the kernel is bound by latency when my indicators suggest that there are enough warps to hide it. Please let me know which concept I have misunderstood or what I have missed.

I have read this topic but am still confused about the issue.

Thanks in advance,

