When I run my code in NVVP (6.0), pick my most time-consuming kernel, and run the ‘Kernel Compute’ profile on it, the instruction execution counts chart baffles me: it shows that an absolute majority of the instructions executed are “Misc.”
Once you take out FP math, integer math, flow control, ld/st/cast, and idle threads running NOOPs, what is left? What is this miscellany that makes up more than half of the executed instructions?