My kernel does a very complex computation, and I’m wondering where to start looking for possible optimizations. Running it through the Visual Profiler, I get the following result:
That is, GPU time is 99.99%, with 23,040/15,360 uncoalesced stores/loads, 98,292 branches, and 224 million instructions. All of that for basically a single result vector.
From my understanding, the uncoalesced memory accesses shouldn’t affect performance at all (they are caused by copying the input vectors from global to shared memory). Given the number of instructions, the kernel is compute-bound rather than memory-bound, so improving the memory access pattern should not be the first step. Am I right? How should I put these numbers into perspective relative to each other?
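One rough way to relate these counters is a back-of-envelope ratio. The sketch below is only an estimate under loud assumptions: it reads "23.040/15.360" as the whole numbers 23,040 and 15,360, it treats the profiler counters as directly comparable (they may be sampled on only part of the GPU or counted per warp), and the two cost constants are hypothetical placeholders, not measured values.

```python
# Counter values quoted in the post (reading "23.040/15.360" as whole numbers).
INSTRUCTIONS = 224_000_000
UNCOALESCED_STORES = 23_040
UNCOALESCED_LOADS = 15_360

# Hypothetical cost model, NOT measured: pick values that suit your hardware.
ASSUMED_CYCLES_PER_UNCOALESCED_ACCESS = 500
ASSUMED_CYCLES_PER_INSTRUCTION = 1


def instructions_per_uncoalesced_access(instructions, stores, loads):
    """Average number of instructions executed per uncoalesced access."""
    return instructions / (stores + loads)


def uncoalesced_cost_fraction(instructions, stores, loads,
                              cycles_per_access, cycles_per_instruction):
    """Share of estimated cycles spent on uncoalesced accesses,
    pessimistically assuming their latency is never hidden."""
    memory_cycles = (stores + loads) * cycles_per_access
    compute_cycles = instructions * cycles_per_instruction
    return memory_cycles / (memory_cycles + compute_cycles)


if __name__ == "__main__":
    ratio = instructions_per_uncoalesced_access(
        INSTRUCTIONS, UNCOALESCED_STORES, UNCOALESCED_LOADS)
    fraction = uncoalesced_cost_fraction(
        INSTRUCTIONS, UNCOALESCED_STORES, UNCOALESCED_LOADS,
        ASSUMED_CYCLES_PER_UNCOALESCED_ACCESS, ASSUMED_CYCLES_PER_INSTRUCTION)
    print(f"instructions per uncoalesced access: {ratio:.0f}")
    print(f"estimated share of cycles in uncoalesced accesses: {fraction:.1%}")
```

If the estimated share stays small even with a pessimistic per-access cost, that would be consistent with the guess that the kernel is compute-bound. If it grows large for plausible costs, the memory access pattern deserves a closer look first.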