Computation-intensive kernel: optimization ideas

Hello people,

I have a highly computation-intensive kernel that applies a filter to an image (filter size = 13 x 13). The current instruction-to-byte ratio is around 24. I timed the computation and the memory operations separately and found that the total time spent in the kernel is only slightly larger than the computation-only kernel time (the overlap is pretty good). But the kernel's current occupancy is very low, only around 0.2. Given all this profiling info:
  1. Does it mean that whatever parallelism I bring in to increase occupancy, the total time will always be around the current computation-only time, since memory latency is already well hidden?
  2. Does better occupancy help in hiding arithmetic latency, or is it already hidden at 0.2 occupancy (see the CUDA Best Practices Guide)?
  3. Which method is better: doing the complete 13 x 13 filter on commonly loaded data and not bothering about the instruction-to-byte ratio, or splitting the kernel into multiple blocks, which means each block has to load the same data and apply a different filter?
Looking forward to what you guys think.
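
To put rough numbers on question 3, here is a minimal host-side C++ sketch that counts loads and multiply-adds per output tile. The 16 x 16 tile size, the function names, and the `parts` split factor are illustrative assumptions, not values from the kernel above, and it assumes (as the question does) that every block in a split reloads the same full input patch. It counts multiply-adds per loaded element, which is only a stand-in for the profiler's instruction-to-byte ratio.

```cpp
#include <cassert>

// Input elements one block must load to produce a tile x tile patch of
// outputs with a k x k filter: the tile plus a halo of (k - 1) per dimension.
long long loads_per_tile(int tile, int k) {
    assert(tile > 0 && k > 0);
    long long side = static_cast<long long>(tile) + k - 1;
    return side * side;
}

// Multiply-adds needed for the same tile: k * k per output element.
long long macs_per_tile(int tile, int k) {
    assert(tile > 0 && k > 0);
    return static_cast<long long>(tile) * tile * k * k;
}

// Multiply-adds per loaded element when the filter work is split across
// `parts` blocks. ASSUMPTION: each block reloads the full input patch,
// so total loads scale with `parts` while total multiply-adds do not.
double macs_per_load(int tile, int k, int parts) {
    assert(parts > 0);
    return static_cast<double>(macs_per_tile(tile, k)) /
           static_cast<double>(parts * loads_per_tile(tile, k));
}
```

With a 16 x 16 tile and the 13 x 13 filter, one block loads 28 x 28 = 784 elements for 43264 multiply-adds, about 55 per loaded element. Splitting the same work across 4 blocks that each reload the patch drops that to about 14, so the split only pays off if the extra loads stay hidden behind computation.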


The Visual Profiler can tell you what is limiting the occupancy.
Check Session\analyze occupancy

Too much shared memory per block, too many threads per block, or too much register usage can all produce low occupancy.
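
A rough way to see how those three limits interact is to work out how many blocks fit on one multiprocessor under each of them and take the smallest. The host-side C++ sketch below does that. The per-SM limits in `SmLimits` are assumed values for illustration (roughly a compute capability 2.x device), the struct and function names are made up for this example, and it ignores register and shared memory allocation granularity, so treat the result as an estimate and trust the profiler over it.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Per-multiprocessor limits. ASSUMED illustrative values (roughly a
// compute capability 2.x device); look up the real limits for your GPU.
struct SmLimits {
    int max_threads  = 1536;   // resident threads per SM
    int max_blocks   = 8;      // resident blocks per SM
    int registers    = 32768;  // 32-bit registers per SM
    int shared_bytes = 49152;  // shared memory per SM
};

struct OccupancyEstimate {
    int blocks_per_sm;     // resident blocks under the tightest limit
    double occupancy;      // resident threads / max resident threads
    std::string limiter;   // which resource set blocks_per_sm
};

// Estimate occupancy from a kernel's per-block resource usage.
// Allocation granularity is ignored, so this is an approximation.
OccupancyEstimate estimate_occupancy(int threads_per_block,
                                     int regs_per_thread,
                                     int shared_bytes_per_block,
                                     const SmLimits& sm = SmLimits{}) {
    assert(threads_per_block > 0 && regs_per_thread > 0);
    assert(shared_bytes_per_block >= 0);

    int by_threads = sm.max_threads / threads_per_block;
    int by_regs    = sm.registers / (regs_per_thread * threads_per_block);
    int by_shared  = shared_bytes_per_block > 0
                         ? sm.shared_bytes / shared_bytes_per_block
                         : sm.max_blocks;  // no shared memory used: no limit

    // Start from the hardware block limit and tighten it step by step.
    int blocks = sm.max_blocks;
    std::string limiter = "block limit";
    if (by_threads < blocks) { blocks = by_threads; limiter = "threads per block"; }
    if (by_regs    < blocks) { blocks = by_regs;    limiter = "registers"; }
    if (by_shared  < blocks) { blocks = by_shared;  limiter = "shared memory"; }

    double occupancy =
        static_cast<double>(blocks) * threads_per_block / sm.max_threads;
    return {blocks, occupancy, limiter};
}
```

For example, under these assumed limits `estimate_occupancy(256, 20, 40960)` gives one block per SM, limited by shared memory, for an occupancy of about 0.17. Dropping the shared memory use to zero and raising the register count to 63 per thread gives two blocks per SM, limited by registers.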