Some questions about local memory cache misses and texture use

Lately I have been struggling with some questions about global memory bandwidth. My device is a Tesla C2050, and the OS is Windows 7 64-bit.

When I use texture memory, the Compute Visual Profiler reports:

       <b>Achieved global memory throughput:  109.46 ( Peak global memory throughput(GB/s):  144.00 )</b>

However, the run time is higher than in the case without textures. More importantly, the Compute Visual Profiler also reports:

             <b> The replays due to local memory cache misses are high</b>

If I read global memory directly, the run time is lower and there are no local memory cache misses, but the achieved global memory bandwidth is very low compared with the peak value.

Is there anything I can do to improve the global memory bandwidth? I believe my global memory accesses are already coalesced.

Some code is below: