Is texture memory on Maxwell slower than global memory?

Hello,

I am trying to understand how to use texture memory for 2D data access.
According to the literature, texture memory in CUDA is optimized for 2D spatial locality. To verify that,
I created two simple kernels, one using texture memory and the other using global memory, that each apply a 3x3 averaging filter to a 512x512 image,
as follows:
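
A minimal sketch of such a kernel pair, assuming a single-channel float image, the texture object API with point filtering, and clamp-to-edge borders. The names avgTex, avgGlobal and maxDiffBetweenKernels are illustrative only, and this is not necessarily the exact code that was profiled:

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int W = 512, H = 512;  // image size (assumed float pixels)

// 3x3 mean read through a texture object; clamp addressing handles borders.
__global__ void avgTex(cudaTextureObject_t tex, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += tex2D<float>(tex, x + dx + 0.5f, y + dy + 0.5f);
    out[y * w + x] = sum / 9.0f;
}

// Same filter read from linear global memory, clamping indices by hand.
__global__ void avgGlobal(const float* in, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int xx = min(max(x + dx, 0), w - 1);
            int yy = min(max(y + dy, 0), h - 1);
            sum += in[yy * w + xx];
        }
    out[y * w + x] = sum / 9.0f;
}

// Runs both kernels on the same synthetic image and returns the largest
// absolute difference between their outputs (error checks omitted).
float maxDiffBetweenKernels() {
    const size_t bytes = size_t(W) * H * sizeof(float);
    std::vector<float> img(W * H);
    for (int i = 0; i < W * H; ++i) img[i] = float(i % 251);

    // Texture path: copy the image into a cudaArray, then create the object.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, W, H);
    cudaMemcpy2DToArray(arr, 0, 0, img.data(), W * sizeof(float),
                        W * sizeof(float), H, cudaMemcpyHostToDevice);
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    // Global path: plain linear device buffer.
    float *dIn, *dOutTex, *dOutGlobal;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOutTex, bytes);
    cudaMalloc(&dOutGlobal, bytes);
    cudaMemcpy(dIn, img.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    avgTex<<<grid, block>>>(tex, dOutTex, W, H);
    avgGlobal<<<grid, block>>>(dIn, dOutGlobal, W, H);

    std::vector<float> a(W * H), b(W * H);
    cudaMemcpy(a.data(), dOutTex, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(b.data(), dOutGlobal, bytes, cudaMemcpyDeviceToHost);

    float maxDiff = 0.0f;
    for (int i = 0; i < W * H; ++i)
        maxDiff = std::fmax(maxDiff, std::fabs(a[i] - b[i]));

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(dIn); cudaFree(dOutTex); cudaFree(dOutGlobal);
    return maxDiff;
}
```

Calling maxDiffBetweenKernels() from main and checking that the result is close to zero confirms that both kernels compute the same filter before their timings are compared. The 16x16 block size is also an assumption.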

When I profiled these two kernels, I got the following results:

Execution time:

Texture    Global memory
2 ms       0.9 ms

To investigate further, I also collected the cache hit rate for both kernels and the global memory load efficiency for the global memory kernel:

Metric               Description                     Texture    Global
sm_efficiency        Multiprocessor Activity         99.92%     99.80%
achieved_occupancy   Achieved Occupancy              0.96       0.88
ipc                  Executed IPC                    0.77       2.68
tex_cache_hit_rate   Unified Cache Hit Rate          93.26%     62.47%
gld_efficiency       Global Memory Load Efficiency   N/A        61.46%

What could explain the much lower IPC of the texture kernel compared to the global memory kernel, and the difference in execution time?