I am trying to understand how to use texture memory for 2D data access.
According to the literature, texture memory in CUDA is optimized for 2D spatial locality. To verify that,
I created two simple kernels, one using texture memory and one using global memory. Each applies a 3×3 averaging filter to a 512×512 image.
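For illustration, a kernel pair of this kind can be sketched roughly as below. This is a minimal sketch and not necessarily the exact code that was profiled: it assumes a single-channel `float` image, the texture object API, point filtering, and clamped borders. The names `avgGlobal`, `avgTexture`, and `maxDiffBetweenKernels` are made up for the example, and error checking is omitted for brevity.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// 3x3 mean filter reading straight from global memory; borders are clamped.
__global__ void avgGlobal(const float* in, float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int xx = min(max(x + dx, 0), w - 1);
            int yy = min(max(y + dy, 0), h - 1);
            sum += in[yy * w + xx];
        }
    out[y * w + x] = sum / 9.0f;
}

// Same filter through a texture object; the clamp address mode handles borders.
__global__ void avgTexture(cudaTextureObject_t tex, float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            // +0.5f addresses the texel centre with unnormalized coordinates.
            sum += tex2D<float>(tex, x + dx + 0.5f, y + dy + 0.5f);
    out[y * w + x] = sum / 9.0f;
}

// Hypothetical helper: runs both kernels on a synthetic image and returns
// the largest absolute difference between their outputs.
float maxDiffBetweenKernels(int w, int h)
{
    size_t n = size_t(w) * h, bytes = n * sizeof(float);
    std::vector<float> img(n), g(n), t(n);
    for (size_t i = 0; i < n; ++i) img[i] = float(i % 251);

    float *dIn, *dOutG, *dOutT;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOutG, bytes);
    cudaMalloc(&dOutT, bytes);
    cudaMemcpy(dIn, img.data(), bytes, cudaMemcpyHostToDevice);

    // Copy the image into a CUDA array and bind it to a texture object.
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &fmt, w, h);
    cudaMemcpy2DToArray(arr, 0, 0, img.data(), w * sizeof(float),
                        w * sizeof(float), h, cudaMemcpyHostToDevice);
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    avgGlobal<<<grid, block>>>(dIn, dOutG, w, h);
    avgTexture<<<grid, block>>>(tex, dOutT, w, h);
    cudaDeviceSynchronize();

    cudaMemcpy(g.data(), dOutG, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(t.data(), dOutT, bytes, cudaMemcpyDeviceToHost);
    float maxDiff = 0.0f;
    for (size_t i = 0; i < n; ++i)
        maxDiff = std::fmax(maxDiff, std::fabs(g[i] - t[i]));

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(dIn);
    cudaFree(dOutG);
    cudaFree(dOutT);
    return maxDiff;
}
```

Calling `maxDiffBetweenKernels(512, 512)` is a quick way to confirm that both kernels compute the same filter before comparing their timings. The 16×16 block size is an arbitrary choice for the sketch.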
When I profiled these two kernels, here is what I found:
| Metric | Texture | Global memory |
|---|---|---|
| Execution time | 2 ms | 0.9 ms |
To investigate cache usage and global memory load efficiency further, I collected the following metrics for both kernels:
| Metric | Texture | Global memory |
|---|---|---|
| sm_efficiency (Multiprocessor Activity) | 99.92% | 99.80% |
| ipc (Executed IPC) | 0.77 | 2.68 |
| tex_cache_hit_rate (Unified Cache Hit Rate) | 93.26% | 62.47% |
| gld_efficiency (Global Memory Load Efficiency) | NA | 61.46% |
What could explain the much lower IPC of the texture kernel compared to the global memory kernel, and the difference in execution time?