Texture memory is a part of global memory. It is cached and read-only. But for a 2D stencil heat problem, a lot of literature suggests using texture memory. The time taken (gpu__time_duration.sum) decreases on using texture memory, but the memory throughput increases too (dram__bytes.sum.per_second). Why will the memory throughput increase? It is accessing data from the slower global memory at the end, so why will the throughput increase? I can not find any literature pointing to it.
The way textures are handled and stored in memory is optimized for data locality of your neighbors, which is different than a normal 2D array. I don’t believe the specifics of how this is done is all public information but just knowing that they are stored differently can explain why performance differs. In theory, if you could optimize the memory access pattern using global memory, you could get the same performance. It’s most likely just a function of the different ways textures vs arrays are stored in memory.