I have a huge input matrix (~1.5 GB) that every thread reads while computing each element of the output. Initially I passed the input matrix as a global memory pointer, but I switched to storing it in a texture, since texture reads are cached and the access patterns are quite regular within groups of threads. However, I'm not seeing any performance improvement: the texture version takes the same time as the global memory version. Is there a way to check cache hit rates, occupancy, and so on, so that I can track down the reason?