Hi,
I have a huge input matrix (~1.5GB) which is being accessed by each thread while computing each element of the output. Initially, I had the input data matrix as a global memory pointer but I switched to storing it in a texture as it’s cached and the access patterns are quite regular for groups of threads. However, I’m not seeing any improvement in performance as it’s taking the same time as the global memory version. Is there a way to check the cache hits, occupancy etc. so that I can track down the reason?
May I ask a related question here? Would someone please shed some light on how texture caching works? I am trying to get answers to the following questions, but the programming guide is not clear about them:
If a cache miss occurs, a read operation will follow. Does this read operation fetch exactly what missed, or does it read a block of memory and store it all in the cache? If the latter is the case, how large is the block? What is the cache size, especially on the 1.3 series? It seems that the cache is per thread; is that right? Will it be cleared during a kernel launch?
I know the answer is probably "No", but is it possible to bind textures to 2 different addresses? Even if it is possible, it may not be very effective, but it could help in certain cases.
These details are not discussed anywhere in the official documentation. What is stated (and tested to work) for getting the best performance is that threads within a warp should access values near each other in memory. And the texture cache is per SM, not per thread: see the programming guide for its size. Note that, technically, there is also a larger texture cache shared by a TPC of 2 or 3 SMs, depending on the compute capability of the card.
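As a minimal sketch of the "threads in a warp access nearby values" rule (the texture reference name, kernel name, and data layout here are all illustrative assumptions, not anything from the posts above): consecutive threads of a warp fetching consecutive elements through a 1D texture reference keep each warp's fetches within the same cached region.

```cuda
// Sketch only: a 1D texture reference bound to linear device memory.
// `inputTex` and `scaleKernel` are made-up names for illustration.
texture<float, 1, cudaReadModeElementType> inputTex;

__global__ void scaleKernel(float *out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Adjacent threads in a warp fetch adjacent elements, so one
        // warp's fetches fall into the same cached block of memory.
        out[i] = s * tex1Dfetch(inputTex, i);
    }
}

// Host side, before launching (error checking omitted):
// cudaBindTexture(NULL, inputTex, d_input, n * sizeof(float));
```

If threads of a warp instead fetched elements scattered far apart, each fetch would likely touch a different cached block and the cache would help much less.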
?? Why do you think the answer is no? You can declare about as many texture references as you want. There is a limit, but I don't recall the exact number (16, 24, or 32, maybe?). The forum search isn't the greatest, so I can't find the NVIDIA post that confirms the limit.
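To make the "multiple texture references" point concrete, here is a minimal sketch with two independent 1D texture references used in one kernel (all names are assumptions for illustration; each reference is bound separately on the host):

```cuda
// Sketch: two independent texture references, bound to two different
// device allocations, both fetched in the same kernel.
texture<int,   1, cudaReadModeElementType> idxTex;
texture<float, 1, cudaReadModeElementType> dataTex;

__global__ void gatherKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j  = tex1Dfetch(idxTex, i);   // index read via one texture
        out[i] = tex1Dfetch(dataTex, j);  // data read via the other
    }
}

// Host side (error checking omitted):
// cudaBindTexture(NULL, idxTex,  d_idx,  n * sizeof(int));
// cudaBindTexture(NULL, dataTex, d_data, n * sizeof(float));
```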
Anyway, here are a couple of other threads on the texture cache. It has been discussed many times on these forums.
In fact, I am trying to optimize a kernel in which the threads in a warp cooperate on a calculation over a series of data. They need to read an 'int' index from an array, and then the next warp needs the next element. I wanted to know whether, by using the texture cache, I can almost guarantee several cache hits before a cache miss.
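A minimal sketch of the access pattern you describe (the texture and kernel names are made up, and this assumes the index array is bound to a 1D texture): each warp fetches one int, and consecutive warps fetch consecutive ints, so after one warp's fetch misses and pulls in a cached block, the fetches of the following warps that fall in the same block should hit.

```cuda
// Sketch of the per-warp index read described above.
// `idxTex` and `perWarpKernel` are illustrative names only.
texture<int, 1, cudaReadModeElementType> idxTex;

__global__ void perWarpKernel(const float *data, float *out, int warpsTotal)
{
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane   = threadIdx.x % 32;
    if (warpId < warpsTotal) {
        // All 32 lanes of a warp fetch the same int; consecutive warps
        // fetch consecutive ints, so several warps' indices share one
        // cached block and should hit after the first miss.
        int start = tex1Dfetch(idxTex, warpId);
        out[warpId * 32 + lane] = data[start + lane];
    }
}
```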
Reading through the threads you linked, it seems that NVIDIA does not like to talk about this.
I think the access pattern in my model discussed above adheres to this. Do you agree?
Sorry, my mistake! I have read this in the manual.
Thanks for the information. I need 2, or at most 3, linear textures, so I think I won't hit the limit. Before changing the code, I am trying to find out whether it is worth it at all.