I am trying to use textures to speed up an algorithm that does a lot of global memory access, but I am having a hard time debugging a numerical problem. My code implements two ways of fetching the data: plain global memory accesses, and texture fetches through a texture bound to the same global memory with cudaBindTexture2D. The texture version gives a significant speed improvement that I am quite happy with, but the numerical result of the calculation changes slightly relative to the version that uses global memory accesses.

Thinking that I had implemented the texture fetches incorrectly, I debugged the code under device emulation, but there the data returned by the texture fetches was the same as the data read from global memory.

Thinking that something was different in the hardware version of the code, I changed the kernel to fetch each value both ways, once through global memory and once through the texture, and to increment a counter whenever the two values differed. With that change, the counter never incremented (unless I deliberately changed the code so that the data would differ) AND the results of the calculation matched my expectations.

Finally, thinking that the linear memory might somehow be getting modified, so that this was a cache coherence problem, I added a step that copies the data to a cudaArray and bound the texture to that instead, so that I had two separate copies of the data to compare. Unfortunately the problem stayed the same: if I use only the texture data, the numbers don’t match, but if I use the texture data while additionally fetching the data from global memory, the numbers all match.
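A minimal sketch of the dual-fetch counter check described above, in the variant where the texture is bound to linear memory. This is an illustration, not the actual project code: the names `tex`, `compareFetches`, and `countMismatches` are hypothetical, the data is assumed to be a 2D array of floats, and the sketch assumes the legacy texture reference API that cudaBindTexture2D belongs to (deprecated in later toolkits and removed in CUDA 12). Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Legacy texture reference (hypothetical name); defaults to point
// filtering, clamp addressing, and unnormalized coordinates.
texture<float, 2, cudaReadModeElementType> tex;

// Fetch every element both ways and count the elements whose bits differ.
__global__ void compareFetches(const float* g_data, size_t pitchInFloats,
                               int width, int height,
                               unsigned int* mismatches)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float fromGlobal  = g_data[y * pitchInFloats + x];
    float fromTexture = tex2D(tex, x + 0.5f, y + 0.5f);  // texel centre

    // Compare bit patterns so that NaNs and signed zeros are handled exactly.
    if (__float_as_int(fromGlobal) != __float_as_int(fromTexture))
        atomicAdd(mismatches, 1u);
}

// Host helper (hypothetical): upload the data, bind the texture to the same
// pitched allocation, run the check, and return the mismatch count.
unsigned int countMismatches(const float* h_data, int width, int height)
{
    float* d_data = 0;
    size_t pitch = 0;
    cudaMallocPitch(&d_data, &pitch, width * sizeof(float), height);
    cudaMemcpy2D(d_data, pitch, h_data, width * sizeof(float),
                 width * sizeof(float), height, cudaMemcpyHostToDevice);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    size_t offset = 0;
    cudaBindTexture2D(&offset, tex, d_data, desc, width, height, pitch);

    unsigned int* d_count = 0;
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    compareFetches<<<grid, block>>>(d_data, pitch / sizeof(float),
                                    width, height, d_count);

    unsigned int h_count = 0;
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);  // also waits for the kernel

    cudaUnbindTexture(tex);
    cudaFree(d_count);
    cudaFree(d_data);
    return h_count;
}
```

In this form the check only confirms that the two fetch paths return bit-identical inputs; it says nothing about how the compiler schedules or combines the arithmetic that consumes them.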
So the problem looks like some kind of weird quantum mechanical situation: if I use texture fetches and don’t look at the numbers individually, I don’t get the same result as the version with global memory accesses, but if I have the code do the additional work of fetching each number a second time via a global memory access, then I see no differences in the numbers, either individually or in the final calculation result.
My question is: how do I debug such a situation?
Any help would be appreciated.