I’m writing a red/black Gauss-Seidel finite-difference implementation to demonstrate the use of CUDA. I’ve written two versions: one that reads through texture memory and one that reads from global memory (staging the data in shared memory first). However, even though the global-memory version uses coalesced accesses, the texture version seems to be faster. Is there any possible explanation for this? Note that there are no bank conflicts on the shared memory accesses.
I’ll assume you’re getting quite a lot of cache hits (or your entire working set fits into the 8 KB texture cache). In that case the cache hits give you a lot more effective bandwidth with textures than you would get from gmem, even when the gmem accesses are fully coalesced.
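If you want to check the cache effect in isolation, here is a rough sketch of a red/black half-sweep that can read its neighbours through either path. It is only an assumption about what your kernels look like: it uses `__ldg` (the read-only data cache, compute capability 3.5 and later) as a stand-in for your texture fetches, it skips the shared-memory staging entirely, and the names `rbSweep` and `solve`, the 512x512 grid and the boundary values are all made up for illustration. Each half-sweep only reads points of the opposite colour, which are not written during that launch.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// One half-sweep of red/black Gauss-Seidel for the 2D Laplace equation.
// colour = 0 updates points with (i + j) even, colour = 1 the odd ones.
// USE_LDG routes neighbour reads through the read-only data cache
// (assumed stand-in for a texture fetch); otherwise plain global loads.
template <bool USE_LDG>
__global__ void rbSweep(float* u, int n, int colour)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    if (i < 1 || j < 1 || i >= n - 1 || j >= n - 1) return;  // boundary stays fixed
    if (((i + j) & 1) != colour) return;                     // wrong colour this pass

    const float* c = u + i * n + j;
    float sum;
    if (USE_LDG)
        sum = __ldg(c - 1) + __ldg(c + 1) + __ldg(c - n) + __ldg(c + n);
    else
        sum = c[-1] + c[1] + c[-n] + c[n];
    u[i * n + j] = 0.25f * sum;
}

// Runs `iters` full red+black sweeps on the host grid `h` (n x n, row-major)
// and returns the elapsed kernel time in milliseconds.
template <bool USE_LDG>
float solve(std::vector<float>& h, int n, int iters)
{
    float* d = nullptr;
    size_t bytes = h.size() * sizeof(float);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(32, 8);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int it = 0; it < iters; ++it) {
        rbSweep<USE_LDG><<<grid, block>>>(d, n, 0);  // red points
        rbSweep<USE_LDG><<<grid, block>>>(d, n, 1);  // black points
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}

int main()
{
    const int n = 512, iters = 200;            // illustrative sizes only
    std::vector<float> a(n * n, 0.0f);
    for (int j = 0; j < n; ++j) a[j] = 1.0f;   // top edge held at 1
    std::vector<float> b = a;

    float msGlobal = solve<false>(a, n, iters);
    float msLdg    = solve<true>(b, n, iters);

    float maxDiff = 0.0f;
    for (size_t k = 0; k < a.size(); ++k)
        maxDiff = std::fmax(maxDiff, std::fabs(a[k] - b[k]));

    printf("global: %.3f ms, __ldg: %.3f ms, max diff: %g\n",
           msGlobal, msLdg, maxDiff);
    return 0;
}
```

Treat the timings as indicative only: the first call also pays the first-launch overhead, and on newer GPUs the global loads are cached as well, so the gap you see on your card may be much smaller or larger than with real textures.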
There’s also partition camping to consider: http://forums.nvidia.com/index.php?showtopic=96423
Very interesting. I don’t think this is mentioned in the CUDA Programming Guide.