Global memory latency ... and shared memory as a cache

I have a question just to make sure that I understand the memory architecture right. If threads in a warp read from the same position in global memory, the read operations have to be serialized, and the latency is something like NUM_OF_THREADS * GLOBAL_MEM_LATENCY. In other words, the read fetch for thread k has to wait until the read fetch for thread k-1 is completed. So there is no way to broadcast the data from global memory to the threads, as there is in shared memory. Is that correct?

And if yes, then I suppose I need to implement a cache in shared memory. (I tried to use texture memory, which has a cache, but the performance was somehow worse than with uncached global memory; for details check: ) My task is to traverse a tree (which does not fit into shared memory). My assumption is that all the threads will traverse the same sequence of interior nodes most of the time, so they will access exactly the same memory locations.

So if I use shared memory as a cache, I might be able to take advantage of its broadcast mechanism. Do you think it is a good approach?
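Something like this is what I have in mind. It is only a sketch: it assumes the tree is a complete binary search tree stored in level order ("heap layout", children of node i at 2*i+1 and 2*i+2), so the first CACHED_NODES entries are exactly the top levels. The Node struct, CACHED_NODES, and the leaf convention are placeholders, not my real data structure:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder node type; my real nodes are bigger.
struct Node { int key; };

#define CACHED_NODES 64   // top levels of the tree; must fit in shared memory

__global__ void traverse(const Node *tree, int numNodes,
                         const int *queries, int *out, int n)
{
    __shared__ Node cache[CACHED_NODES];

    // Cooperatively stage the top of the tree into shared memory once.
    for (int i = threadIdx.x; i < CACHED_NODES && i < numNodes; i += blockDim.x)
        cache[i] = tree[i];
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int q = queries[tid], cur = 0, last = 0;
    while (cur < numNodes) {
        // Upper levels come from shared memory (broadcast-friendly);
        // lower levels fall through to global memory.
        Node node = (cur < CACHED_NODES) ? cache[cur] : tree[cur];
        last = node.key;
        cur = (q < node.key) ? 2 * cur + 1 : 2 * cur + 2;
    }
    out[tid] = last;   // key of the last node visited
}
```

The upper levels are the ones every thread is almost guaranteed to touch, so caching them once per block avoids repeated global reads there even when the threads diverge further down.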

And one more question… If my understanding of global memory reads is correct (latency = NUM_OF_THREADS * GLOBAL_MEM_LATENCY when the threads access the same location), would it be better to somehow figure out whether all the threads in the warp access the same location and have only the first thread read the value?
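Roughly what I mean, sketched with a warp vote (__all) plus a shared-memory slot per warp. This relies on warp-synchronous execution (lanes of a warp running in lockstep), which I understand holds on current hardware but is not guaranteed forever; it also assumes n is a multiple of the block size. The names (stage, load index array) are placeholders:

```cuda
#define BLOCK_SIZE 256

__global__ void gather(const int *data, const int *idx, int *out, int n)
{
    // One staging slot per warp; volatile because we rely on
    // warp-synchronous execution instead of __syncthreads().
    volatile __shared__ int stage[BLOCK_SIZE / 32];

    int tid  = blockIdx.x * blockDim.x + threadIdx.x;   // assumes tid < n
    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;

    int i = idx[tid];
    if (lane == 0) stage[warp] = i;        // publish lane 0's index
    int same = __all(i == stage[warp]);    // warp vote: do all lanes agree?

    int v;
    if (same) {
        if (lane == 0) stage[warp] = data[i];   // single global read
        v = stage[warp];                        // broadcast via shared memory
    } else {
        v = data[i];                            // fall back to per-lane reads
    }
    out[tid] = v;
}
```

No __syncthreads() is needed because everything stays within one warp, but that is exactly the warp-synchronous assumption flagged in the comments.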

I think global memory is the main bottleneck for my application, so I need to solve this issue.

Please correct me, or post any advice or relevant experience with global memory access.


The shared memory broadcast could help, but keep in mind that you will need __syncthreads(), which imposes block-wide synchronization and limits the amount of warp interleaving that can be accomplished.

Don’t worry about the latency. What matters more is that multiple threads reading the same value like that are not reading in a coalesced pattern, and that is what hurts your performance.

I realize that with a tree data structure it will take a little work, but the only way to know what kind of performance you are getting is to count the total number of bytes read/written and divide by the elapsed time to get an effective memory throughput in GiB/s. 70 GiB/s is achievable under optimal coalescing/cached conditions.
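For example, with CUDA events as the stopwatch (a toy copy kernel stands in here for your traversal; for your real kernel you substitute the actual byte counts):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: one 4-byte read + one 4-byte write per element,
// so the total traffic is easy to count exactly.
__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main()
{
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    copyKernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // bytes moved = one read + one write per element
    double bytes = 2.0 * n * sizeof(float);
    double gibs  = bytes / (ms / 1000.0) / (1024.0 * 1024.0 * 1024.0);
    printf("effective bandwidth: %.2f GiB/s\n", gibs);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Comparing that number against the card’s theoretical peak tells you immediately how much of the gap is coalescing-related.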