I have a question, just to make sure I understand the memory architecture right. If threads in a warp read from the same position in global memory, the read operations have to be serialized, and the latency is something like NUM_OF_THREADS * GLOBAL_MEM_LATENCY. In other words, the read fetch for thread k has to wait until the read fetch for thread k-1 is completed. So there is no way to broadcast the data from global memory to threads the way it works in shared memory. Is that correct?
And if yes, then I suppose I need to implement a cache in shared memory. (I tried to use texture memory, which has a cache, but the performance was somehow worse than when using global memory without a cache; for details see: http://forums.nvidia.com/index.php?showtopic=59803). My task is to traverse a tree (which does not fit into shared memory). My assumption is that all the threads will traverse the same sequence of interior nodes most of the time, so they will access exactly the same memory locations.
So if I use shared memory as a cache, I might be able to take advantage of shared memory and its broadcast mechanism. Do you think this is a good approach?
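To make the idea concrete, here is a minimal sketch of what I have in mind — assuming a breadth-first node layout so the top levels of the tree occupy the first array entries; the `Node` struct, `CACHED_NODES`, and the kernel signature are just placeholders for illustration, not my real code:

```cuda
// Sketch: stage the top CACHED_NODES tree nodes into shared memory once
// per block, so repeated reads of the same interior node are served by
// shared memory (and its broadcast mechanism) instead of global memory.

struct Node {
    float splitValue;
    int   left;    // index of left child, negative if none
    int   right;   // index of right child, negative if none
};

#define CACHED_NODES 256   // must fit into the available shared memory

__global__ void traverse(const Node *tree, const float *queries, int *leaves)
{
    __shared__ Node cache[CACHED_NODES];

    // Cooperatively copy the top of the tree into shared memory.
    for (int i = threadIdx.x; i < CACHED_NODES; i += blockDim.x)
        cache[i] = tree[i];
    __syncthreads();

    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    float q  = queries[tid];
    int node = 0;  // root is at index 0 in breadth-first layout
    while (true) {
        // Serve cached top levels from shared memory, the rest from global.
        Node n   = (node < CACHED_NODES) ? cache[node] : tree[node];
        int next = (q < n.splitValue) ? n.left : n.right;
        if (next < 0) break;  // reached a leaf
        node = next;
    }
    leaves[tid] = node;
}
```

Since all threads in a warp are expected to visit the same interior nodes, the `cache[node]` reads should hit the same shared-memory address and broadcast in one cycle.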
And one more question… If my understanding of global memory reads is correct (latency = NUM_OF_THREADS * GLOBAL_MEM_LATENCY when the threads access the same location), would it be better to somehow detect that all the threads in the warp access the same location and have only the first thread read the value?
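Something like the following sketch is what I mean — one thread does the global read and the value is broadcast to the rest through shared memory (the kernel and array names are made up for the example):

```cuda
// Sketch: instead of every thread issuing its own global read of the
// same address, thread 0 fetches the value once and broadcasts it to
// the whole block through shared memory.
__global__ void broadcastRead(const int *nodeData, int *out)
{
    __shared__ int value;

    if (threadIdx.x == 0)
        value = nodeData[0];   // single global-memory read per block
    __syncthreads();           // make the value visible to all threads

    // All threads now read the same shared-memory address, which the
    // hardware serves as a one-cycle broadcast.
    out[blockIdx.x * blockDim.x + threadIdx.x] = value;
}
```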
I think global memory is the main bottleneck of my application, so I need to solve this issue.
Please correct me, or post any advice or relevant experience with global memory access.