All threads reading the same address is slower than each warp reading its own address

I benchmarked four cases of searching for 1 million input values, using something like a B-tree search.
case1: uniformly random values (heavy branch divergence).
case2: case1, sorted (slight branch divergence).
case3: case1 with each value repeated across 32 consecutive entries (all threads in a warp access the same gram address).
case4: every value is the same constant, chosen at random (srand(time)).
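
The four input sets described above could be generated on the host roughly as follows. This is only a sketch: the original benchmark code was not posted, so the function names, the 32-bit key type, and the reading of case3 as "each group of 32 consecutive entries holds one value" are all assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helpers; names and key type are assumptions, not the poster's code.
constexpr std::size_t kWarpSize = 32;

// case1: uniformly random keys.
std::vector<std::uint32_t> make_case1(std::size_t n, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::uint32_t> dist;  // full 32-bit range
    std::vector<std::uint32_t> keys(n);
    for (auto& k : keys) k = dist(rng);
    return keys;
}

// case2: the case1 keys, sorted.
std::vector<std::uint32_t> make_case2(std::vector<std::uint32_t> keys) {
    std::sort(keys.begin(), keys.end());
    return keys;
}

// case3: every group of 32 consecutive entries repeats the group's first key,
// so all threads of one warp would search for the same value.
std::vector<std::uint32_t> make_case3(std::vector<std::uint32_t> keys) {
    for (std::size_t i = 0; i < keys.size(); ++i)
        keys[i] = keys[i - i % kWarpSize];
    return keys;
}

// case4: every entry is the same constant.
std::vector<std::uint32_t> make_case4(std::size_t n, std::uint32_t value) {
    return std::vector<std::uint32_t>(n, value);
}
```

With thread i searching for keys[i], case3 and case4 look identical from inside a single warp; they differ only in what the other warps are doing.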

All tests show a stable performance ordering, from fastest to slowest: 3 > 4 > 2 > 1.
Why is case 3 faster than case 4? I can't find any documentation on gram conflicts between warps, or on different coalescing behavior between these two cases. Any suggestions? Thanks!

What is a "gram conflict", by the way? Do you mean global RAM, i.e. global memory? If so, what conflict are you talking about? The manual only mentions "shared memory bank conflicts". Or have I missed something in the manual?

On a lighter note: if you want to see 4 > 3, just re-order your cases... ha ha ha :lol:

There's probably no such thing as a "gram conflict".
Yes, in this post "gram" means global memory.

The timings look like this: case1: 40 ms; case2: 23 ms; case4: 19 ms; case3: 15 ms.
This ordering and trend are stable across various data sizes and time seeds.

For 3 > 4, my only guess is that multiple accesses to the same gram bank are queued. When all threads read the same gram address, one bank builds up a long queue while all the other banks sit idle. Case 3 spreads its accesses over several shorter queues and is therefore faster... sounds like daydreaming, huh? :(
Could someone from NVIDIA shed some light on this? Thanks.
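
To make the difference between case 3 and case 4 concrete, here is a small host-side sketch that counts how many different keys appear inside one warp and across the whole input. It assumes thread i searches for keys[i] and that a warp covers 32 consecutive indices; the struct and function names are made up for illustration. It does not confirm the queueing guess above, it only shows where the two cases actually differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical summary of how spread out the searched keys are.
struct KeySpread {
    std::size_t max_distinct_per_warp;  // most different keys seen inside one warp-sized group
    std::size_t distinct_total;         // different keys across the whole input
};

// Assumes thread i handles keys[i], so a warp is warp_size consecutive entries.
KeySpread measure_key_spread(const std::vector<std::uint32_t>& keys,
                             std::size_t warp_size = 32) {
    KeySpread spread{0, 0};
    std::unordered_set<std::uint32_t> all_keys;
    for (std::size_t base = 0; base < keys.size(); base += warp_size) {
        const std::size_t end = std::min(base + warp_size, keys.size());
        std::unordered_set<std::uint32_t> warp_keys;
        for (std::size_t i = base; i < end; ++i) {
            warp_keys.insert(keys[i]);
            all_keys.insert(keys[i]);
        }
        spread.max_distinct_per_warp =
            std::max(spread.max_distinct_per_warp, warp_keys.size());
    }
    spread.distinct_total = all_keys.size();
    return spread;
}
```

For both case 3 and case 4, max_distinct_per_warp is 1, so the access pattern within a warp is the same. Only distinct_total differs: about one key per warp in case 3 versus a single key for the whole grid in case 4. Whatever explains the 15 ms versus 19 ms gap would therefore have to be an effect between warps rather than inside one.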

Answering your question is difficult without the full source code (this may be a timing issue, for example).

How does the first case represent "non-coalesced access"?

And how does sorting the array in the second case result in global memory coalescing?

As Andrei said, it would be useful if you could give more details, and maybe the code as well.

Cases 1 and 2 are not related to coalescing; thanks for the correction. I have already edited the post.

The code is tightly coupled to the rest of the project and not easy to post.

(It doesn't seem to be a timing issue, but thanks all the same!)