Hi,
I benchmarked four cases of searching 1 million input values, similar to a B-tree search.
case1: uniformly random values. (heavily divergent branching)
case2: case1, sorted. (slight branching)
case3: case1 with each value duplicated across 32 consecutive entries. (each warp accesses the same gram address)
case4: all values duplicate one constant (srand(time)).
All tests show a stable performance ordering, fastest first: 3 > 4 > 2 > 1.
Why is 3 faster than 4? I can't find any documentation on gram conflicts among warps, or on different coalescing behavior between these two cases. Any suggestions? Thanks!
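For anyone who wants to reproduce the inputs, here is a minimal host-side sketch of how the four cases could be generated. The function names, the `key_range` parameter, and the seed handling are hypothetical, and "duplicate every 32 values" is read here as "each group of 32 consecutive entries repeats the first entry of that group"; the original benchmark code may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

constexpr std::size_t kWarpSize = 32;  // threads per warp on NVIDIA GPUs

// Case 1: uniformly random keys in [0, key_range).
std::vector<std::uint32_t> make_case1(std::size_t n, std::uint32_t key_range,
                                      std::uint32_t seed) {
    assert(key_range > 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::uint32_t> dist(0, key_range - 1);
    std::vector<std::uint32_t> keys(n);
    for (auto& k : keys) k = dist(rng);
    return keys;
}

// Case 2: the case 1 keys, sorted ascending.
std::vector<std::uint32_t> make_case2(std::vector<std::uint32_t> case1) {
    std::sort(case1.begin(), case1.end());
    return case1;
}

// Case 3 (assumed reading): every group of 32 consecutive entries repeats
// the first key of that group, so all threads of a warp search the same key.
std::vector<std::uint32_t> make_case3(std::vector<std::uint32_t> case1) {
    for (std::size_t i = 0; i < case1.size(); ++i) {
        case1[i] = case1[i - i % kWarpSize];  // group leader is left unchanged
    }
    return case1;
}

// Case 4: every entry is the same constant.
std::vector<std::uint32_t> make_case4(std::size_t n, std::uint32_t value) {
    return std::vector<std::uint32_t>(n, value);
}
```

With these helpers, `make_case3(make_case1(n, range, seed))` gives one key per warp, while `make_case4(n, v)` gives one key for the whole grid.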
What is a “gram conflict”, by the way? Do you mean global RAM, i.e. global memory? If so, what conflict are you talking about? There's only a mention of “shared memory bank conflicts” in the manual. Or have I missed something in the manual?
On a lighter note: if you want to see 4 > 3, then just re-order your cases… ha ha haa… :lol:
There's probably no “gram conflict”.
Yes, in this post, gram means global memory.
The timings are: case1: 40 ms; case2: 23 ms; case4: 19 ms; case3: 15 ms.
This ordering and trend are stable across various data sizes and time seeds.
For 3 > 4, my only guess is that multiple accesses to each gram bank are queued. When all the threads read the same gram address, one bank builds up a long queue while all other banks sit idle. Case 3 spreads its queues across banks and is thus faster… sounds like daydreaming, huh? :(
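The guess above can't be confirmed from the host side, but the difference in access pattern between the cases can at least be quantified. The sketch below counts how many distinct keys a single warp searches for and how many distinct keys the whole input contains. `AccessStats` and `summarize_keys` are hypothetical names, and the mapping of 32 consecutive input values to one warp is an assumption about the kernel launch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct AccessStats {
    std::size_t max_distinct_per_warp;  // worst case within any one warp
    std::size_t distinct_overall;       // distinct keys across all warps
};

// Assumes thread i searches keys[i], so each run of warp_size consecutive
// keys belongs to one warp (an assumption about the launch configuration).
AccessStats summarize_keys(const std::vector<std::uint32_t>& keys,
                           std::size_t warp_size = 32) {
    assert(warp_size > 0);
    AccessStats stats{0, 0};
    std::unordered_set<std::uint32_t> overall;
    for (std::size_t start = 0; start < keys.size(); start += warp_size) {
        const std::size_t end = std::min(start + warp_size, keys.size());
        std::unordered_set<std::uint32_t> in_warp(
            keys.begin() + static_cast<std::ptrdiff_t>(start),
            keys.begin() + static_cast<std::ptrdiff_t>(end));
        stats.max_distinct_per_warp =
            std::max(stats.max_distinct_per_warp, in_warp.size());
        overall.insert(in_warp.begin(), in_warp.end());
    }
    stats.distinct_overall = overall.size();
    return stats;
}
```

For case3 and case4 the per-warp count is 1 in both, so the two cases differ only in the overall count: many distinct keys spread across warps in case3, a single key shared by every warp in case4. That is the only difference this sketch measures; it says nothing about how the hardware services those accesses.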
Could someone from NVIDIA shed some light on this? Thanks.