I benchmarked four cases of searching 1 million input values, similar to a B-tree search.
case1: uniformly randomized. (severely branched)
case2: sorted from case1. (slight branch)
case3: duplicate every 32 values from case1. (per-warp accessing the same gram address)
case4: all duplicate some const value (srand(time)).
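For anyone who wants to reproduce the inputs, here is a minimal host-side sketch of how the four cases could be generated. It is not the code I actually benchmarked: `QuerySets`, `make_query_sets`, the 32-bit key type, and the use of `std::mt19937` instead of `srand` are all assumptions made for illustration, and case 3 assumes that each run of 32 consecutive slots reuses the first value of that run from case 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical container for the four query sets (names are illustrative only).
struct QuerySets {
    std::vector<std::uint32_t> uniform;   // case 1: uniformly random
    std::vector<std::uint32_t> sorted;    // case 2: case 1, sorted
    std::vector<std::uint32_t> per_warp;  // case 3: one value per group of 32
    std::vector<std::uint32_t> constant;  // case 4: a single repeated value
};

// Builds all four inputs from one seed. warp_size = 32 is assumed.
QuerySets make_query_sets(std::size_t n, std::uint32_t seed,
                          std::size_t warp_size = 32) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::uint32_t> dist;  // full 32-bit range

    QuerySets q;

    // Case 1: independent uniform random keys.
    q.uniform.resize(n);
    for (auto& v : q.uniform) v = dist(rng);

    // Case 2: the same keys in ascending order.
    q.sorted = q.uniform;
    std::sort(q.sorted.begin(), q.sorted.end());

    // Case 3: every slot in a group of warp_size copies the group's first key,
    // so all threads of a warp search for the same value.
    q.per_warp.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        q.per_warp[i] = q.uniform[(i / warp_size) * warp_size];
    }

    // Case 4: one random key repeated n times.
    q.constant.assign(n, dist(rng));

    return q;
}
```

Calling `make_query_sets(1000000, seed)` gives four arrays of equal length that can be copied to the device and fed to the same search kernel, so only the input distribution changes between runs.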
Across all runs the performance ordering is stable, from fastest to slowest: 3 > 4 > 2 > 1.
Why is case 3 faster than case 4? I can't find any documentation describing gram conflicts among warps, or a difference in coalescing behavior between these two cases. Any suggestions? Thanks!