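[codebox]warp_id = tid / 32;
// One warp per row: every thread in the warp reads the same row[i] and row[i+1].
for (int i = warp_id; i < num_combine_rows; i += num_warps) {
    int row_start = row[i];
    int row_end   = row[i + 1];
    for (int j = row_start; j < row_end; j += 32) {
        // ...
    }
}
[/codebox]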
All the threads in a warp access the same address in the array row, which I assumed would result in an uncoalesced access.
But I get no speedup if I store the values of row in the constant cache.
Doesn't this situation satisfy the conditions for using the constant cache?
Or do I misunderstand coalesced access, and this situation doesn't actually result in an uncoalesced access?
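
For reference, here is a minimal sketch of what putting row in constant memory could look like. It is not the original kernel: `c_row`, `MAX_ROWS`, `row_lengths` and `run_row_lengths` are hypothetical names, the per-element work is replaced by simply writing out each row's length, and it assumes the row offsets fit within the 64 KB constant memory limit. Every thread of a warp reads the same `c_row[i]`, which is the broadcast pattern the constant cache is designed for.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_ROWS 1024  // assumption: num_combine_rows + 1 offsets fit in constant memory

// Hypothetical constant-memory copy of the row offsets.
__constant__ int c_row[MAX_ROWS + 1];

// One warp per row; all lanes of a warp read the same c_row[i] and c_row[i + 1].
__global__ void row_lengths(int *out, int num_combine_rows)
{
    int tid       = blockIdx.x * blockDim.x + threadIdx.x;
    int lane      = tid % 32;
    int warp_id   = tid / 32;
    int num_warps = (gridDim.x * blockDim.x) / 32;

    for (int i = warp_id; i < num_combine_rows; i += num_warps) {
        int row_start = c_row[i];      // same address for the whole warp
        int row_end   = c_row[i + 1];
        if (lane == 0)
            out[i] = row_end - row_start;  // stand-in for the real per-row work
    }
}

// Host helper: copies the offsets to constant memory, runs the kernel,
// and copies the per-row lengths back. Returns false on any CUDA error.
bool run_row_lengths(const int *h_row, int num_rows, int *h_out)
{
    if (num_rows > MAX_ROWS) return false;

    int *d_out = nullptr;
    if (cudaMemcpyToSymbol(c_row, h_row, (num_rows + 1) * sizeof(int)) != cudaSuccess)
        return false;
    if (cudaMalloc(&d_out, num_rows * sizeof(int)) != cudaSuccess)
        return false;

    row_lengths<<<1, 64>>>(d_out, num_rows);  // 64 threads = 2 warps

    cudaError_t err = cudaMemcpy(h_out, d_out, num_rows * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess;
}

int main()
{
    const int num_rows = 4;
    int h_row[num_rows + 1] = {0, 3, 40, 40, 100};
    int h_out[num_rows];

    if (!run_row_lengths(h_row, num_rows, h_out)) {
        printf("CUDA error\n");
        return 1;
    }
    for (int i = 0; i < num_rows; ++i)
        printf("row %d length %d\n", i, h_out[i]);
    return 0;
}
```

Whether this beats reading row from global memory depends on the GPU generation, so it is worth timing both variants on the target device.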