I have an interesting problem: in global memory I store a precalculated table of integers. The table is built so that each lookup returns the index to use for the next lookup, in an iterative manner:
int curr_idx = threadIdx.x;
int next_idx = table[curr_idx];
// which I can then use as:
int new_next_idx = table[next_idx];
The table is rather large, so it needs to sit in global memory. Table values are essentially random, and it takes 50 clock cycles to generate each entry of the table.
Now, I'd like to use this table in a CUDA kernel, but given the 600-cycle access latency of global memory, it would be more efficient to recalculate these values in each thread unless I can coalesce the accesses. However, due to the semi-random nature of the table accesses, I don't think I can do this.
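To make the trade-off concrete, here is a minimal sketch of the two options I am weighing. Everything named here is an assumption for illustration: gen_entry is a hypothetical stand-in hash (not my real 50-cycle generator), and chase_lookup, chase_recompute and run_comparison are made-up names. It only shows the shape of the comparison, not a definitive benchmark.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// HYPOTHETICAL stand-in for the real generator: an integer hash
// reduced to a valid index in [0, n). Runs identically on host and device.
__host__ __device__ inline int gen_entry(int idx, int n) {
    unsigned int x = (unsigned int)idx + 0x9e3779b9u;
    x ^= x >> 16; x *= 0x7feb352du;
    x ^= x >> 15; x *= 0x846ca68bu;
    x ^= x >> 16;
    return (int)(x % (unsigned int)n);
}

// Option A: chase indices through the precalculated table in global memory.
// Each load depends on the previous one, so a thread cannot overlap its own loads.
__global__ void chase_lookup(const int* __restrict__ table, int* out, int steps) {
    int idx = threadIdx.x;  // same starting index as in the snippet above
    for (int s = 0; s < steps; ++s) idx = table[idx];
    out[blockIdx.x * blockDim.x + threadIdx.x] = idx;
}

// Option B: skip the table and recompute every entry on the fly.
__global__ void chase_recompute(int n, int* out, int steps) {
    int idx = threadIdx.x;
    for (int s = 0; s < steps; ++s) idx = gen_entry(idx, n);
    out[blockIdx.x * blockDim.x + threadIdx.x] = idx;
}

// Runs both kernels, prints their timings, and returns true if the results agree.
bool run_comparison(int n, int steps) {
    const int threads = 256, blocks = 64, total = threads * blocks;
    if (n < threads) return false;  // starting indices must be valid table slots

    // Build the table on the host with the same generator the device uses.
    std::vector<int> h_table(n), h_a(total), h_b(total);
    for (int i = 0; i < n; ++i) h_table[i] = gen_entry(i, n);

    int *d_table = nullptr, *d_a = nullptr, *d_b = nullptr;
    bool ok = cudaMalloc(&d_table, n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&d_a, total * sizeof(int)) == cudaSuccess
           && cudaMalloc(&d_b, total * sizeof(int)) == cudaSuccess
           && cudaMemcpy(d_table, h_table.data(), n * sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        cudaEvent_t t0, t1, t2;
        cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);
        cudaEventRecord(t0);
        chase_lookup<<<blocks, threads>>>(d_table, d_a, steps);
        cudaEventRecord(t1);
        chase_recompute<<<blocks, threads>>>(n, d_b, steps);
        cudaEventRecord(t2);
        cudaEventSynchronize(t2);

        float ms_a = 0.0f, ms_b = 0.0f;
        cudaEventElapsedTime(&ms_a, t0, t1);
        cudaEventElapsedTime(&ms_b, t1, t2);
        printf("lookup: %.3f ms, recompute: %.3f ms\n", ms_a, ms_b);
        cudaEventDestroy(t0); cudaEventDestroy(t1); cudaEventDestroy(t2);

        // Copy both result arrays back and check that the two options agree.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_a.data(), d_a, total * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(h_b.data(), d_b, total * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess
          && h_a == h_b;
    }
    cudaFree(d_table); cudaFree(d_a); cudaFree(d_b);
    return ok;
}
```

Calling something like run_comparison(1 << 22, 64) from main and building with nvcc should print one timing per option. The first launch also absorbs one-time startup cost, so the numbers are only a rough indication, and the real generator would have to replace gen_entry before they mean anything for my case.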