Has this been implemented before? (Coalesced access pattern)

Okay, I have a 1D array of data and an array of indices into it.

Basically, I want this to be as efficient as possible:

__global__
void kern(int4 const* data, unsigned int const* ids, int4* new_data) {
  auto tid = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
  auto idx = ids[tid]; // coalesced: consecutive threads read consecutive ids
  auto datum = data[idx]; // not coalesced in general: idx can point anywhere in data

  /* ... */
}

I think a good avenue would be shared memory. The values in ids are not necessarily sequential, so you may not get these perfect 128-byte reads.

Maybe we write code that first does a dumb read, because that’s guaranteed to be coalesced:

auto data_this_thread_does_not_care_about = data[tid];

We could then write this to shared memory, and whichever thread in the block actually needs the value could read it out of shared memory. How exactly to go about that is something I’m not sure of.

And then what do we do if a thread needs a value that isn’t in the shared memory block? A global load would be necessary then, which is fine. I know we can’t eliminate all uncoalesced accesses, but it would be nice to minimize them.
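
To make the idea concrete, here is a rough sketch of what I have in mind, not a tested or tuned implementation. It assumes each block stages the slice of data that lines up with its own thread indices, that every value in ids is a valid index below n, and that n > 0. The names gather_staged, run_gather_demo, kBlock and the extra n parameter are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr unsigned int kBlock = 128;  // hypothetical block size; must match the launch

// Each block stages data[base .. base + blockDim.x) in shared memory with a
// coalesced read, then serves lookups from the tile and falls back to a
// global load on a miss.
__global__ void gather_staged(int4 const* data, unsigned int const* ids,
                              int4* new_data, unsigned int n) {
  __shared__ int4 tile[kBlock];
  unsigned int const base = blockIdx.x * blockDim.x;  // first element of this block's tile
  unsigned int const tid = base + threadIdx.x;

  // The "dumb" coalesced read: thread i loads data[base + i], needed or not.
  if (tid < n) tile[threadIdx.x] = data[tid];
  __syncthreads();  // every thread reaches this before any early exit

  if (tid >= n) return;
  unsigned int const idx = ids[tid];  // coalesced
  // Hit if idx is in [base, base + blockDim.x). When idx < base the unsigned
  // subtraction wraps to a large value, so the comparison fails as intended.
  unsigned int const off = idx - base;
  new_data[tid] = (off < blockDim.x) ? tile[off] : data[idx];
}

// Host-side check (assumption: used only to validate the sketch).
// Returns the number of mismatches against a CPU gather, or -1 on a CUDA error.
int run_gather_demo(unsigned int n) {
  std::vector<int4> h_data(n), h_out(n);
  std::vector<unsigned int> h_ids(n);
  for (unsigned int i = 0; i < n; ++i) {
    h_data[i] = make_int4(i, 2 * i, 3 * i, 4 * i);
    // Mostly nearby indices, with an occasional far jump to force a miss.
    h_ids[i] = (i % 8 == 0) ? (i * 7919u) % n : (i ^ 1u) % n;
  }

  int4* d_data = nullptr;
  int4* d_out = nullptr;
  unsigned int* d_ids = nullptr;
  bool ok =
      cudaMalloc(&d_data, n * sizeof(int4)) == cudaSuccess &&
      cudaMalloc(&d_out, n * sizeof(int4)) == cudaSuccess &&
      cudaMalloc(&d_ids, n * sizeof(unsigned int)) == cudaSuccess &&
      cudaMemcpy(d_data, h_data.data(), n * sizeof(int4),
                 cudaMemcpyHostToDevice) == cudaSuccess &&
      cudaMemcpy(d_ids, h_ids.data(), n * sizeof(unsigned int),
                 cudaMemcpyHostToDevice) == cudaSuccess;
  if (ok) {
    gather_staged<<<(n + kBlock - 1) / kBlock, kBlock>>>(d_data, d_ids, d_out, n);
    ok = cudaMemcpy(h_out.data(), d_out, n * sizeof(int4),
                    cudaMemcpyDeviceToHost) == cudaSuccess;
  }
  cudaFree(d_data);
  cudaFree(d_out);
  cudaFree(d_ids);
  if (!ok) return -1;

  int mismatches = 0;
  for (unsigned int i = 0; i < n; ++i) {
    int4 const want = h_data[h_ids[i]];
    int4 const got = h_out[i];
    if (got.x != want.x || got.y != want.y || got.z != want.z || got.w != want.w)
      ++mismatches;
  }
  return mismatches;
}
```

This only pays off if ids[tid] tends to land near tid. If the indices are scattered across the whole array, almost every lookup takes the fallback path and the staged read is wasted bandwidth.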

Has an access pattern like this been solved elegantly by someone before? Or does anyone have any good suggestions or implementation details? Hopefully the problem I’m describing makes sense.