Fine-grained Address Control in cooperative_groups::memcpy_async

I am accelerating a workload that performs a large number of loads from global memory to shared memory. It seems that the asynchronous global-to-shared data copies introduced in CUDA 11.0 may help.

In the blog post "Controlling Data Movement to Boost Performance on the NVIDIA Ampere Architecture", it is suggested that we can replace

```cpp
shared[group.thread_rank()] = global1[subset * group.size() + group.thread_rank()];
```

with

```cpp
cooperative_groups::memcpy_async(group, shared,
                                 &global1[subset * group.size()],
                                 sizeof(T) * group.size());
```
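
For context, here is a minimal, self-contained sketch of how I understand that pattern (the kernel name, the `float` element type, and the `num_subsets` parameter are my own illustrations, not from the blog post):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Illustrative kernel: stage one contiguous chunk per iteration into shared
// memory with the collective, asynchronous copy. Launch with
// blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void stage_and_compute(const float* global1, size_t num_subsets) {
    extern __shared__ float shared[];        // group.size() elements
    auto group = cg::this_thread_block();

    for (size_t subset = 0; subset < num_subsets; ++subset) {
        // All threads cooperatively issue one contiguous copy; on Ampere
        // this maps to cp.async and bypasses the register file.
        cg::memcpy_async(group, shared,
                         global1 + subset * group.size(),
                         sizeof(float) * group.size());
        cg::wait(group);                     // data is now visible in shared[]
        // ... compute on shared[] here ...
        group.sync();                        // before the next copy overwrites it
    }
}
```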

However, this only works when the mapping between thread index and memory address is simple. What should we do when the mapping is more complex? For example:

```cpp
shared[laneId / 8][laneId % 8] = global1[laneId / 8][laneId % 8];
```
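
My current thinking (possibly wrong): since `memcpy_async` copies a contiguous span, a complex per-thread mapping would have to be decomposed into contiguous pieces. In the example above, each row of 8 elements is contiguous, so something like the following sketch might work (assuming `global1` is a row-major 4x8 array of floats and a single 32-thread block; all names are illustrative):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch: decompose the 4x8 block into 4 contiguous rows and let each
// 8-thread tile stage one row asynchronously.
__global__ void stage_rows(const float (*global1)[8]) {
    __shared__ float shared[4][8];
    auto block = cg::this_thread_block();
    auto row_tile = cg::tiled_partition<8>(block);  // threads 0-7, 8-15, ...
    unsigned row = block.thread_rank() / 8;         // laneId / 8

    // Each 8-thread tile copies one contiguous 8-element row.
    cg::memcpy_async(row_tile, shared[row], global1[row], sizeof(float) * 8);
    cg::wait(row_tile);
    // shared[laneId/8][laneId%8] now equals global1[laneId/8][laneId%8].
}
```

Of course, if the whole 4x8 block is contiguous in memory, a single warp-wide `memcpy_async` of 32 elements would be simpler; my question is really about mappings where no such contiguous decomposition exists.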

In addition, what speedup can we expect from `memcpy_async`? Something like 2x?