Data load question

If I need to load array A from global memory into each thread's registers, will it be faster to load it into shared memory first and then into registers? (Assume there are no bank conflicts and all data are aligned.)
Here is my case. I have 4 warps per CTA, and each warp needs N bytes from array A (in gmem). But warps 0 and 1 need the same data, and warps 2 and 3 need the same data. Will it be faster to load A into smem before loading it into registers in this specific case? What about the general case?

What do you mean by "load to shared memory"? With asynchronous memory copy operations, or by loading the data with the threads and then storing it to shared memory?

If different lanes of a warp need the same bytes, those bytes only have to be loaded once. So that is efficient.
But each access should cover at least whole, aligned 32-byte blocks (coalescing), regardless of which threads need the data.
Sometimes it helps to use 8-byte or 16-byte accesses per thread to achieve that.
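
To illustrate the 16-byte accesses mentioned above, here is a minimal sketch in which each thread issues one `float4` load. The kernel name `copy_vec4`, the helper `run_copy_vec4_demo`, and the sizes are made up for illustration, and it assumes the pointers are 16-byte aligned (which `cudaMalloc` allocations are).

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every thread moves 16 bytes with a single float4 access,
// so one warp reads 32 * 16 = 512 contiguous, aligned bytes.
// Assumes `in` and `out` are 16-byte aligned and n4 is the number of float4 elements.
__global__ void copy_vec4(const float4* __restrict__ in,
                          float4* __restrict__ out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];   // one 16-byte load per thread
        out[i] = v;         // one 16-byte store per thread
    }
}

// Hypothetical host helper: runs the kernel on 1024 floats and checks the copy.
bool run_copy_vec4_demo()
{
    const int n = 1024;          // number of floats (multiple of 4)
    const int n4 = n / 4;        // number of float4 elements
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = float(i);

    float4 *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    copy_vec4<<<(n4 + threads - 1) / threads, threads>>>(d_in, d_out, n4);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i]) return false;
    return true;
}
```

Whether wider accesses actually help depends on the access pattern, so it is worth measuring rather than assuming.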

Not async. I will try loading to smem and then to registers. Thanks!

I made a mistake.
You wrote about warps needing the same data, not threads.

Then you should either exchange the data through shared memory or trust the L1 cache.
For the shared memory variant, let only the even warps load the data from global memory and write it to shared memory.
Afterwards, let only the odd warps load the data from shared memory.
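
A minimal sketch of that even/odd scheme, using a block-wide `__syncthreads()` between the write and the read. It assumes 4 warps per CTA and, purely for illustration, that each warp needs 128 bytes (one `float` per lane); the names `share_even_odd` and `run_share_even_odd_demo` are hypothetical.

```cuda
#include <cuda_runtime.h>

constexpr int WARPS_PER_CTA = 4;
constexpr int WARP_SIZE = 32;

// Hypothetical kernel: warps 0/1 need the same 32 floats, warps 2/3 the next 32.
// Even warps read from global memory and publish to shared memory;
// odd warps pick the values up from shared memory after the barrier.
__global__ void share_even_odd(const float* __restrict__ a, float* __restrict__ out)
{
    __shared__ float buf[WARPS_PER_CTA / 2][WARP_SIZE];

    int warp = threadIdx.x / WARP_SIZE;
    int lane = threadIdx.x % WARP_SIZE;
    int pair = warp / 2;                  // warps 0,1 -> pair 0; warps 2,3 -> pair 1

    float v = 0.0f;
    if ((warp & 1) == 0) {                // even warp: load from gmem, keep in a register
        v = a[(blockIdx.x * (WARPS_PER_CTA / 2) + pair) * WARP_SIZE + lane];
        buf[pair][lane] = v;              // ...and store a copy for the odd warp
    }
    __syncthreads();                      // block-wide sync, reached by all warps
    if (warp & 1) {                       // odd warp: read the copy from smem
        v = buf[pair][lane];
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

// Hypothetical host helper: one CTA, checks that both warps of a pair got the same data.
bool run_share_even_odd_demo()
{
    const int threads = WARPS_PER_CTA * WARP_SIZE;
    const int n_in = (WARPS_PER_CTA / 2) * WARP_SIZE;
    float h_a[n_in], h_out[threads];
    for (int i = 0; i < n_in; ++i) h_a[i] = float(i);

    float *d_a, *d_out;
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);

    share_even_odd<<<1, threads>>>(d_a, d_out);

    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_out);

    for (int t = 0; t < threads; ++t) {
        int pair = t / (2 * WARP_SIZE);
        if (h_out[t] != h_a[pair * WARP_SIZE + t % WARP_SIZE]) return false;
    }
    return true;
}
```

This only shows the mechanics; whether it beats letting all four warps load from global memory and relying on L1 has to be measured for the actual N.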

For the synchronization you can either use a block-wide sync (`__syncthreads()`) or use inline PTX with numbered barriers, one per pair of warps. The second option is slightly faster, as not all warps are blocked.
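
A minimal sketch of the numbered-barrier variant, using the PTX `bar.sync` instruction with an explicit barrier id and thread count. The helper `pair_barrier`, the kernel `share_with_named_barriers`, the choice of ids 1 and 2 (leaving barrier 0 to `__syncthreads()`), and the 128 bytes per warp are all assumptions for illustration.

```cuda
#include <cuda_runtime.h>

constexpr int WARPS_PER_CTA = 4;
constexpr int WARP_SIZE = 32;

// Hypothetical helper: synchronizes only the two warps (64 threads) of one pair.
// Barrier ids 1.. are used here because barrier 0 is the one __syncthreads() uses.
__device__ __forceinline__ void pair_barrier(int pair)
{
    int id = 1 + pair;
    int count = 2 * WARP_SIZE;   // thread count must be a multiple of the warp size
    asm volatile("bar.sync %0, %1;" :: "r"(id), "r"(count) : "memory");
}

// Hypothetical kernel: the even warp of each pair loads from gmem and writes to smem,
// the odd warp of the same pair reads from smem. Other pairs are not blocked.
__global__ void share_with_named_barriers(const float* __restrict__ a,
                                          float* __restrict__ out)
{
    __shared__ float buf[WARPS_PER_CTA / 2][WARP_SIZE];

    int warp = threadIdx.x / WARP_SIZE;
    int lane = threadIdx.x % WARP_SIZE;
    int pair = warp / 2;

    float v = 0.0f;
    if ((warp & 1) == 0) {
        v = a[(blockIdx.x * (WARPS_PER_CTA / 2) + pair) * WARP_SIZE + lane];
        buf[pair][lane] = v;
    }
    pair_barrier(pair);          // both warps of the pair arrive here
    if (warp & 1) {
        v = buf[pair][lane];
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

// Hypothetical host helper: one CTA, verifies the result on the host.
bool run_named_barrier_demo()
{
    const int threads = WARPS_PER_CTA * WARP_SIZE;
    const int n_in = (WARPS_PER_CTA / 2) * WARP_SIZE;
    float h_a[n_in], h_out[threads];
    for (int i = 0; i < n_in; ++i) h_a[i] = float(i);

    float *d_a, *d_out;
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);

    share_with_named_barriers<<<1, threads>>>(d_a, d_out);

    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_out);

    for (int t = 0; t < threads; ++t) {
        int pair = t / (2 * WARP_SIZE);
        if (h_out[t] != h_a[pair * WARP_SIZE + t % WARP_SIZE]) return false;
    }
    return true;
}
```

All threads of both warps in a pair must reach the same `bar.sync` with the same id and count, so keep the call outside any lane-divergent branch.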