Global memory access

Hello. Tell me how to do this better. In my kernel, each half-warp loads data from its own 128-byte segment in global memory. The array in global memory is aligned to 128 bytes, and the data elements are 64-bit.
In other words, I want each half-warp to fetch its 128 bytes in a single transaction.
Is that feasible? Or is it better to fetch the first 64 bytes (16 threads × 4 bytes each) and then the next 64 bytes — would that be faster?

This post somewhat addresses the issue:

https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-increase-performance-with-vectorized-memory-access/

In general, if you can have each thread load a 128-bit segment (16 bytes), this will usually be faster than loading a 32-bit (4-byte) or 64-bit (8-byte) word per thread.

For 64-bit data, you can perform vectorized loads using the double2 type for floating point or the ulonglong2 type for unsigned integers.
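As a rough sketch (not from the original post), a copy kernel for 64-bit elements using ulonglong2 might look like this; each thread reads one 16-byte vector, so a half-warp of 16 threads moves 256 bytes per load instruction. It assumes the pointers are 16-byte aligned and, for simplicity, that the element count is even:

```cuda
// Sketch: copy 64-bit elements using 128-bit vectorized loads.
// Each thread loads one ulonglong2 (two 64-bit values, 16 bytes).
// Assumes src/dst are 16-byte aligned and n is even.
__global__ void copy_u64_vectorized(unsigned long long *dst,
                                    const unsigned long long *src,
                                    size_t n /* number of 64-bit elements */)
{
    size_t tid  = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t nvec = n / 2;  // number of ulonglong2 elements

    // Reinterpret the 64-bit arrays as arrays of 128-bit vectors.
    const ulonglong2 *src2 = reinterpret_cast<const ulonglong2 *>(src);
    ulonglong2 *dst2       = reinterpret_cast<ulonglong2 *>(dst);

    // Grid-stride loop; each assignment compiles to a single
    // 128-bit load and a single 128-bit store.
    for (size_t v = tid; v < nvec; v += (size_t)gridDim.x * blockDim.x) {
        dst2[v] = src2[v];
    }
}
```

The same pattern works with double2 for floating-point data; the vectorized-memory-access blog post linked above walks through this technique in more detail.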

Thank you very much for the help!