A beginner's question

I have a beginner’s question. I need to set up an application in which each thread in a warp has to read 32 bytes from global memory. From what I have read, coalescing allows a 128-byte block to be transferred through the cache and read by half a warp, so it works well if each thread in the half-warp reads 8 bytes.
Does that mean that I need 4 such memory transactions?
Will that significantly degrade performance?
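
To make the arithmetic behind the question concrete, here is a minimal host-side sketch in plain C++. It only counts how many aligned 128-byte blocks a group of threads touches, assuming thread t reads a contiguous chunk starting at base + t * bytes_per_thread. The function name segments_touched and its parameters are hypothetical, not part of any CUDA API, and the real transaction size and count depend on the GPU architecture and its cache configuration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical helper (not a CUDA API): counts the distinct aligned segments
// of `segment_bytes` touched when `threads` consecutive threads each read
// `bytes_per_thread` contiguous bytes, with thread t starting at
// base + t * bytes_per_thread. This is byte counting only; it does not model
// how a particular GPU actually issues memory transactions.
std::size_t segments_touched(std::size_t base, std::size_t threads,
                             std::size_t bytes_per_thread,
                             std::size_t segment_bytes) {
    assert(segment_bytes > 0 && bytes_per_thread > 0);
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < threads; ++t) {
        std::size_t first = base + t * bytes_per_thread;   // first byte read by thread t
        std::size_t last = first + bytes_per_thread - 1;   // last byte read by thread t
        for (std::size_t s = first / segment_bytes; s <= last / segment_bytes; ++s) {
            segments.insert(s);                            // record every segment overlapped
        }
    }
    return segments.size();
}
```

Under these assumptions, segments_touched(0, 16, 8, 128) gives 1 (the half-warp case described above), segments_touched(0, 16, 32, 128) gives 4, and a full warp, segments_touched(0, 32, 32, 128), gives 8. A base address that is not 128-byte aligned adds one more block, for example segments_touched(64, 16, 32, 128) gives 5.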

Many thanks in advance

This has been answered in the CUDA section here:
https://devtalk.nvidia.com/default/topic/1056674/cuda-programming-and-performance/beginners-question/post/5357669/#5357669