I am new to CUDA and have a question about coalesced memory reads.
I’m working on a sum reduction kernel as described in.
On page 18 the code is optimized so that each thread reads 2 elements from global memory. As I understand it, each request is coalesced, so each half-warp makes 2 requests of 64 bytes.
Would it be possible to coalesce it all so that each half-warp makes only 1 request of 128 bytes instead?
As I understand it, this would improve performance, since only 1 memory read would be necessary.
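
To make the question concrete, here is a minimal sketch of what I have in mind. It is not the code from the slides: the names `reduceInt2` and `sumWithInt2Loads` are made up for illustration, and I am assuming 4-byte `int` elements, an even element count, and a power-of-two block size. Each thread reads two adjacent elements with a single 8-byte `int2` load, so a half-warp of 16 threads touches one contiguous 128-byte range. Whether the hardware actually services that as a single transaction is exactly what I am unsure about.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread loads two adjacent ints as one int2
// (a single 8-byte load), adds them, then the block does a shared-memory
// tree reduction. Assumes blockDim.x is a power of two.
__global__ void reduceInt2(const int2* in, int* out, unsigned int nPairs)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;

    int sum = 0;
    if (i < nPairs) {
        int2 v = in[i];      // one vectorized load instead of two scalar loads
        sum = v.x + v.y;     // first add happens during the load step
    }
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory with sequential addressing.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: copies the data to the device, launches the
// kernel, and sums the per-block partial results on the CPU.
// Error checking is omitted to keep the sketch short.
int sumWithInt2Loads(const std::vector<int>& h)
{
    assert(h.size() % 2 == 0);            // assumption: even element count
    if (h.empty()) return 0;

    const unsigned int threads = 128;     // assumption: power of two
    const unsigned int nPairs = static_cast<unsigned int>(h.size() / 2);
    const unsigned int blocks = (nPairs + threads - 1) / threads;

    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, h.size() * sizeof(int));   // cudaMalloc memory is aligned well enough for int2
    cudaMalloc(&dOut, blocks * sizeof(int));
    cudaMemcpy(dIn, h.data(), h.size() * sizeof(int), cudaMemcpyHostToDevice);

    reduceInt2<<<blocks, threads, threads * sizeof(int)>>>(
        reinterpret_cast<const int2*>(dIn), dOut, nPairs);

    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    int total = 0;
    for (int p : partial) total += p;
    return total;
}
```

One difference from the version in the slides, as I understand it: there the two elements a thread reads are a whole block width apart, while here they are adjacent, which is what allows the single `int2` load.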