My kernel has N threads, and it must read an array of N+1 floats on each invocation. The array is allocated in global memory with MemAllocPitch to meet the alignment requirements. The problem is that I am not sure how to get coalesced access with this pattern. So far my code does the following:
Thread i loads element i from the array into shared memory (i from 0 to N-1).
If i == N-1, the thread also loads element N.
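To make the pattern concrete, here is a minimal sketch of the two steps above, not my actual code. Everything named here is an assumption for illustration: the kernel name `loadNPlusOne`, the helper `runExample`, a single block of N threads (so N must not exceed the block-size limit), a plain `cudaMalloc` allocation instead of a pitched one, and the adjacent-difference computation, which only stands in for whatever the kernel really does with the data. Error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one block of n threads stages n + 1 floats in shared memory.
__global__ void loadNPlusOne(const float* in, float* out, int n)
{
    extern __shared__ float tile[];      // n + 1 floats, sized at launch time
    int i = threadIdx.x;

    if (i < n)      tile[i] = in[i];     // thread i loads element i
    if (i == n - 1) tile[n] = in[n];     // the last thread also loads element n
    __syncthreads();                     // make all n + 1 values visible to the block

    // Placeholder use of the staged data: difference of adjacent elements.
    if (i < n) out[i] = tile[i + 1] - tile[i];
}

// Hypothetical host helper: fills the input with k*k, runs the kernel, returns the output.
std::vector<float> runExample(int n)
{
    std::vector<float> hIn(n + 1), hOut(n);
    for (int k = 0; k <= n; ++k) hIn[k] = float(k) * float(k);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn,  (n + 1) * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), (n + 1) * sizeof(float), cudaMemcpyHostToDevice);

    // One block, n threads, (n + 1) floats of dynamic shared memory.
    loadNPlusOne<<<1, n, (n + 1) * sizeof(float)>>>(dIn, dOut, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return hOut;
}
```

Calling `runExample(256)` launches 256 threads that together read 257 floats; the single extra read by thread N-1 is the access I am asking about.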
Is the last thread breaking the coalesced memory access? If so, would a texture-based access pattern improve performance?