I am having some difficulty wrapping my head around how to achieve the best memory coalescing possible. My broader posts have not received any responses, so I decided to try to break it down into smaller chunks and see if that helps…
Consider an application that sums two adjacent elements from each of two arrays:
float A[SOME_SIZE];
float B[SOME_SIZE];
float C[SOME_SIZE];
for ( int i=0; i<SOME_SIZE-1; ++i )
{
    C[i] = A[i] + A[i+1] + B[i] + B[i+1];
}
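When translating this to GPU code, I understand that I should first copy from global memory to shared memory, so that the global reads coalesce and the “i+1” reads are served from shared memory. What I am uncertain about is whether it makes any difference, as far as coalescing is concerned, that the accesses are spread across three separate arrays (A, B, and C) rather than one.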
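To make the shared-memory idea concrete, here is a rough sketch of the indexing I have in mind, written as plain C that emulates one block at a time rather than as real kernel code. TILE, add_neighbors_reference, and add_neighbors_tiled are names I made up for illustration, and the local tile arrays stand in for what would be __shared__ arrays on the device. This is only my assumption of how the staging would look, not a tuned implementation.

```c
#include <stddef.h>

#define TILE 256  /* hypothetical threads-per-block, chosen only for illustration */

/* Reference: the original CPU loop. Writes C[0 .. n-2]. */
void add_neighbors_reference(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i = 0; i + 1 < n; ++i)
        C[i] = A[i] + A[i + 1] + B[i] + B[i + 1];
}

/* CPU emulation of the tiled scheme: each outer iteration plays the role of
 * one thread block, each inner iteration the role of one thread. */
void add_neighbors_tiled(const float *A, const float *B, float *C, size_t n)
{
    /* Stand-ins for per-block shared memory: TILE elements plus one halo
     * element so that the last thread can read its "i+1" neighbour. */
    float tileA[TILE + 1];
    float tileB[TILE + 1];

    for (size_t base = 0; base + 1 < n; base += TILE) {
        /* Number of input elements this block stages (at most TILE + 1). */
        size_t count = n - base;
        if (count > TILE + 1)
            count = TILE + 1;

        /* Staging phase: "thread" t reads consecutive element base + t,
         * which is the access pattern that is meant to coalesce. */
        for (size_t t = 0; t < count; ++t) {
            tileA[t] = A[base + t];
            tileB[t] = B[base + t];
        }

        /* A real kernel would need a barrier here before reading the tiles. */

        /* Compute phase: both i and i+1 reads come from the staged tiles. */
        for (size_t t = 0; t + 1 < count; ++t)
            C[base + t] = tileA[t] + tileA[t + 1] + tileB[t] + tileB[t + 1];
    }
}
```

In an actual kernel the outer loop would become the grid of blocks, the inner loops would become the threads of a block, and a __syncthreads() call would separate the staging phase from the compute phase. Both functions add the four terms in the same order, so they should produce identical results.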
For example, I recall that you can achieve memory coalescing by reading a multiple of 64K ahead of your current position. Does the same concept apply when reading from multiple arrays? In other words, should the sizes of A, B, and C be multiples of 64K in order to achieve memory coalescing when reading/writing A[i], B[i], and C[i]?
Please let me know if I have muddled my question. I will happily try to revise if I’ve made it difficult to understand…
Thanks!