a number of my kernels start with each thread consuming an array element together with its neighbouring element:
d1 = array[x + threadIdx.x];
d2 = array[x + threadIdx.x + 1];
y = f(d1, d2);
if the whole warp participates, reading both d1 and d2 should amount to 2 reads and no more, if i am not mistaken
if multiple warps participate, the theoretical minimum number of reads is (number of warps + 1), if i am not mistaken, since every d2 except the last thread's is also some other thread's d1
so, is there an argument for pre-reading the data into shared memory first?
or can one simply issue a synchronization call between reading d1 and d2, such that most of the d2 reads hit the cache?
is the synchronization call even necessary?
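for reference, here is a rough sketch of what i mean by pre-reading into shared memory. it is only an illustration under assumptions that are not in my real kernels: the element type is float, the block size is fixed at 256 (BLOCK), f is a placeholder that just adds its arguments, and run_neighbour_demo is a made-up host helper that compares against a cpu loop. error checking on the cuda calls is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block (assumed, not from the real kernels)

// placeholder for f(d1, d2); the real function is not shown in the post
__host__ __device__ inline float f(float d1, float d2) { return d1 + d2; }

// each block stages BLOCK + 1 elements in shared memory, then every thread
// takes its own element and its right-hand neighbour from the tile
__global__ void neighbour_shared(const float* array, float* out, int n)
{
    __shared__ float tile[BLOCK + 1];
    int x = blockIdx.x * blockDim.x;   // first element handled by this block
    int i = x + threadIdx.x;

    if (i < n) tile[threadIdx.x] = array[i];
    // thread 0 also fetches the one extra element past the end of the block
    if (threadIdx.x == 0 && x + blockDim.x < n)
        tile[blockDim.x] = array[x + blockDim.x];

    // required in this variant: d2 is a value written by a different thread
    __syncthreads();

    if (i + 1 < n) {
        float d1 = tile[threadIdx.x];
        float d2 = tile[threadIdx.x + 1];
        out[i] = f(d1, d2);
    }
}

// hypothetical host helper: runs the kernel on n floats and returns the
// number of results that differ from a plain cpu loop
int run_neighbour_demo(int n)
{
    std::vector<float> h(n), r(n, 0.0f);
    for (int i = 0; i < n; ++i) h[i] = 0.5f * i;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, n * sizeof(float));

    int blocks = (n + BLOCK - 1) / BLOCK;
    neighbour_shared<<<blocks, BLOCK>>>(d_in, d_out, n);  // must launch with BLOCK threads
    cudaMemcpy(r.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int bad = 0;
    for (int i = 0; i + 1 < n; ++i)
        if (r[i] != f(h[i], h[i + 1])) ++bad;
    return bad;
}
```

in this variant the __syncthreads() is clearly needed, because each thread reads a tile entry that another thread wrote. my question is whether this staging buys anything over the two direct global reads above, where no thread depends on data produced by another thread.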