From the programming guide, it seems that reads are parallelized only when the threads read contiguous memory. Why is that?
My understanding is that if thread n reads position p of global memory, thread n+1 should read position p+1, where each position can be one, two, or four 32-bit words wide (4, 8, or 16 bytes), since I can use int, int2, or int4.
Is it right?
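To make the pattern I mean concrete, here is a minimal sketch (the kernel name and parameters are made up for illustration, not taken from the guide). Thread n reads element n, so the 32 threads of a warp together touch 32 consecutive ints:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: thread i reads in[i] and writes out[i].
// Neighbouring threads therefore read neighbouring 4-byte addresses.
__global__ void copy_consecutive(const int* in, int* out, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {          // guard against a partially filled last block
        out[i] = in[i];
    }
}
```

I would launch it as something like `copy_consecutive<<<blocks, 32>>>(d_in, d_out, count)`, with `d_in` and `d_out` allocated by `cudaMalloc`.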
Also, if thread n reads positions p, p+1, p+2, p+3 (each an integer) and thread n+1 reads p+4, p+5, p+6, p+7, and so on, the parallelization does not seem to happen. But if I change it so that thread n reads an int4 at position p and thread n+1 reads the next int4, will the parallelization happen? In other words, is reading four separate integers per thread the same as reading one int4 per thread?
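Here is a sketch of the two cases I am comparing. The kernel names are hypothetical, and I am assuming the buffer comes from `cudaMalloc` (so it is suitably aligned for `int4`) and that its length is a multiple of four ints:

```cuda
#include <cuda_runtime.h>

// Case A: each thread reads four separate ints at p, p+1, p+2, p+3.
// Within one load instruction, neighbouring threads are 16 bytes apart.
__global__ void sum_four_ints(const int* in, int* out, int num_threads)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < num_threads) {
        const int* p = in + 4 * t;
        out[t] = p[0] + p[1] + p[2] + p[3];
    }
}

// Case B: each thread reads the same 16 bytes as a single int4.
// Assumes `in` is 16-byte aligned.
__global__ void sum_one_int4(const int4* in, int* out, int num_threads)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < num_threads) {
        int4 v = in[t];
        out[t] = v.x + v.y + v.z + v.w;
    }
}
```

Both kernels read exactly the same bytes and produce the same sums; the only difference is whether each thread asks for them as four 4-byte loads or as one 16-byte load. For case B, I would pass the same device pointer cast with `reinterpret_cast<const int4*>(d_in)`.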
Thanks!