Well, you first asked whether
ind = ( threadIdx.y + blockIdx.y * blockDim.y ) * NumberOfCols + ( threadIdx.x + blockIdx.x * blockDim.x)
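would be coalesced.
I replied that I doubt that it is, at least not in general: neighbouring threadIdx.x values touch neighbouring elements, but each step in threadIdx.y moves the address by NumberOfCols elements, so a warp that spans more than one row does not read one contiguous run, if I am not mistaken.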
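To make the spacing concrete, here is a minimal host-side sketch that prints the ind each lane of one warp would compute. The helper name flatIndex, the matrix width of 64 and the 4 x 8 block shape are my own assumptions for illustration, not taken from your code; the only thing relied on is that threads are numbered with threadIdx.x varying fastest when warps are formed.

```cuda
#include <cstdio>

// Same formula as in the question. flatIndex is a hypothetical helper name.
__host__ __device__ inline int flatIndex(int tx, int ty, int bx, int by,
                                         int bdx, int bdy, int NumberOfCols)
{
    int row = ty + by * bdy;   // global row of this thread
    int col = tx + bx * bdx;   // global column of this thread
    return row * NumberOfCols + col;
}

int main()
{
    const int NumberOfCols = 64;   // assumed matrix width
    const int bdx = 4, bdy = 8;    // assumed block shape: 32 threads, i.e. one warp

    // Lanes of a warp follow the linear thread id: threadIdx.x + threadIdx.y * blockDim.x.
    for (int lane = 0; lane < bdx * bdy; ++lane) {
        int tx = lane % bdx;
        int ty = lane / bdx;
        printf("lane %2d (x=%d, y=%d) -> ind %d\n",
               lane, tx, ty, flatIndex(tx, ty, 0, 0, bdx, bdy, NumberOfCols));
    }
    return 0;
}
```

With this narrow block, lanes 0 to 3 map to ind 0 to 3 and lane 4 jumps to ind 64, so the warp touches eight separate rows. If blockDim.x were 32 or more, a whole warp would sit inside one row and its 32 ind values would be consecutive, which is the case that should coalesce.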
Secondly, you asked:
“how can I assure that myshared[0+1] will refer to myglobal[0] and not myglobal[4] for example?”
I replied that, given the very definition of ind, its value would likely differ for each thread; hence each thread would calculate and store its own ind value, I would think.
If this is so, each thread would store ind either in a local variable or in shared memory; the latter case would probably imply an array, something like ind[i], where i is the thread's own slot.
If each thread uses its ind to access global memory, and its ind value is stored in shared rather than local memory, you would read global memory as myshared[i] = myglobal[ind[i]].
Put differently, for any given thread, ind declared as a local variable should hold the same value as that thread's entry ind[i] in shared memory (ind == ind[i]).
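A rough sketch of what I mean, under assumptions of my own: the kernel name copyViaShared, the array indShared, the 16 x 16 block and the 32 x 64 matrix are all hypothetical, and the matrix dimensions are chosen as multiples of the block size so no bounds check is needed. Each thread computes ind locally, parks it in its own shared slot, and reads global memory through that slot.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed block edge; matrix sizes below are multiples of it

__global__ void copyViaShared(const float *myglobal, float *out, int NumberOfCols)
{
    __shared__ float myshared[TILE * TILE];
    __shared__ int   indShared[TILE * TILE];  // hypothetical shared copy of each thread's ind

    // i: this thread's slot within the block; ind: its element in global memory.
    int i   = threadIdx.y * blockDim.x + threadIdx.x;
    int ind = (threadIdx.y + blockIdx.y * blockDim.y) * NumberOfCols
            + (threadIdx.x + blockIdx.x * blockDim.x);

    indShared[i] = ind;                      // each thread writes and reads only its own slot
    myshared[i]  = myglobal[indShared[i]];   // same element as myglobal[ind]
    out[ind]     = myshared[i];
}

// Runs the kernel once and returns how many elements differ from the input.
int runCopy()
{
    const int NumberOfCols = 64, NumberOfRows = 32;  // assumed sizes
    const int n = NumberOfCols * NumberOfRows;
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int k = 0; k < n; ++k) h_in[k] = static_cast<float>(k);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(NumberOfCols / TILE, NumberOfRows / TILE);
    copyViaShared<<<grid, block>>>(d_in, d_out, NumberOfCols);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int k = 0; k < n; ++k) mismatches += (h_out[k] != h_in[k]);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", runCopy());
    return 0;
}
```

Since a thread only ever touches its own slot i, no __syncthreads() is needed in this sketch; you would need one as soon as a thread reads a slot written by another thread, such as myshared[i + 1].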
Do all threads in a block use the same ind, or does each thread calculate its own ind? I would think the latter.
Perhaps you should be more specific as to why you think “myshared[0+1] will refer to myglobal[0] and not myglobal[4] for example”