Suppose the storage of A in global memory looks like this:
0.1 0.2 0.3 0.4
1.1 1.2 1.3 1.4
2.1 2.2 2.3 2.4
3.1 3.2 3.3 3.4
In your original formulation, the thread 0 rA variable will contain (after the loop processing is complete):
0.1 1.1 2.1 3.1
i.e. it will contain column 0 of the A matrix.
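(For reference, the per-element load that produces that column-wise result would presumably be something along these lines; gtx, lda, N and rA here are placeholders mirroring the vectorized version below, not necessarily your exact code:)

#pragma unroll
for(int i = 0; i < N; i++)
    rA[i] = A[gtx + i * lda];   // one double per iteration, stride of lda between successive loads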
If we did a refactoring to do a vector load of double2, it might look approximately like this:
double2 *rA2 = reinterpret_cast<double2 *>(&(rA[0]));  // view the register array as double2
double2 *A2  = reinterpret_cast<double2 *>(&(A[0]));   // view A as double2 (A must be suitably aligned for double2)
int N2 = N/2;                                          // half as many iterations, 2 doubles per load
#pragma unroll
for(int i = 0; i < N2; i++)
    rA2[i] = A2[gtx + i * lda];
If you went that way, then rA for thread 0 would have:
0.1 0.2 1.1 1.2
i.e. it would not match the previous case. This is because a vector load, by its nature, loads adjacent quantities; that is how a vector type is stored in memory. It cannot duplicate a column-wise load.
You could fix this either by reorganizing the data originally stored in A, or else by moving data around between threads after the vector load loop is complete (a rough sketch of the second idea appears at the end of this post).
This is a lot of work for something that I suspect will lead to very little benefit. Therefore I don't intend to flesh it out any further.
(The actual values in rA might not be 0.1 0.2 1.1 1.2, but the first two values would be 0.1 0.2, and that is enough to demonstrate the discrepancy. Determining exactly what would be in each thread would depend on a more complete case than what you or I have shown, and as I've already indicated, I don't intend to expend the effort to clarify it any further.)
There are situations where data needs to be loaded row-wise and then converted to column-wise storage across threads. In that case you can find various forum questions that discuss how to do this (transposing data, with some granularity). Here is one. And the trove library also has this topic in view. Again, I would not recommend taking on this additional complexity for anything I have seen so far in this thread.
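Purely as an illustration of the "transpose with granularity 2" idea, here is a minimal sketch showing how pairs of adjacent threads could swap one element each with __shfl_xor_sync after a double2 vector load, so that each thread ends up holding a 2-element piece of a column instead of a row. Everything here (the 32x2 matrix shape, the single-warp launch, the names) is hypothetical and not taken from your code:

#include <cstdio>
#include <cuda_runtime.h>

// one warp; matrix is 32 rows x 2 columns, stored row-major in A
__global__ void pairwise_transpose(const double *A, double *out)
{
    int lane = threadIdx.x;                                    // 0..31
    // vector load: thread 'lane' reads row 'lane' = (A[2*lane], A[2*lane+1])
    double2 v = reinterpret_cast<const double2 *>(A)[lane];

    // even lanes send v.y and receive the odd neighbor's v.x, and vice versa
    double send = (lane & 1) ? v.x : v.y;
    double recv = __shfl_xor_sync(0xFFFFFFFFu, send, 1);

    double2 col;
    if (lane & 1) { col.x = recv; col.y = v.y;  }  // odd lane: column 1 of its 2x2 tile
    else          { col.x = v.x;  col.y = recv; }  // even lane: column 0 of its 2x2 tile

    reinterpret_cast<double2 *>(out)[lane] = col;
}

int main()
{
    const int rows = 32, cols = 2;
    double h_A[rows * cols], h_out[rows * cols];
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            h_A[r * cols + c] = r + 0.1 * (c + 1);   // 0.1 0.2 / 1.1 1.2 / ... pattern

    double *d_A, *d_out;
    cudaMalloc((void **)&d_A, sizeof(h_A));
    cudaMalloc((void **)&d_out, sizeof(h_out));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);
    pairwise_transpose<<<1, 32>>>(d_A, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    printf("thread 0 holds: %.1f %.1f   thread 1 holds: %.1f %.1f\n",
           h_out[0], h_out[1], h_out[2], h_out[3]);   // expect 0.1 1.1 and 0.2 1.2
    return 0;
}

Extending this to larger granularities or more columns is the kind of bookkeeping that trove automates, and it is the additional complexity referred to above.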