Matrix Multiplication -- Why do we 'flatten' matrices into a linear space? Tradeoffs between

Suppose I have two square matrices, A (2x2) and B (2x2). The product AB is stored in C.

If I declare the arrays in C with the notation:

```
int A[2][2];
int B[2][2];
int C[2][2];
```

I can easily refer to elements using two subscripts, e.g. C[y][x]. However, I see in numerous examples (including the NVIDIA SDK) that programmers flatten these 2D matrices into a 1D array (where one can choose row-major or column-major order).

Why do we do this? Is it easier to manipulate matrix elements by referring to one index (using x + y * Dim.x)? Performance-wise, would it be faster to compute C[x + y * Dim.x] than C[y][x]?

Thanks

When we do it with static arrays, the compiler is able to flatten it into a single linear array and do the indexing for you. When you do it with dynamic arrays allocated with malloc/new, the compiler has no idea what the dimensions are (they change at runtime), so it can’t do the indexing for you. You have to flatten it yourself and do the indexing yourself.
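To make the "do the indexing yourself" part concrete, here is a minimal sketch in plain C of a heap-allocated square matrix stored as one flat row-major block. The names `idx`, `matrix_alloc`, and `matmul_flat` are made up for this illustration and are not taken from the SDK samples.

```c
#include <assert.h>
#include <stdlib.h>

/* Row-major offset of element (row, col) in a matrix with `cols` columns. */
size_t idx(size_t row, size_t col, size_t cols) {
    return row * cols + col;
}

/* Allocate an n x n int matrix as one contiguous, zero-initialised block.
   Returns NULL if the allocation fails. */
int *matrix_alloc(size_t n) {
    return calloc(n * n, sizeof(int));
}

/* C = A * B for n x n row-major matrices stored as flat arrays. */
void matmul_flat(const int *A, const int *B, int *C, size_t n) {
    assert(A && B && C);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            int sum = 0;
            for (size_t k = 0; k < n; ++k)
                sum += A[idx(i, k, n)] * B[idx(k, j, n)];
            C[idx(i, j, n)] = sum;
        }
    }
}
```

Because the dimension `n` is only known at runtime, it has to be passed around explicitly; that is the bookkeeping the compiler does for you when the array is declared as `int A[2][2]`.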

Thanks. I forgot about static vs. non-static arrays (created with new/malloc). After some additional research, I also found that, in the end, it's simply easier to deal with a 1D array.

I was under the impression that A[i][j] was nothing more than `*(*(A+i)+j)`.

In fact, I do this for dynamically allocated arrays all the time in CPU host code.
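For reference, one common way to get double-subscript access on dynamically allocated data in host code is a table of row pointers into a single block. The sketch below assumes that layout; `alloc_rows` and `free_rows` are hypothetical helper names, not anything from CUDA or the SDK.

```c
#include <stdlib.h>

/* Allocate rows x cols ints as ONE contiguous block, plus a table of
   pointers to the start of each row. Returns NULL on failure. */
int **alloc_rows(size_t rows, size_t cols) {
    if (rows == 0 || cols == 0)
        return NULL;
    int **table = malloc(rows * sizeof *table);
    int *data = calloc(rows * cols, sizeof *data);
    if (!table || !data) {
        free(table);
        free(data);
        return NULL;
    }
    for (size_t i = 0; i < rows; ++i)
        table[i] = data + i * cols;   /* row i starts i*cols elements in */
    return table;
}

/* Release both the data block (table[0]) and the pointer table. */
void free_rows(int **table) {
    if (table) {
        free(table[0]);
        free(table);
    }
}
```

With this layout, `A[i][j]` costs two memory loads (the row pointer, then the element), and the row pointers are host addresses, so the table cannot simply be copied to another memory space and dereferenced there.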

The real problem is that CUDA does not support pointer indirection. You cannot load an array of pointers from host to device and then load the arrays those pointers are pointing to in one seamless step.

Not on a statically declared two dimensional array (which is what the question is about). The compiler will compute the total size, allocate the space statically and use linear indexing into that static allocation. Only a single level of pointer indirection required.
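A small check in C can illustrate this, assuming a fixed 2x2 shape as in the question. The helper name `offset_matches` and the constants `ROWS` and `COLS` are invented for the example.

```c
#include <stddef.h>

enum { ROWS = 2, COLS = 2 };

/* Returns 1 if element M[i][j] sits exactly (i*COLS + j) ints past the
   start of the array, i.e. the 2D array is one linear row-major block. */
int offset_matches(int (*M)[COLS], size_t i, size_t j) {
    ptrdiff_t bytes = (const char *)&M[i][j] - (const char *)M;
    return bytes == (ptrdiff_t)((i * COLS + j) * sizeof(int));
}
```

For a declaration such as `int A[ROWS][COLS]`, this holds for every valid `(i, j)`: there is no table of row pointers in memory, only the elements themselves.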

Of course it does. How could CUDA support pointers at all if it didn’t support indirection?

That isn’t pointer indirection. That is portability of pointers between host and device memory spaces, which is a completely different issue.

Yes, but I was referring more to:

I don't understand why the compiler needs to know the dimension size. Assuming you could load the pointers and the array correctly into the GPU memory, you could write `*(*(A+i)+j)` everywhere you want A[i][j]. Why is this not done? It's OK with 2D arrays, but when you get to 4D arrays, it's quite a pain in the behind to figure out the indexing.
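For what it's worth, the 4D indexing can be hidden behind one small helper. This is only a sketch in C assuming a row-major layout (last index varies fastest); `Shape4` and `idx4` are hypothetical names.

```c
#include <stddef.h>

/* Extents of a 4D array, outermost dimension first. */
typedef struct {
    size_t n0, n1, n2, n3;
} Shape4;

/* Flat row-major offset of element (i, j, k, l). Written in nested
   (Horner) form so each step folds in one more dimension. */
size_t idx4(Shape4 s, size_t i, size_t j, size_t k, size_t l) {
    return ((i * s.n1 + j) * s.n2 + k) * s.n3 + l;
}
```

Usage would look like `data[idx4(shape, i, j, k, l)]`, where `data` is a flat block of `n0 * n1 * n2 * n3` elements. Note that the outermost extent `n0` never appears in the offset itself; it only matters for the allocation size.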

You are correct that I was abusing “pointer indirection”. Pointer to pointer within the device obviously works. I was trying to point out that you can't load host pointers into the device and automagically expect them to be properly indirected when on the GPU.

It would also be incredibly slow.