Multidimensional array allocation

Is it possible to allocate multidimensional arrays on GPU’s global memory?
Currently, I allocate 2-dimensional arrays using linear addressing, hence using just one dimension.
Then, as I need arrays of 2D arrays, I managed to do the following:
- A type* a[10] is allocated in host memory ([10] is static here, but you might also consider it dynamic; it’s the same);
- A type** a_device is allocated on the device, in this case with size 10 * sizeof(type*);
- For i = 0…9, each a[i] is allocated on the device with the correct size (width * height of the 2D array);
- At last, I copy a’s contents into a_device, and it works.

Now, for further dimensions this trick is not possible, as I cannot access a_device[n] from the host.

Is there a way to allocate multidimensional arrays on device’s memory?

CUDA’s support for multi-dimensional arrays is essentially the same as C’s.

Personally, I usually find it easier to just allocate a linear array and do the index arithmetic in code.

If you’re using 2D arrays you may find it more efficient to use 2D textures since there is hardware that will do the addressing for you.

Well, 2D linear addressing is really easy, and so would be 3D, but I am thinking of at least 5 dimensions.
Such a structure is unfortunately needed by the algorithms I am implementing.
I was also thinking of using something like [dim1*dim2][dim3*dim4], which would work the way I’m currently doing it, but it didn’t work.
It just returned 0s.

I use 4D arrays in part of my code. Note that coalescing reads starts to get a little complicated in these cases. I use cudaMallocPitch to allocate a “2D” array with width L and height Mx*My*Mz. Then I index into the array by doing all the index calculations by hand. Because of my memory access pattern, a single block accesses all elements along the L axis, and those accesses are coalesced because I used cudaMallocPitch. To get the right element, I just access the array at index (i*(Mz*My) + j*My + k)*pitch + threadIdx.x, where pitch is in elements, not bytes.

Isn’t indexing with so many calculations slow?
I’m trying to achieve really high performance, since the algorithms I’m implementing would take seconds even on an ordinary computer.

Calculations are very cheap, and uncoalesced device memory accesses are very expensive. We are talking about a factor of roughly 100 here. Sometimes it’s even cheaper to recalculate a result than to fetch it from global memory.

Oh, that’s really interesting.
Thank you for the answers.
I think I will write a few macros for more human-readable 2D/3D/4D linear mapping, and then convert my current array code to access elements in linear form.