I think it’s easiest to allocate a 1d array but then use it like a n-dimensional array. You can even cheese C++ proxy types to give 1d arrays arr[y][z] capabilities :P
@MutantJohn: Since I know array dimensions at compile time, there should be some way to get the kernel to do the subscript calculations. Of course the 1-D approach is always possible, but then I have to write the subscript calculations explicitly (possibility of error), and probably not as optimized as the compiler can achieve. This could be a significant speed difference for large arrays.
There is something in CUDA where you specify a structure containing the extent of each dimension of a 3-D array, and some functions that deal with that structure, but I have no idea how useful or efficient this technique would be.
It is definitely desirable to write code like a[i][j][k], rather than a[(i*d0+j)*d1+k].
Since CUDA is a language in the C++ family, you can construct multi-dimensional arrays in CUDA in exactly the same way as you would normally do in your C++ programming.
For various practical considerations (e.g. simplicity of host/device copies, performance), a simple contiguous 1D array with macro-based indexing is often preferred in my experience.
It specifically links to worked examples of how to do doubly-subscripted (a[i][j]) and triply-subscripted (a[i][j][k]) access, when the array dimensions are known at compile time. As @njuffa has already pointed out, this mechanism is essentially identical to how you could do it in ordinary C or C++ coding, where the array in question is used as an argument to a function (as opposed to an argument to a CUDA kernel). In this case, it should simply be letting the compiler create the array indexing calculation (computing ((i*d0+j)*d1+k)) when you do a[i][j][k], as opposed to you manually writing it yourself, so I would not expect performance to be meaningfully different in that case as opposed to the case where you do “simulated” 3D access and write your own indexing calculation. As @njuffa states, of course, it is always best to run your own experiments to verify this in your particular use case.
The performance penalties I was alluding to are those incurred by the popular (judging by the number of questions in this forum) “array of pointers to row vectors” approach to simulating 2D matrices, i.e. the use of non-contiguous storage while adding another level of indirection for each access to a matrix element.
@njuffa: Running experiments is not feasible, as we do not have the hardware. And benchmarking is only really useful if you can define all the parameters of your usage, and their number is small.
I have no idea how one can program in CUDA without hardware to run the code on. Don’t you have to set up, and run with, a test harness to ensure functional correctness?
I would suggest the following straightforward approach to performance. Where performance does not matter, program in whatever way is functional and meets other objectives. For those codes and those configurations whose performance you care for, let optimization efforts be guided by benchmarking and profiling of those.