Multi-dimensional arrays in a CUDA kernel?

I am looking for some general tips on how to deal with arrays in a CUDA kernel. The computation I want has

Inputs: 2-D array d (size 4K, 8-bit integers) and 3-D array a (size 5K, 32-bit integers)
Outputs: 2-D array s (size 1K, 64-bit integers) and 3-D array p (size 4K, 64-bit integers)

In the C++ program which I am converting to run under CUDA, the calculation of s begins with

memset(s, 0, sizeof(s));

Obviously, we should do this only once, so s should live in device memory, and perhaps be initialized by a memset-type call (cudaMemset?) from the host?

I think it’s easiest to allocate a 1-D array but then use it like an n-dimensional array. You can even use C++ proxy types to give 1-D arrays arr[y][z] capabilities :P

@MutantJohn: Since I know the array dimensions at compile time, there should be some way to get the kernel to do the subscript calculations. Of course the 1-D approach is always possible, but then I have to write the subscript calculations explicitly (possibility of error), and they are probably not as optimized as what the compiler can achieve. This could be a significant speed difference for large arrays.

There is something in CUDA where you specify a structure containing the extent of each dimension of a 3-D array, and some functions that deal with that structure, but I have no idea how useful or efficient this technique would be.

It is definitely desirable to write code like a[i][j][k], rather than a[(i*d0+j)*d1+k].

Since CUDA is a language in the C++ family, you can construct multi-dimensional arrays in CUDA in exactly the same way as you would normally do in your C++ programming.

For various practical considerations (e.g. simplicity of host/device copies, performance), a simple contiguous 1D array with macro-based indexing is often preferred in my experience.

@njuffa: Do you have any information on performance differences in the kernel for 1-D vs 3-D arrays?

My suggestion: Run some experiments in the context of your particular use case.

This particular SO item discusses a variety of ways to do multidimensional arrays in CUDA:

https://stackoverflow.com/questions/45643682/cuda-using-2d-and-3d-arrays/45644824#45644824

It specifically links to worked examples of doubly-subscripted (a[i][j]) and triply-subscripted (a[i][j][k]) access when the array dimensions are known at compile time. As @njuffa has already pointed out, this mechanism is essentially identical to what you could do in ordinary C or C++ code where the array in question is an argument to a function (as opposed to an argument to a CUDA kernel). In this case, you are simply letting the compiler generate the indexing calculation (computing (i*d0+j)*d1+k) when you write a[i][j][k], rather than writing it out yourself, so I would not expect performance to differ meaningfully from the case where you do “simulated” 3D access and write your own indexing calculation. As @njuffa says, of course, it is always best to run your own experiments to verify this in your particular use case.

The performance penalties I was alluding to are those incurred by the popular (judging by the number of questions in this forum) “array of pointers to row vectors” approach to simulating 2D matrices, i.e. the use of non-contiguous storage while adding another level of indirection for each access to a matrix element.

@njuffa: Running experiments is not feasible, as we do not have the hardware. And benchmarking is only really useful if you can define all the parameters of your usage, and their number is small.

I have no idea how one can program in CUDA without hardware to run the code on. Don’t you have to set up, and run with, a test harness to ensure functional correctness?

I would suggest the following straightforward approach to performance. Where performance does not matter, program in whatever way is functional and meets other objectives. For those codes and those configurations whose performance you care for, let optimization efforts be guided by benchmarking and profiling of those.

The levels of indirection are what I dislike.

Manually managing a buffer as a multi-dimensional array is significantly more flexible than getting something you can just [i][j][k] access.