Multi-dimensional arrays in a CUDA kernel?

I am looking for some general tips on how to deal with arrays in a CUDA kernel. The computation I want has

Inputs: 2-D array d (size 4K, 8-bit integers) and 3-D array a (size 5K, 32-bit integers)
Outputs: 2-D array s (size 1K, 64-bit integers) and 3-D array p (size 4K, 64-bit integers)

In the C++ program which I am converting to run under CUDA, the calculation of s begins with

memset(s, 0, sizeof(s));

Obviously, we should do this only once, so s should live in device memory, and perhaps be initialized by a memset-type call (cudaMemset?) from the host?

I think it’s easiest to allocate a 1-D array but then use it like an n-dimensional array. You can even use C++ proxy types to give 1-D arrays arr[y][z] capabilities :P

@MutantJohn: Since I know the array dimensions at compile time, there should be some way to get the kernel to do the subscript calculations. Of course the 1-D approach is always possible, but then I have to write the subscript calculations explicitly (possibility of error), and they are probably not as optimized as what the compiler can achieve. This could be a significant speed difference for large arrays.

There is something in CUDA where you specify a structure containing the extent of each dimension of a 3-D array, and some functions that deal with that structure, but I have no idea how useful or efficient this technique would be.

It is definitely desirable to write code like a[i][j][k], rather than a[(i*d0+j)*d1+k].

Since CUDA is a language in the C++ family, you can construct multi-dimensional arrays in CUDA in exactly the same way as you would normally do in your C++ programming.

For various practical considerations (e.g. simplicity of host/device copies, performance), a simple contiguous 1D array with macro-based indexing is often preferred in my experience.

@njuffa: Do you have any information on performance differences in the kernel for 1-D vs 3-D arrays?

My suggestion: Run some experiments in the context of your particular use case.

This particular SO item discusses a variety of ways to do multidimensional arrays in CUDA:

https://stackoverflow.com/questions/45643682/cuda-using-2d-and-3d-arrays/45644824#45644824

It specifically links to worked examples of doubly-subscripted (a[i][j]) and triply-subscripted (a[i][j][k]) access when the array dimensions are known at compile time. As @njuffa has already pointed out, this mechanism is essentially identical to what you could do in ordinary C or C++ code where the array in question is an argument to a function (as opposed to an argument to a CUDA kernel). In this case, you are simply letting the compiler generate the indexing calculation (computing (i*d0+j)*d1+k) when you write a[i][j][k], rather than writing it out yourself, so I would not expect performance to differ meaningfully from the case where you do “simulated” 3D access and write your own indexing calculation. As @njuffa says, of course, it is always best to run your own experiments to verify this in your particular use case.

The performance penalties I was alluding to are those incurred by the popular (judging by the number of questions in this forum) “array of pointers to row vectors” approach to simulating 2D matrices, i.e. the use of non-contiguous storage while adding another level of indirection for each access to a matrix element.

@njuffa: Running experiments is not feasible, as we do not have the hardware. And benchmarking is only really useful if you can define all the parameters of your usage, and their number is small.

I have no idea how one can program in CUDA without hardware to run the code on. Don’t you have to set up, and run with, a test harness to ensure functional correctness?

I would suggest the following straightforward approach to performance. Where performance does not matter, program in whatever way is functional and meets other objectives. For those codes and those configurations whose performance you care for, let optimization efforts be guided by benchmarking and profiling of those.

The levels of indirection are what I dislike.

Manually managing a buffer as a multi-dimensional array is significantly more flexible than getting something you can just [i][j][k] access.