Allocating a multidimensional array onto a device variable

Forgive me for being very new to CUDA. I started about a week and a half ago. I am learning with a specific goal in mind, so forgive me if I am attempting something complicated without knowing some of the basics required.

I want to take an array of pointers (it needs to be pointers on the host for dynamic memory; I do not know the size of the array at compile time because I am reading data from a file) and put it into a global device variable for access. Once it is in device memory the size will not change, but the values will.

I had hoped I could do this like any other global device variable using cudaMemcpyToSymbol, but I get an invalid argument error, so I assume that multidimensional arrays require a different approach. I have also read some things about flattening the array, but that is not a practical option for me because I must compare patches of data instead of just individual values. Those posts were also from 7 years ago, so I don't know whether support for copying and traversing multidimensional arrays has been added since then.

What would the code look like for this? Is this possible in CUDA? Is there something I should go back and study? Is there a good tutorial for this?

Thank you!

CUDA is a language in the C/C++ family, so multi-dimensional arrays work exactly the same as they work in C/C++.

For efficiency it is usually best to allocate an n-dimensional array as one single dynamic allocation and do the index arithmetic yourself. You can hide the details in macros, for example; that is what I do myself.

You will have to decide which storage convention you want to use: row-major or column-major. C/C++ typically uses row-major storage, but CUBLAS uses column-major storage for easy interoperability with the large body of numerical code written in Fortran, and with important middleware like MATLAB. So if you plan to use CUBLAS, I would suggest column-major storage.

I don’t understand your point about being able to “compare patches of data”. The compact storage scheme (one single allocation for entire matrix) easily lends itself to operating on arbitrary sub-matrices. Check out CUBLAS, it may already provide some or all of the functionality you need, and pretty much every API call allows operations on sub-matrices (to be clear: this is not a feature specific to CUBLAS but is a feature of all BLAS implementations).

If your matrices are sparse, take a look at CUSPARSE instead.

“using cudaMemcpyToSymbol but I get invalid argument”

said api copies to a named symbol - a __constant__ or __device__ variable; hence, when used, the api expects what you are likely not passing - a symbol in that memory space, not an ordinary host pointer
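A sketch of the usual pattern, under the assumption that the goal is a `__device__` global that kernels can read (names like `d_data` are illustrative). The key point is that the symbol stores a *pointer value*; the data itself lives in a cudaMalloc'd block, and cudaMemcpyToSymbol copies the pointer, not the data:

```cuda
#include <cuda_runtime.h>

// Global device variable holding the base pointer of the array.
__device__ float *d_data;

int main() {
    int rows = 100, cols = 200;   // in practice, sizes read from a file at run time
    float *p = nullptr;
    cudaMalloc(&p, sizeof(float) * rows * cols);   // one flat allocation on the device

    // Copy the pointer value (not the array contents) into the __device__ symbol.
    // Passing a plain host pointer where a symbol is expected is a typical
    // cause of the "invalid argument" error.
    cudaMemcpyToSymbol(d_data, &p, sizeof(p));

    // ... fill the allocation with cudaMemcpy(p, hostData, ...), launch kernels ...
    cudaFree(p);
    return 0;
}
```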

if i read your post correctly - i would not necessarily call numerous separate arrays a multi-dimensional array; even though you may have multiple arrays, i am not convinced that what you have is truly a multi-dimensional array

your problem may be more a matter of synchronization - the host needs to set up the arrays and values, and ensure that this is done and correct by the time the device accesses them

and as njuffa pointed out, a single-dimensional flat array with proper indexing may suffice:
instead of passing multiple pointers to the device, the host can pass a single pointer together with the dimensions or offsets/indices

@njuffa Nevermind that bit about patches. I think I need to play with cuda a little more before I attempt what I am trying to do in the end.

I will also study some cublas too.

In the meantime, would you be willing to direct me to a good, easy example of how to copy a multidimensional array to a global device variable? Do you still use cudaMemcpyToSymbol?

@little_jimmy I am comparing data between two 3-dimensional arrays. I was trying to be general with my question; sorry about that. Both arrays need to be copied to the device into a global variable to avoid multiple expensive copy calls - if my design is correct, that is. But maybe I should study how CUDA works before I try to fit what CUDA can do to my design. Either way, I would like to have those arrays in global memory on the device.

I am not a fan of using cudaMemcpyToSymbol() and have never used it in an actual project. You shouldn't need it other than when updating constant memory between kernel calls. If you have an n-dimensional matrix contained in a single allocation (malloc() on the host, cudaMalloc() on the device), simply copy it over with one cudaMemcpy() call, or cudaMemcpyAsync() if you are using CUDA streams to overlap host/device copies and kernel execution.
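The single-allocation transfer described above might look like this (a sketch; error checking omitted for brevity, and the sizes stand in for whatever is read from the file):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    int nx = 64, ny = 64, nz = 64;                    // sizes known only at run time
    size_t bytes = sizeof(float) * nx * ny * nz;

    float *h = (float *)malloc(bytes);                // host: one contiguous block
    // ... fill h with data read from the file ...

    float *d = nullptr;
    cudaMalloc(&d, bytes);                            // device: one contiguous block
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // one copy call for the whole array

    // ... launch kernels that take d plus the dimensions nx, ny, nz ...

    cudaFree(d);
    free(h);
    return 0;
}
```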

Where programmers typically run into issues is if they do not use the compact single-allocation method of storing matrices, but instead keep collections of row or column vectors, which are thus allocated as separate non-contiguous chunks, and need to be copied over to the GPU one vector at a time, which is not conducive to performance.

“Both arrays need to be copied to the device in a global variable”

Both arrays need to be copied to the device into a global memory array - that is, an ordinary device allocation in global memory, rather than a named global variable