# 2D Matrix operation

Hi guys:

I’m really stuck on how to allocate device memory to store data coming from the CPU. Should I do it the vector way? I’m a beginner in CUDA; is it very weird and rare to create a 2D matrix like A[ROW][COL]? In the examples I’ve seen, they all index the matrix like A[i * ROW + j].

a) from a parallel perspective, how would you access (read/ write) a 2d matrix?

b) from a parallel perspective, do A[ROW][COL] and A[i * ROW + j] differ? if so, how?

i suppose A[row][col] could be interpreted as a loose form of index notation

however, if it is strictly interpreted as what it implies: a number of arrays, n == rows, each of depth col, you may soon run into trouble, particularly when elements within the array wish to ‘wrap around’ from one row to the next; this is rather related to memory coalescing as well

i am not sure whether one would even use a[r][c] for host code, for more or less the same reason
i have heard the argument that a[r][c] caches poorly, compared to flat indexing a[r * cols + c]

There are CUDA routines to create 2D matrices.
They seem to work fine.
Bill

“There are CUDA routines to create 2D matrices”

undisputed; but this equally raises the question: how do such routines allocate the 2d matrix - as a flat 1d array, or as an array of arrays?

If you want to access a doubly-subscripted array on the device (which originated on the host):

int i = global_data[y][x];

Then this will require a somewhat involved copy sequence from host to device. If you google “CUDA 2D array” you’ll get some idea of what is involved. This falls generically in the category of mechanisms requiring a “deep copy” operation, which is essentially a nested copy.
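A hedged sketch of what that nested copy can look like (function and variable names are my own illustration; error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Deep-copy a host float** to the device so kernels can use d_data[y][x].
float **alloc_device_2d(float **h_data, int rows, int cols)
{
    // 1. Allocate each device row separately and copy its contents.
    float **h_rowptrs = (float **)malloc(rows * sizeof(float *));
    for (int r = 0; r < rows; ++r) {
        cudaMalloc(&h_rowptrs[r], cols * sizeof(float));
        cudaMemcpy(h_rowptrs[r], h_data[r], cols * sizeof(float),
                   cudaMemcpyHostToDevice);
    }
    // 2. Copy the array of device row pointers itself to the device.
    float **d_data;
    cudaMalloc(&d_data, rows * sizeof(float *));
    cudaMemcpy(d_data, h_rowptrs, rows * sizeof(float *),
               cudaMemcpyHostToDevice);
    free(h_rowptrs);
    return d_data;
}
```

Note the two levels of allocation and copy; this is exactly the per-row bookkeeping that the flattening approach below avoids.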

Because of the difficulty associated with deep-copy, it’s often suggested that you “linearize” or “flatten” your data so that it can be referenced using a single pointer (*) instead of a double pointer (**). To “simulate” 2D access to such an array, you can then do something like:

int i = global_data[y*width+x];

For beginning programmers, and without knowing anything further about your code, I think this is the most sensible recommendation.
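As a sketch of what flattened access looks like in a kernel (names are illustrative; this is a sketch, not a definitive implementation):

```cuda
// Scale every element of a flattened 2D array using the y*width+x convention.
__global__ void scale2d(float *data, int width, int height, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        data[y * width + x] *= s;   // simulated 2D access via a single pointer
}

// Host side: one cudaMalloc and one cudaMemcpy suffice -- no deep copy.
//   float *d;
//   cudaMalloc(&d, width * height * sizeof(float));
//   cudaMemcpy(d, h, width * height * sizeof(float), cudaMemcpyHostToDevice);
//   dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
//   scale2d<<<grid, block>>>(d, width, height, 2.0f);
```

A side benefit: adjacent threads in x touch adjacent addresses, which is the access pattern memory coalescing favors.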

Other approaches may include:

1. Do a full deep copy. A fully-worked generic example is given in the first answer here:

2. Use host-mapped memory. This will generally result in pretty slow access to the data.
3. Use CUDA Unified Memory (refer to the programming guide and the blog article):

http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/

4. If you know the width of your 2D array at compile time, you can create a set of typedefs that will leverage the compiler to help you, and allow you to use essentially 1D allocations and copies, while still retaining doubly-subscripted access to the data on both host and device. A worked example is given in the answer here:

http://stackoverflow.com/questions/14920931/3d-cuda-kernel-indexing-for-image-filtering

It’s actually a 3D example, but you should be able to reduce it to 2D in a straightforward manner.

CUDA routines such as cudaMallocPitch and cudaMemcpy2D, despite their names, do not directly handle doubly-subscripted (i.e. double-pointer) arrays. You can easily observe this simply by inspecting the parameters to such functions: cudaMemcpy2D does not expect double-pointer (**) parameters. Therefore they don’t directly facilitate doubly-subscripted access in device code. They have a different purpose, which is related to efficient access of data structures that have a 2D nature to them (i.e. that will be accessed in certain patterns).
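For completeness, a hedged sketch of how pitched allocations are typically indexed (identifiers are my own; error checking omitted). The pitch, in bytes, replaces the width in the index arithmetic, and single pointers appear on both sides:

```cuda
#include <cuda_runtime.h>

// Increment every element of a pitched 2D allocation.
__global__ void touch(float *d, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // step down y rows by the byte pitch, then index within the row
        float *row = (float *)((char *)d + y * pitch);
        row[x] += 1.0f;
    }
}

void upload(const float *h, int width, int height)
{
    float *d;
    size_t pitch;
    // each row is padded so rows start on aligned boundaries
    cudaMallocPitch((void **)&d, &pitch, width * sizeof(float), height);
    // note: single pointers throughout -- no ** anywhere
    cudaMemcpy2D(d, pitch, h, width * sizeof(float),
                 width * sizeof(float), height, cudaMemcpyHostToDevice);
}
```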