2D Matrix operation

Hi guys:

I’m really stuck on how to allocate device memory to store data coming from the CPU. Should I lay it out as a flat vector? I’m a beginner in CUDA; is it weird or rare to create a 2D matrix like A[ROW][COL]? The examples I’ve seen all index the matrix like A[i * ROW + j].

Thank you for your reply guys!

a) from a parallel perspective, how would you access (read/ write) a 2d matrix?

b) from a parallel perspective, do A[ROW][COL] and A[i * ROW + j] differ? if so, how?

i suppose A[row][col] could be interpreted as a loose form of index notation

however, if it is strictly interpreted as what it implies: a number of arrays, n == rows, each with depth col, you may soon run into trouble, particularly when elements within the array wish to ‘wrap around’; this is related to memory coalescing as well

i am not even sure whether one would even use a[r][c] for host code, for more or less the same reason
i have heard the argument that a[r][c] caches poorly, compared to a[r; c]

There are CUDA routines to create 2D matrices.
They seem to work fine.

“There are CUDA routines to create 2D matrices”

undisputed; but this equally begs the question: how do such routines allocate the 2d matrix - as a flat 1d array, or as an array of arrays?

If you want to access a doubly-subscripted array on the device (which originated on the host):

int i = global_data[y][x];

Then this will require a somewhat involved copy sequence from host to device. If you google “CUDA 2D array” you’ll get some idea of what is involved. This falls into the category of mechanisms requiring a “deep copy” operation, which is generally a form of nested copy: the array of row pointers and the rows they point to each have to be allocated and copied separately.
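To give a feel for why this is considered involved, here is a minimal sketch of a deep copy for an int** matrix. The names add_one and deep_copy_demo are made up for illustration and are not taken from any linked answer; error handling is kept deliberately simple.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel with doubly-subscripted access: data[y][x].
__global__ void add_one(int **data, int height, int width) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) data[y][x] += 1;
}

// Hypothetical helper: deep-copies a height x width host matrix (array of
// row pointers) to the device, adds 1 to every element, and copies it back.
bool deep_copy_demo(int **h_data, int height, int width) {
    std::vector<int *> d_rows(height, nullptr);  // host copy of the device row pointers
    int **d_data = nullptr;                      // device array of row pointers
    size_t row_bytes = width * sizeof(int);
    bool ok = true;

    // Step 1: allocate every row on the device and copy its contents.
    for (int y = 0; y < height && ok; ++y) {
        ok = cudaMalloc((void **)&d_rows[y], row_bytes) == cudaSuccess &&
             cudaMemcpy(d_rows[y], h_data[y], row_bytes,
                        cudaMemcpyHostToDevice) == cudaSuccess;
    }
    // Step 2: allocate the pointer array on the device and copy the row pointers.
    ok = ok &&
         cudaMalloc((void **)&d_data, height * sizeof(int *)) == cudaSuccess &&
         cudaMemcpy(d_data, d_rows.data(), height * sizeof(int *),
                    cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        dim3 block(16, 16);
        dim3 grid((width + 15) / 16, (height + 15) / 16);
        add_one<<<grid, block>>>(d_data, height, width);
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }
    // Step 3: copy the rows back one at a time.
    for (int y = 0; y < height && ok; ++y) {
        ok = cudaMemcpy(h_data[y], d_rows[y], row_bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int y = 0; y < height; ++y) cudaFree(d_rows[y]);  // cudaFree(nullptr) is a no-op
    cudaFree(d_data);
    return ok;
}
```

Note that there is one cudaMalloc and one cudaMemcpy per row, plus one more pair for the pointer array itself.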

Because of the difficulty associated with deep-copy, it’s often suggested that you “linearize” or “flatten” your data so that it can be referenced using a single pointer (*) instead of a double pointer (**). To “simulate” 2D access to such an array, you can then do something like:

int i = global_data[y*width+x];

For beginning programmers, and without knowing anything further about your code, I think this is the most sensible recommendation.
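As a rough illustration of the flattened approach, the sketch below keeps the whole matrix in one row-major allocation and indexes it with y*width+x. The kernel name scale and the helper scale_on_device are invented for this example; only global_data, width, x and y follow the snippet above.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one element of a row-major width x height matrix.
__global__ void scale(int *global_data, int width, int height, int factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height) {
        int i = global_data[y * width + x];         // simulated 2D access
        global_data[y * width + x] = i * factor;
    }
}

// Hypothetical helper: one allocation and one copy in each direction.
bool scale_on_device(std::vector<int> &h, int width, int height, int factor) {
    if (h.size() != (size_t)width * height) return false;
    int *d = nullptr;
    size_t bytes = h.size() * sizeof(int);
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 block(16, 16);
        dim3 grid((width + 15) / 16, (height + 15) / 16);
        scale<<<grid, block>>>(d, width, height, factor);
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }
    ok = ok && cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d);
    return ok;
}
```

With x mapped to threadIdx.x, neighbouring threads touch neighbouring addresses within a row, which is the access pattern that tends to coalesce well.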

Other approaches may include:

  1. Do a full deep copy. A fully worked generic example is given in the first answer here:


  2. Use host-mapped memory. This will generally result in pretty slow access to the data.
  3. Use CUDA Unified Memory (refer to the programming guide and the blog article:


  4. If you know the width of your 2D array at compile time, you can create a set of typedefs that will leverage the compiler to help you, and allow you to use essentially 1D allocations and copies, while still retaining doubly-subscripted access to the data on both host and device. A worked example is given in the answer here:


It’s actually a 3D example, but you should be able to reduce it to 2D in a straightforward manner.
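For the 2D case, the typedef idea might look roughly like the sketch below. This is not the code from the linked answer; WIDTH, Row, fill and fill_demo are names assumed for illustration, and WIDTH must be a compile-time constant for this to work.

```cuda
#include <cuda_runtime.h>

constexpr int WIDTH = 4;   // assumed: the row width is known at compile time
typedef int Row[WIDTH];    // one row of the matrix

// The compiler knows sizeof(Row), so m[y][x] works on a single flat allocation.
__global__ void fill(Row *m, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < WIDTH && y < height) m[y][x] = y * 10 + x;
}

// Hypothetical helper: a single cudaMalloc and a single cudaMemcpy, no deep copy.
bool fill_demo(Row *h_m, int height) {
    Row *d_m = nullptr;
    size_t bytes = height * sizeof(Row);
    if (cudaMalloc((void **)&d_m, bytes) != cudaSuccess) return false;

    dim3 block(16, 16);
    dim3 grid((WIDTH + 15) / 16, (height + 15) / 16);
    fill<<<grid, block>>>(d_m, height);

    bool ok = cudaDeviceSynchronize() == cudaSuccess &&
              cudaMemcpy(h_m, d_m, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_m);
    return ok;
}
```

On the host, a matrix declared as Row h_m[3] can be passed directly to fill_demo(h_m, 3) and then read back as h_m[y][x].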

CUDA routines such as cudaMallocPitch and cudaMemcpy2D, despite their names, do not directly handle doubly-subscripted (i.e. double pointer) arrays. You can easily observe this simply by inspecting the parameters to such functions: cudaMemcpy2D, for example, does not expect double-pointer (**) parameters. Therefore they don’t directly facilitate doubly-subscripted access in device code. They have a different purpose, which is related to efficient access of data structures that have a 2D nature to them (i.e. that will be accessed in certain patterns).
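A small sketch of how these two routines are typically used may make the point clearer. The kernel negate_pitched and the helper negate_pitched_demo are hypothetical names; the device pointer is still a single pointer, and the kernel computes each row address by hand from the pitch (in bytes) returned by cudaMallocPitch.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Rows are `pitch` bytes apart on the device, which may exceed width * sizeof(int).
__global__ void negate_pitched(int *d, size_t pitch, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int *row = (int *)((char *)d + y * pitch);  // byte arithmetic to find row y
        row[x] = -row[x];
    }
}

// Hypothetical helper: the host data stays tightly packed (row-major),
// and cudaMemcpy2D converts between the packed and pitched layouts.
bool negate_pitched_demo(std::vector<int> &h, int width, int height) {
    if (h.size() != (size_t)width * height) return false;
    int *d = nullptr;
    size_t pitch = 0;
    size_t row_bytes = width * sizeof(int);
    if (cudaMallocPitch((void **)&d, &pitch, row_bytes, height) != cudaSuccess)
        return false;

    bool ok = cudaMemcpy2D(d, pitch, h.data(), row_bytes, row_bytes, height,
                           cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 block(16, 16);
        dim3 grid((width + 15) / 16, (height + 15) / 16);
        negate_pitched<<<grid, block>>>(d, pitch, width, height);
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }
    ok = ok && cudaMemcpy2D(h.data(), row_bytes, d, pitch, row_bytes, height,
                            cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d);
    return ok;
}
```

Nothing here is doubly subscripted: the padding added by the pitch only serves to align the start of each row.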

Really appreciate your answer! Learnt a lot!