# 2D Matrix operation

Hi guys:

I’m really stuck on how to allocate device memory to store data coming from the CPU. Should I do it the vector way? I’m a beginner in CUDA; is it very weird and rare to create a 2D matrix like A[ROW][COL]? In the examples I’ve seen, they all index the matrix like A[i * ROW + j].

a) from a parallel perspective, how would you access (read/ write) a 2d matrix?

b) from a parallel perspective, do A[ROW][COL] and A[i * ROW + j] differ? if so, how?

i suppose A[row][col] could be interpreted as a loose form of index notation

however, if it is strictly interpreted as what it implies: a number of arrays, n == rows, each of depth col, you may soon run into trouble, particularly when elements within the array wish to ‘wrap around’ from one row to the next; this is rather related to memory coalescing as well

i am not sure whether one would even use a[r][c] for host code, for more or less the same reason
i have heard the argument that a[r][c] caches poorly, compared to flat indexing a[r * cols + c]

There are CUDA routines to create 2D matrices.
They seem to work fine.
Bill

“There are CUDA routines to create 2D matrices”

undisputed; but this equally raises the question: how do such routines allocate the 2d matrix - as a flat 1d array, or as an array of arrays?

If you want to access a doubly-subscripted array on the device (which originated on the host):

int i = global_data[y][x];

Then this will require a somewhat involved copy sequence from host to device. If you google “CUDA 2D array” you’ll get some idea of what is involved. This falls generically in the category of mechanisms requiring a “deep copy” operation, which is essentially a nested copy.
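A hedged sketch of what that nested copy can look like (function and variable names are my own illustration; error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Deep-copy a host float** to the device so kernels can use d_data[y][x].
float **alloc_device_2d(float **h_data, int rows, int cols)
{
    // 1. Allocate each device row separately and copy its contents.
    float **h_rowptrs = (float **)malloc(rows * sizeof(float *));
    for (int r = 0; r < rows; ++r) {
        cudaMalloc(&h_rowptrs[r], cols * sizeof(float));
        cudaMemcpy(h_rowptrs[r], h_data[r], cols * sizeof(float),
                   cudaMemcpyHostToDevice);
    }
    // 2. Copy the array of device row pointers itself to the device.
    float **d_data;
    cudaMalloc(&d_data, rows * sizeof(float *));
    cudaMemcpy(d_data, h_rowptrs, rows * sizeof(float *),
               cudaMemcpyHostToDevice);
    free(h_rowptrs);
    return d_data;
}
```

Note the two levels of allocation and copy; this is exactly the per-row bookkeeping that the flattening approach below avoids.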

Because of the difficulty associated with deep-copy, it’s often suggested that you “linearize” or “flatten” your data so that it can be referenced using a single pointer (*) instead of a double pointer (**). To “simulate” 2D access to such an array, you can then do something like:

int i = global_data[y*width+x];

For beginning programmers, and without knowing anything further about your code, I think this is the most sensible recommendation.
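As a sketch of what flattened access looks like in a kernel (names are illustrative; this is a sketch, not a definitive implementation):

```cuda
// Scale every element of a flattened 2D array using the y*width+x convention.
__global__ void scale2d(float *data, int width, int height, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        data[y * width + x] *= s;   // simulated 2D access via a single pointer
}

// Host side: one cudaMalloc and one cudaMemcpy suffice -- no deep copy.
//   float *d;
//   cudaMalloc(&d, width * height * sizeof(float));
//   cudaMemcpy(d, h, width * height * sizeof(float), cudaMemcpyHostToDevice);
//   dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
//   scale2d<<<grid, block>>>(d, width, height, 2.0f);
```

A side benefit: adjacent threads in x touch adjacent addresses, which is the access pattern memory coalescing favors.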

Other approaches may include:

1. Do a full deep copy. A fully-worked generic example is given in the first answer here:

2. Use host-mapped memory. This will generally result in pretty slow access to the data.
3. Use CUDA Unified Memory (refer to the programming guide and the blog article):

http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/

4. If you know the width of your 2D array at compile time, you can create a set of typedefs that will leverage the compiler to help you, and allow you to use essentially 1D allocations and copies, while still retaining doubly-subscripted access to the data on both host and device. A worked example is given in the answer here:

http://stackoverflow.com/questions/14920931/3d-cuda-kernel-indexing-for-image-filtering

It’s actually a 3D example, but you should be able to reduce it to 2D in a straightforward manner.

CUDA routines such as cudaMallocPitch and cudaMemcpy2D, despite their names, do not directly handle doubly-subscripted (i.e. double-pointer) arrays. You can easily observe this simply by inspecting the parameters to such functions: cudaMemcpy2D does not expect double-pointer (**) parameters. Therefore they don’t directly facilitate doubly-subscripted access in device code. They have a different purpose, which is related to efficient access of data structures that have a 2D nature to them (i.e. that will be accessed in certain patterns).
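For completeness, a hedged sketch of how pitched allocations are typically indexed (identifiers are my own; error checking omitted). The pitch, in bytes, replaces the width in the index arithmetic, and single pointers appear on both sides:

```cuda
#include <cuda_runtime.h>

// Increment every element of a pitched 2D allocation.
__global__ void touch(float *d, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // step down y rows by the byte pitch, then index within the row
        float *row = (float *)((char *)d + y * pitch);
        row[x] += 1.0f;
    }
}

void upload(const float *h, int width, int height)
{
    float *d;
    size_t pitch;
    // each row is padded so rows start on aligned boundaries
    cudaMallocPitch((void **)&d, &pitch, width * sizeof(float), height);
    // note: single pointers throughout -- no ** anywhere
    cudaMemcpy2D(d, pitch, h, width * sizeof(float),
                 width * sizeof(float), height, cudaMemcpyHostToDevice);
}
```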