I wasn’t aware of a cudaMalloc2D(), and I can’t find such a function in the CUDA documentation. There is cudaMallocPitch(); is that what you are referring to? The prototype is:
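cudaError_t cudaMallocPitch(void **devPtr, size_t *pitch, size_t width, size_t height);

Here width is the requested width of each row in bytes, height is the number of rows, and the row stride actually allocated (in bytes) is returned through pitch.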
I would strongly suggest making use of the current documentation rather than outdated documentation from 2007. What you have there looks like the very first Programming Guide from before the CUDA 1.0 release; the current CUDA version is 6.5. Please refer to the current documentation at docs.nvidia.com/cuda.
CUDA ships with many example programs that demonstrate basic concepts such as memory allocation. I would suggest working through them. You may also find it helpful to work with an introductory CUDA book, such as “CUDA by Example”.
If you are a beginner GPU programmer, I would encourage you to “flatten” your 2D array and handle it in 1D fashion, perhaps using subscript arithmetic to simulate 2D access.
That means:
Allocate an ordinary 1D array of N x N elements (as in your example). You can use malloc on the host and cudaMalloc on the device.
Use an ordinary 1D cudaMemcpy to transfer this array to the device.
Access it as an ordinary 1D array on the device. If you want to simulate 2D access, do something like:
int x = global_data[i*num_cols+j];
where i is your row index (the first subscript in a doubly subscripted array) and j is your column index; num_cols is the number of columns in your 2D matrix, which is just N in your case. A complete sketch of these steps is shown below.
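Here is a minimal end-to-end sketch of those steps; the kernel name scale2d, the size N, and the int element type are arbitrary choices for illustration, and error checking is omitted for brevity:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 1024   // number of rows and columns (arbitrary example size)

// Each thread touches one element of the flattened N x N array,
// using i * num_cols + j arithmetic to simulate 2D access.
__global__ void scale2d(int *global_data, int num_rows, int num_cols)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (i < num_rows && j < num_cols) {
        int x = global_data[i * num_cols + j];
        global_data[i * num_cols + j] = 2 * x;
    }
}

int main(void)
{
    size_t bytes = (size_t)N * N * sizeof(int);

    // Step 1: ordinary 1D allocations on host and device.
    int *h_data = (int *)malloc(bytes);
    for (int idx = 0; idx < N * N; idx++) h_data[idx] = idx;
    int *d_data = NULL;
    cudaMalloc((void **)&d_data, bytes);

    // Step 2: ordinary 1D copy, host -> device.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Step 3: access on the device as a flattened 2D array.
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    scale2d<<<grid, block>>>(d_data, N, N);

    // Copy back and spot-check element [1][2] of the "2D" matrix.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("element [1][2] = %d\n", h_data[1 * N + 2]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}

The same i * num_cols + j arithmetic works on the host side as well, so host and device code can share one indexing convention.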