Multi-dimensional arrays


I am really new to CUDA; I just started with the help of the CUDA Programming Guide and toolkit. For matrix multiplication, is it compulsory to declare the matrices in a single dimension? This is the kernel I was declaring:

//kernel declaration

__global__ void matmul(float *m, float *n, float *p, int width)
{
	int tx = threadIdx.x;
	int ty = threadIdx.y;
	int bx = blockIdx.x;
	int by = blockIdx.y;

	int tile_width = blockDim.x;
	int row = by*tile_width + ty;
	int col = bx*tile_width + tx;

	float pvalue = 0;
	for (int k = 0; k < width; k++)
		pvalue += m[row][k] * n[k][col];  // the compiler errors point here
	p[row][col] = pvalue;
}

On compiling, it shows the following errors for the second-last line of the program:

error 1: expression must have pointer-to-object type.
error 2: expression must have integral or enum type.

Please help me. This problem has really screwed me up, and even with my best effort I am not able to find its solution.

Thanks in advance…

Instead of pvalue += m[row][k] * n[k][col]; use

pvalue += m[row*width+k] * n[k*width+col];

and instead of p[row][col] = pvalue; use

p[row*width+col] = pvalue;

m, n and p are (type *) pointers; for the two-subscript form to be legal, they would have to be (type **), arrays of pointers to pointers. This is standard C fare.

Piyush: To elaborate further on this concept, consider the following code sample:



int a[3][4] = { {1, 2, 3, 4},
                {5, 6, 7, 8},
                {9, 0, 1, 6} };

display((int *)a, 3, 4);  /* Display the contents of the 2D matrix a */

void display(int *q, int row, int col)
{
	int i, j;

	for (i = 0; i < row; i++)
		for (j = 0; j < col; j++)
			printf("%d ", *(q + i*col + j));
}



the statement “(q + i*col + j)” fetches the elements of the 2D matrix one by one. You can verify this by assuming any base address, taking i=2, j=3 and calculating (q + i*col + j). It will come out as a[2][3].

Further, you should remember that each * decreases the level of pointerness by one degree. For example, if you declare int *a, then you can use *a. But if you write **a, it will be wrong (the expression has lost all its degrees of pointerness), because *a is already a plain int value, and an int cannot be dereferenced.

However, when you declare int **a and you try to access **a, then it will be fine. An int ** used this way is known as a jagged array, and it should generally be avoided in CUDA.

I don’t understand your reply to my reply. You can read up on this issue nicely in the FFTW documentation.

int a[3][4] = { {1, 2, 3, 4},
                {5, 6, 7, 8},
                {9, 0, 1, 6} };

is a statically declared array, which will have the same memory layout as if you had declared it as int a[3*4]. Your comment

(q + i*col + j). It will come out as a[2][3].

correctly points that out. If however you say int **a; then you have to do the following (in pseudocode):

a = (int**)malloc(rows*sizeof(int*));

for (i = 0; i < rows; i++)

    a[i] = (int*)malloc(cols*sizeof(int));

and then you can e.g. do a[2][3] = 23. With int *p; p = a[2]; you could then say printf("%d", p[3]) and would get 23. p would point to the first element of row 2, as does a[2]. Image processing people do this all the time.

I wrote ctypes Python bindings for CUDA, and for that I looked at the signature of virtually every function in the CUDA API. You will rarely find an argument of the form type** there, and when you do, it will be for a pointer to something array-ish or structure-ish created on the CUDA side. GPU arrays will just be laid out linearly in memory. Whether you use that as 1D, 2D or 3D arrays is up to your algorithm.

If I allocated an int array in CUDA and passed the pointer back to the host as a plain int, I would get the correct value of the pointer as an int, but I assume C would be quite unhappy if I started to use it as a pointer (even though it is one). This is common in the ctypes module of Python, because pointers there are basically integer (32- or 64-bit, depending on your system) addresses into some memory space.

Sorry, when I gave my reply to this I did not pay attention, assuming it was a reply from the original poster.

Thanks a lot, people. Thank you so much.


I have some questions about multi-dimensional arrays in CUDA. First and foremost: is there a way to use them in their logical form at all?

Say I try to use a kernel to add two matrices and store the result in a third. The only way I can make it happen is to declare n*m-long vectors, pass them to the kernel, and work with those.

What I would like is to use them as matrices, i.e. index them in the kernel as matrix[i][j], where i = threadIdx.x and j = threadIdx.y.

I’ve spent hours digging through the documentation; the indexing of threads is freakin’ everywhere, but what’s the use of the (x,y,z) coordinates of threads and blockIdx if all the examples use the arrays as vectors?

So you use dimBlock(n,m) and then vector[i*n+j] instead of matrix[i][j]; in that case running the kernel with <<<1,n*m>>> would be much simpler. So I guess there is a way to work with matrices instead of vectors. Can anyone PLEASE show me source code for that?

In CUDA, such a scheme would limit you to m*n<512, which is tiny, and restrict your code to running on as little as 1/30 of the capacity of your card. I am sure that isn’t what you want.

It is trivial to work directly with a block of linear memory instead and it has a number of technical advantages. I usually do something like this:

#define MUL(a,b) __mul24(a,b)

#define GRIDOFFSETD(xpos,ypos,xdim) ((xpos) + MUL((ypos),(xdim)))

__global__ void mykernel(float *myarray, const int Sx)
{
	int gidx  = MUL(blockIdx.x,blockDim.x) + threadIdx.x;
	int gidy  = MUL(blockIdx.y,blockDim.y) + threadIdx.y;
	int goffs = GRIDOFFSETD(gidx,gidy,Sx);
}

Then gidx and gidy are the equivalent (i,j) coordinates and myarray[goffs] is the equivalent of array[i][j] for a column major ordered array. Row major ordering only requires a slightly different declaration of GRIDOFFSETD.

Okay, it seems my question wasn’t explicit enough. My main problem is that I work with two-dimensional arrays in the C++ host code, and I want to speed up the processing with CUDA. Is there a way to use the arrays as they are on CUDA, instead of this column-major order? It just doesn’t work for me. I heard there is a way to address two-dimensional arrays in C with only one coordinate, but it is just not working for me.

Here’s a little source which I tried; not surprisingly, it does not work.

int** Am;
int** Bm;
int** Cm;
int size = m*n*sizeof(int);
Cuda cu;

int *d_A;
int *d_B;
int *d_C;

dim3 dimGrid(1,1);
dim3 dimBlock(n,m);

//and the kernel

__global__ static void VecAdd(int* A, int* B, int* C)
{
	int i = threadIdx.x;
	int j = threadIdx.y;
	C[i*5+j] = A[i*5+j] + B[i*5+j];
}

It may not compile as it is, because I just copied parts of the code together.

I would like to use a kernel which looks something like this:

__global__ static void VecAdd(int** A, int** B, int** C)
{
	int i = threadIdx.x;
	int j = threadIdx.y;
	C[i][j] = A[i][j] + B[i][j];
}

The simple answer is no. You can’t directly copy a host two dimensional array to the device. If you want row major ordering with 1D memory, just change the macro I posted earlier to this:

#define GRIDOFFSETD(xpos,ypos,ydim) ((ypos) + MUL((xpos),(ydim)))

and you can use the same C style storage ordering on device and host.

You can’t do that either (even in host C/C++). Dynamically allocated two-dimensional arrays in C/C++ are really a one-dimensional array of pointers. There is no guarantee that the storage used for successive rows is contiguous in memory. And that is also why you can’t directly copy them to the device: they contain host pointers, which cannot be used in device memory.

You can have a kernel that does that. However:

- Such a kernel could only ever work on an array with up to 512 elements total, i.e. i*j < 512.

- If the array size is MxN, you will have to do N+1 individual cudaMallocs and N individual cudaMemcpys, and then run an initialization kernel to set everything up correctly.

- This kernel will be a lot slower than the same operation written with linear 1D memory, because there is a second level of pointer resolution, which requires an additional global memory read per data point.

OK, I got the point. I have one more question; it’s not CUDA-specific. Do you have an easy way to convert an n*m matrix to an n*m-long vector? And even more importantly, can I concatenate vectors so as to malloc them as one on the device, so I can perform the same operation on lots of smaller matrices at once? (The main question here: if I have vectors, do I have to concatenate them on the host and upload them as one vector, or can I copy them one after another into one big allocated space on the device?)

You mean something like this?

#include <cstring>

int main(void)
{
	const size_t M=8, N=8;
	size_t size = M*N;
	float **array, *vector;

	// Setup 2D memory
	array = new float*[M];
	for(size_t i=0; i<M; i++)
		array[i] = new float[N];

	// Some initialization code goes here

	// Pack the array row wise into 1D memory
	vector = new float[size];
	for(size_t i=0; i<M; i++)
		memcpy(vector + i*N, array[i], N*sizeof(float));

	// Some cuda operations go here

	// Cleanup
	for(size_t i=0; i<M; i++)
		delete[] array[i];
	delete[] array;
	delete[] vector;

	return 0;
}
You can do all of that, and it probably should be obvious how to do it from what has been posted in this thread. If you are working with other types of C++ storage (like STL containers and the like), you might be interested in the extremely useful thrust library.

Huge THANKS for the help!! :rolleyes:

On the other hand, a new question just popped into my mind. While I was trying to work with matrices, I found functions like cudaMallocArray() and cudaMemcpy2DToArray(), a special type called cudaArray, and so on. I’m wondering: if there’s no use for these arrays because of the frequent memory access (don’t get me wrong, I’m not doubting what you said; memory operations are the slowest things on the card, that much even I know), then what is the use of these functions and types? Or is it just so utterly beyond me that it’s better for me not to mess with them?

Those CUDA array functions are for allocating and manipulating memory for textures. They won’t help you in this case.

Okay, Thx again!


I’m back again, with more questions :)

Now I have a problem with another quite trivial matter: incrementing a variable in a kernel. I really can’t seem to find a way to increment a variable and then pass it back to the host code. I’ve written a little test code for this, but I have no clue what the problem could be.

[codebox]__global__ static void testKernel(int d_test){
	d_test++;
}

int test(){
	int test=0;
	int d_test;
	size_t size=sizeof(int);

	return test;
}[/codebox]

That won’t work in standard C/C++, let alone in CUDA. In C/C++, if you want to modify an argument, you must pass it by pointer or reference. It sounds like you might find it beneficial to read some introductory material on C programming before you dive much further into CUDA. If you don’t understand how the language works, you will probably find it near impossible to understand some of the peculiarities and requirements CUDA superimposes on top of standard C (and a limited number of C99/C++ features).

I know that it wouldn’t return the value in normal C/C++; I thought that copying the variable back from the device would bring its value along. Anyway, I’ve tried to use a pointer type, but I had some compiling errors. I guess it was too late last night and I screwed something up.


Okay, I’ve tried everything that came to my mind. I admit I do hate passing pointers; the good old way of returning a value used to work for me, but I realize it’s not possible here. Hell, it took me time the first time around to get my head around pointers, and I thought I knew them already, but CUDA makes me feel like I am back at the school desk learning programming; nothing works the way that seems logical to me. And in this case, nothing works at all…

For example, do I have to allocate memory on the device if I use a pointer from the host? I guess I have to malloc another pointer on the device, but that kind of defeats the whole purpose of using a pointer, does it not? And do I have to copy the data back and forth if it’s a pointer? I have no clue about these things…

I’ve found some tutorial-like stuff on the net; this is what it suggests, or at least what I think is logical. It compiles, but it still does not increment the variable.

[codebox]int *test=0;

__global__ static void testKernel(int* d_test){
	(*d_test)++;
}

size_t size=sizeof(int);

int *h_test;

return *h_test;[/codebox]
avid@cuda:~$ cat

#include <iostream>

#include "cuda_runtime.h"

__global__ static void testKernel(int *d_test){
	*d_test=1; // This must be a pointer
}

int main(){
	int test=0;
	int * d_test; // must be a pointer
	size_t size=sizeof(int);

	cudaMalloc((void **)&d_test, size); // allocate device memory for the result
	cudaMemcpy(d_test,&test,size,cudaMemcpyHostToDevice); // copy the initial value over

	testKernel<<<1,1>>>(d_test); // both block and grid size must be non-zero
	cudaMemcpy(&test,d_test,size,cudaMemcpyDeviceToHost); // pass the device pointer here
	std::cout<<test<<std::endl; // must use the host value here
	cudaFree(d_test); // pass the device pointer here
	return test;
}


avid@cuda:~$ /opt/cuda/bin/nvcc -o junk.exe 

avid@cuda:~$ LD_LIBRARY_PATH=/opt/cuda/lib64 ./junk.exe 


[quote name=‘avidday’ post=‘963551’ date=‘Dec 12 2009, 02:03 PM’]

[codebox]__global__ static void evaluatorKernel(int* d_vec, float* score)
{
	int i = blockIdx.y*2048 + blockIdx.x*512 + threadIdx.x;

	if(d_vec[i] == 0 && i%2 == 0)
		/* ... */;
	if(d_vec[i] == 1 && i%2 != 0)
		/* ... */;
}[/codebox]

I guess this will have something to do with the local memory and registers kept for the blocks and grids, which is totally out of my league :S