2D Array Not Updated

Hi Everyone,

I’m just trying to get a feel for transferring/copying arrays to and from the GPU, so I am starting with small, square matrices (e.g. 2x2, 100x100). However, the real data I have will be jagged arrays with rows of varying lengths.

I implemented a __global__ kernel, call(), that takes a 1D array and doubles the values inside it. Like so:

[codebox]///// Kernel.cu /////

__global__ void call(int *src)
{
    for (int i = 0; i < 3; i++)
    {
        src[i] *= 2;
    }
}

extern "C" void thisFun()
{
    int numElements = 3;
    int memSize = numElements * sizeof(int);

    int *h_a, *d_a;
    h_a = (int *)malloc(memSize);
    h_a[0] = 1;
    h_a[1] = 2;
    h_a[2] = 3;

    cudaMalloc((void **)&d_a, memSize);
    cudaMemcpy(d_a, h_a, memSize, cudaMemcpyHostToDevice);

    call<<<1,1>>>(d_a);
    cudaThreadSynchronize();

    cudaMemcpy(h_a, d_a, memSize, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    free(h_a);
}[/codebox]

Now realize, this code works just fine. However, when I go from this to accept 2D arrays, the code runs, but the values are not updated (i.e. it tells me the initial values of the host array rather than the doubled values).

Here’s what I have:

[codebox]///// Kernel.cu /////

__global__ void call(int **src)
{
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 3; j++)
            src[i][j] *= 2;
}

extern "C" void thisFun()
{
    int numRows = 2;
    int numCols = 3;
    int numElements = 6;
    int memSize = numElements * sizeof(int);

    int **h_a, **d_a;
    h_a = (int **)malloc(numRows * sizeof(int *));
    h_a[0] = (int *)malloc(numCols * sizeof(int));
    h_a[1] = (int *)malloc(numCols * sizeof(int));
    h_a[0][0] = 1;
    h_a[0][1] = 2;
    h_a[0][2] = 3;
    h_a[1][0] = 4;
    h_a[1][1] = 5;
    h_a[1][2] = 6;

    cudaMalloc((void **)&d_a, memSize);
    cudaMemcpy(d_a, h_a, memSize, cudaMemcpyHostToDevice);

    call<<<1,1>>>(d_a);
    cudaThreadSynchronize();

    cudaMemcpy(h_a, d_a, memSize, cudaMemcpyDeviceToHost);
}[/codebox]

So, this compiles and runs, but it doesn’t give me the updated values. This is “basically” the same thing as the 1D version, but I don’t know why it doesn’t update the values. I do the cudaMemcpy back as I think it should be done.

Thanks,

Daniel

When you copy a 2 dimensional array you need to copy the indexed pointers individually, instead of copying the entire 2 dimensional array in one swoop. So basically something like the following.

for (unsigned int i = 0; i < width; i++)
{
    // type can be anything, such as int or float, etc...
    cudaMemcpy(deviceMemory[i], hostMemory[i], sizeof(type) * height, cudaMemcpyHostToDevice);
    if (cudaGetLastError() != cudaSuccess)
    {
        return false;
    }
}

I use this exact code in part of my 2 dimensional class for allocating and transferring to/from the host. This basically copies the data behind each row pointer, one pointer at a time, from the host to the device, and vice versa if you wanted to.

When you start getting into 2D arrays in CUDA, stuff starts to get a little more complicated, but it’s doable; just sit down and think about what it means to be a pointer, and what it means to copy a pointer to a pointer.

[quote name=‘scwizzo’ post=‘1048344’ date=‘Apr 30 2010, 08:22 PM’]

When you copy a 2 dimensional array you need to copy the indexed pointers individually, instead of copying the entire 2 dimensional array in one swoop. So basically something like the following.

[codebox]///// All inside Kernel.cu /////

// Take a 2D array and change each value to 42, to verify modification has succeeded
__global__ void call(int **a)
{
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < 3; j++)
        {
            a[i][j] = 42;
        }
    }
}

int main()
{
    // define the 2D matrix limits
    int numRows = 3;
    int numCols = 3;

    // Allocate for the 2D matrix
    int **myArray;
    myArray = (int **)malloc(numRows * sizeof(int *));
    for (int i = 0; i < numRows; i++)
        myArray[i] = (int *)malloc(numCols * sizeof(int));

    // give the matrix some unique initial values
    for (int i = 0; i < numRows; i++)
    {
        for (int j = 0; j < numCols; j++)
        {
            myArray[i][j] = i * 3 + j;
        }
    }

    // show the contents before the kernel call
    printf("myArray before:\n");
    for (int i = 0; i < numRows; i++)
    {
        for (int j = 0; j < numCols; j++)
        {
            printf("%i ", myArray[i][j]);
        }
        printf("\n");
    }

    // total memory required for the 2D matrix
    int memSize = numRows * numCols * sizeof(int);

    // allocate for the 2D matrix on the device
    int **d_a;
    cudaMalloc((void **)&d_a, memSize);

    // invoke the kernel
    call<<<1,1>>>(d_a);

    // wait for kernel to finish
    cudaThreadSynchronize();

    // scwizzo's suggestion for copying over each row of memory one at a time instead of all at one time
    for (int i = 0; i < numRows; i++)
    {
        cudaMemcpy(myArray[i], d_a[i], numCols * sizeof(int), cudaMemcpyDeviceToHost);
        if (cudaGetLastError() != cudaSuccess)
            printf("error\n");
    }

    // show contents after the kernel call
    printf("myArray after:\n");
    for (int i = 0; i < numRows; i++)
    {
        for (int j = 0; j < numCols; j++)
        {
            printf("%i ", myArray[i][j]);
        }
        printf("\n");
    }

    // free memory
    cudaFree(d_a);
    for (int i = 0; i < numRows; i++)
        free(myArray[i]);
    free(myArray);

    return 0;
}[/codebox]

…the problem right now is that the program crashes/terminates during the part where scwizzo suggests copying over one row at a time. I am confused about needing to do this, since I have read that multidimensional arrays in CUDA, much like in C, are contiguous in memory. So I think we should be able to copy it all over at once, but I might be misinformed.

Are there any ideas?

Thanks,

Daniel

EDIT: I really must be able to copy things over all at once. I am going to be copying large matrices onto the GPU for computation, and I am told that several small copies are worse than one large copy. As such, I feel like there should be a way of doing this. Some people have suggested flattening the matrix, but I am not sure how to index it properly once that happens. I am trying to keep things ‘mostly’ nice and easy, but I will do what gets it done right.

Maybe I don’t understand how 2D array allocation works on a GPU. Are there any good SDK examples with the transfer of 2D matrices from the host to the device and vice versa?

Any help would be greatly appreciated.

I just noticed something in your code

int **h_a, **d_a; // <---- d_a is a double pointer, or 2 dimensional

h_a = (int **)malloc(numRows * sizeof(int *));

// ... extra junk here that i dont care about

cudaMalloc((void **)&d_a,memSize); // <---- cudaMalloc takes a 2 dimensional, or double pointer, as its first arg, but you're making it 3 dimensional by adding that &

cudaMemcpy(d_a,h_a,memSize,cudaMemcpyHostToDevice);

Anywho, that probably won’t fix your problem. Back on track, the portion of code I posted was an example, and not a direct copy/paste snippet. Yes, it is faster to copy one large chunk of data versus copying lots of small chunks. There is the cudaMallocPitch() function, however I think it’s more trouble than it’s worth since you have a pitch to worry about, especially in kernels, so it’s easier to flatten an array. Anywho…

To “flatten” a 2D array, you index it by the following, which is just an example. I wrote this in the reply window so if it doesn’t compile that’s why :)

int *flat_array;

int width = 100, height = 200;

cudaMalloc((void **)&flat_array, width * height * sizeof(int));

// flat_array is then accessed 2 dimensionally as follows, with

// x and y being the same as if you said array2d[y][x], but for

// a single dimension it will be flat_array[y*width + x]

I think I see what you are saying. The problem is that I am just trying to get a simple 2D matrix to copy on the GPU. I can’t believe that I can find NO examples of someone actually doing this correctly. I would kill for a working example.

The thing is, however, that my goal is to implement the copying of a 2D jagged array into global memory on the GPU. For a small data set, I will have a 10-row matrix with rows of varying lengths… anywhere from 58 to 1211 elements wide.

So, do you see my problem? I know that if I sum up all the elements and multiply by the sizeof(element_type), then that’s the total memory required on the GPU. I just don’t know what WORKING cuda functions can do this for me such that I can still access the 2D jagged array in traditional C/C++ style (i.e. mat[i][j])… and not having to use a stupid pitch.

Which, by the way, doesn’t work correctly. I can set each element in the matrix, on the GPU, to some scalar integer value, but when I use cuPrintf to even print out what’s inside the element, I get garbage values.

2D matrices, come on. Someone has to have done this by now?

Thanks for the input, though. I’m glad someone pays attention to these posts!

And this is the crux of your difficulties. You aren’t trying to get a “simple 2D matrix” to copy. You want to get an array of pointers in host memory to “deep” copy to device memory. And “deep” copying would require some automagical pointer translation and memory allocation functionality. And that doesn’t exist.

You will have to code it yourself, and it involves several steps:

[list=1]

[*]Allocate a fresh “row” array of pointers in host memory

[*]Recursively allocate and copy each “column” array from the host source data to device memory

[*]Assign those device pointers to the fresh “row” array of pointers

[*]Allocate another row array of pointers on the GPU

[*]Copy the host array of device row pointers to the device row-pointer array