CUDA 2D Array Problem: Need help manipulating 2D arrays in CUDA

Hi all,

I’m a reasonably experienced programmer, but I’m new to both CUDA and C and am unsurprisingly having a few issues.

Scenario:

    [*]Create an 11 by 11 2D array of type int in C

    [*]Initialise the array to mainly 0s with some 1s

    [*]Using CUDA, add 1 to each element in the 2D array

    [*]Print the array to see if it has changed

I have written the code and it compiles and runs fine, but it doesn’t behave as expected. The array is identical before and after the kernel call: nothing gets incremented, and I’m not sure the array is even being copied to and from the device.

The programming guide suggests using cudaMallocPitch and cudaMemcpy2D to allocate memory on the device and copy it to and from device memory, which is what I’ve done. Hopefully the code below is commented well enough to follow what I’m trying to do. As you can see, I create a pointer ptrHostResult that points to the start of the 2D array automaton; my thinking is that when I call cudaMemcpy2D to copy the result back to the host, the array from device memory will overwrite the one in host memory. However, it isn’t working.

I have even tried creating another 2D array of all 1s and copying it to device memory and back to host memory, but it doesn’t overwrite the original array.

Any help is greatly appreciated :)

#define GRIDWIDTH 11
#define GRIDHEIGHT 11

#include <stdio.h>
#include <stdlib.h>

/*
 *	CUDA Kernel
 */
__global__ void incrementCell(int* ptrDevA, int* ptrDevB, int pitchA, int pitchB){
	int x = threadIdx.x;
	int y = threadIdx.y;
	int temp = ptrDevA[y * pitchA + x];
	ptrDevB[y * pitchB + x] = temp + 1;
}

/*
 *	Host Code
 */
int automaton[GRIDHEIGHT][GRIDWIDTH];

void printAutomaton(void);
void initAutomaton(void);

int main(){
	/*
	 *	CUDA Stuff
	 */

	// 1 block with an 11 * 11 grid of threads
	int numBlocks = 1;
	dim3 threadsPerBlock(GRIDWIDTH, GRIDHEIGHT);

	// Allocate and initialise memory on host
	initAutomaton();
	const int* ptrHost;
	ptrHost = &automaton[0][0];
	// Print initial automaton
	printAutomaton();

	// Create pointer to store result
	int* ptrHostResult;
	//ptrHostResult = (int *)malloc((GRIDWIDTH * sizeof(int))*GRIDHEIGHT);
	ptrHostResult = &automaton[0][0];

	// Allocate memory on device, A holds original, B will hold result
	int* ptrDevA;
	size_t pitchA;
	cudaMallocPitch((void**)&ptrDevA, &pitchA, GRIDWIDTH * sizeof(int), GRIDHEIGHT);
	int* ptrDevB;
	size_t pitchB;
	cudaMallocPitch((void**)&ptrDevB, &pitchB, GRIDWIDTH * sizeof(int), GRIDHEIGHT);

	// Copy host memory to device memory
	cudaMemcpy2D(ptrDevA, pitchA, ptrHost, sizeof(int), GRIDWIDTH * sizeof(int),
				GRIDHEIGHT, cudaMemcpyHostToDevice);

	// Execute the kernel
	incrementCell<<<numBlocks, threadsPerBlock>>>(ptrDevA, ptrDevB, pitchA, pitchB);

	// Copy result from device memory to host
	cudaMemcpy2D(ptrHostResult, sizeof(int), ptrDevB, pitchB, pitchB * GRIDWIDTH, GRIDHEIGHT,
				cudaMemcpyDeviceToHost);

	// Print returned automaton
	printAutomaton();
	exit(0);
}

void initAutomaton(void){
	// Set all cells to 0
	for(int y = 0; y < GRIDHEIGHT; y++){
		/* Loop through each cell */
		for(int x = 0; x < GRIDWIDTH; x++){
			automaton[y][x] = 0;
		}
	}
	// Set line in middle to 1
	automaton[5][3] = 1;
	automaton[5][4] = 1;
	automaton[5][5] = 1;
	automaton[5][6] = 1;
	automaton[5][7] = 1;
}

void printAutomaton(void){
	/* Loop through each row */
	for(int y = 0; y < GRIDHEIGHT; y++){
		/* Loop through each cell */
		for(int x = 0; x < GRIDWIDTH; x++){
			printf("%d ", automaton[y][x]);
		}
		printf("\n");
	}
	printf("\n\n");
}

The pitch is the width in bytes of a row of the device array, so you have to do the row arithmetic on byte (char*) pointers. See e.g. p. 20 of the CUDA 3.2 Programming Guide, which has this code fragment:

__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

Also, you are not checking the return values from cudaMemcpy2D() - make sure to do so, in case there is a problem with the parameters.

Thanks for the hint about the pitch; I wasn’t entirely sure about that. I have done what you said and checked for errors, and I’ve implemented a smaller, simpler example that just tries to copy memory into and out of device memory. I get no error when allocating the memory, but I get an “invalid argument” error every time I call cudaMemcpy2D. I’ve experimented with various things trying to get the arguments correct, but I can’t seem to get it to work.

I’ve gone through each argument one by one; for example, on the second cudaMemcpy2D I cast each argument to the expected type, so it should be correct. I know this isn’t strictly a CUDA question, but am I right in thinking that an array’s name used in an expression yields a const pointer of that type - in this example a const int*?

Again, any help on narrowing the error down to the offending argument, and on how to resolve it, is greatly appreciated.

#include "declarations.h"

// Function Declarations
void printAutomatonMem(void);
void initAutomatonMem(void);

// Variable Declarations
int resultMem[GRIDHEIGHT][GRIDWIDTH];
int automaton2Mem[GRIDHEIGHT][GRIDWIDTH];
int automatonMem[GRIDHEIGHT][GRIDWIDTH];

void memCpyTest(){
	// Allocate and initialise memory on host
	initAutomatonMem();
	// Print initial automaton
	printAutomatonMem();

	// Allocate memory on device, A holds automaton, B will hold result
	int* ptrDevA;
	size_t pitchA;
	cudaError_t errMemAllA = cudaMallocPitch(&ptrDevA, &pitchA, GRIDWIDTH * sizeof(int), GRIDHEIGHT);
	int* ptrDevB;
	size_t pitchB;
	cudaError_t errMemAllB = cudaMallocPitch(&ptrDevB, &pitchB, GRIDWIDTH * sizeof(int), GRIDHEIGHT);

	// Copy host memory to device memory
	cudaError_t errMemDevA = cudaMemcpy2D(ptrDevA, pitchA, automatonMem, sizeof(int), GRIDWIDTH * sizeof(int),
				GRIDHEIGHT, cudaMemcpyHostToDevice);
	// Copy automaton2 to B to see if it is copied back and overwrites automaton A
	int width = GRIDWIDTH;
	int height = GRIDHEIGHT;
	cudaError_t errMemDevB = cudaMemcpy2D((void *)ptrDevB, (size_t)pitchB, (const void *)automaton2Mem, sizeof(int) * (size_t)width,
				(size_t)width * sizeof(int), (size_t)height, cudaMemcpyHostToDevice);
	// Copy result from device memory to host
	cudaError_t errDevMem = cudaMemcpy2D(resultMem, sizeof(int) * GRIDWIDTH, ptrDevB, pitchB,
				pitchB * GRIDWIDTH, GRIDHEIGHT, cudaMemcpyDeviceToHost);

	// Print returned automaton
	printAutomatonMem();
	printf("\n");
	printf("Memory Allocation A: %s\n", cudaGetErrorString(errMemAllA));
	printf("Memory Allocation B: %s\n", cudaGetErrorString(errMemAllB));
	printf("Memory to Device A: %s\n", cudaGetErrorString(errMemDevA));
	printf("Memory to Device B: %s\n", cudaGetErrorString(errMemDevB));
	printf("Device to Memory: %s\n", cudaGetErrorString(errDevMem));
	system("pause");
}

void initAutomatonMem(void){
	int numRows, numCols;
	numRows = GRIDHEIGHT;
	numCols = GRIDWIDTH;
	/* Initialize automaton to be filled with 0's. */
	for(int y = 0; y < numRows; y++)
		for(int x = 0; x < numCols; x++)
			automatonMem[x][y] = 0;
	/* Initialize automaton2 to be filled with 1's. */
	for(int y = 0; y < numRows; y++)
		for(int x = 0; x < numCols; x++)
			automaton2Mem[x][y] = 1;
	// Set line in middle to 1
	automatonMem[5][3] = 1;
	automatonMem[5][4] = 1;
	automatonMem[5][5] = 1;
	automatonMem[5][6] = 1;
	automatonMem[5][7] = 1;
}

void printAutomatonMem(void){
	printf("===========AUTOMATON============\n");
	/* Loop through each row */
	for(int y = 0; y < GRIDHEIGHT; y++){
		/* Loop through each cell */
		for(int x = 0; x < GRIDWIDTH; x++){
			printf("%d ", automatonMem[y][x]);
		}
		printf("\n");
	}
	printf("\n\n");

	printf("============AUTOMATON2============\n");
	for(int y = 0; y < GRIDHEIGHT; y++){
		for(int x = 0; x < GRIDWIDTH; x++){
			printf("%d ", automaton2Mem[y][x]);
		}
		printf("\n");
	}
	printf("\n\n");

	printf("============RESULT============\n");
	for(int y = 0; y < GRIDHEIGHT; y++){
		for(int x = 0; x < GRIDWIDTH; x++){
			printf("%d ", resultMem[y][x]);
		}
		printf("\n");
	}
	printf("\n\n");
}

This invocation of cudaMemcpy2D specifies a source pitch (fourth parameter) of sizeof(int). Both pitches must be greater than or equal to the width in bytes of the memcpy.

Thanks again for the reply, I’ve got it working now! The code is below in case anyone else finds this thread.

Just to clarify the meaning of the arguments, so I’m sure I understand.

In the case of copying from host to device:

Destination pitch (2nd argument): this is the value passed back by cudaMallocPitch, and is the width in bytes of a row in device memory. As such, this value should be used when accessing the array on the device.

Source pitch (4th argument): this is the width in bytes of a row in host memory, which is simply sizeof(type) * width of the array.

Width of matrix transfer (5th argument): this is the width in bytes of the data copied per row, which in this case is the same as the source pitch.

From what I have read, the reason cudaMallocPitch returns the pitch value is that it may pad each row of the array in device memory for performance reasons, so there may be some extra bytes at the end of each row. That is why the device pitch must be used when accessing the array in device memory.

In the case of copying from device to host, the source and destination pitches still mean the same thing. However, even if the array has been padded in device memory, the 5th argument (the width of the transfer in bytes) is still width of array * sizeof(type).

Is all of the above correct? If someone could confirm these points I would be extremely grateful, as it would help me understand what is actually happening rather than just hacking at it until it works.

// Allocate memory on device, A holds automaton, B will hold result
int* ptrDevA;
size_t pitchA;
cudaError_t errMemAllA = cudaMallocPitch(&ptrDevA, &pitchA, GRIDWIDTH * sizeof(int), GRIDHEIGHT);
int* ptrDevB;
size_t pitchB;
cudaError_t errMemAllB = cudaMallocPitch(&ptrDevB, &pitchB, GRIDWIDTH * sizeof(int), GRIDHEIGHT);

// Copy host memory to device memory
cudaError_t errMemDevA = cudaMemcpy2D(ptrDevA, pitchA, automatonMem, sizeof(int) * GRIDWIDTH, GRIDWIDTH * sizeof(int), GRIDHEIGHT, cudaMemcpyHostToDevice);
// Copy automaton2 to B to see if it is copied back and overwrites automaton A
int width = GRIDWIDTH;
int height = GRIDHEIGHT;
cudaError_t errMemDevB = cudaMemcpy2D(ptrDevB, pitchB, automaton2Mem, sizeof(int) * GRIDWIDTH, GRIDWIDTH * sizeof(int), GRIDHEIGHT, cudaMemcpyHostToDevice);
// Copy result from device memory to host
cudaError_t errDevMem = cudaMemcpy2D(resultMem, sizeof(int) * GRIDWIDTH, ptrDevB, pitchB, sizeof(int) * GRIDWIDTH, GRIDHEIGHT, cudaMemcpyDeviceToHost);