First CUDA program... looks good, executing wrong?

Alright, I don’t even know what to make of this anymore. I’ve simplified this down as much as I can and I still can’t figure it out. It’s a simple addition of two arrays: the first element returned is the correct sum of the two values, but the second through n-th elements are random-looking values, and I’m not sure where they’re coming from. I’m sure this is one of those “duh” moments, but forgive me, I’m completely new to CUDA and parallel programming in general. Does anything jump out at anyone about this?

/*****************************************
			MATRIX BUILDER
*****************************************/

void buildMatrix(float* matrixElements){
	for(uint i = 0; i < 100; i++){
		// generate a random integer value n { 0 <= n <= 9 }
		matrixElements[ i ] = rand() % 10;
	}
}

/*****************************************
			CUDA MATRIX ADDER
*****************************************/

__global__ void addKernel(float* A, float* B, float* C) {
	C[threadIdx.x] = A[threadIdx.x] + B[threadIdx.x];
}

/******************************************
			MAIN PROGRAM
******************************************/

int main()
{
	srand( time (NULL) );

	// CREATE A
	float elementsA[100];
	float elementsB[100];
	float elementsC[100];

	// FILL WITH RANDOM ELEMENTS
	buildMatrix(elementsA);
	buildMatrix(elementsB);

	// ALLOCATE THE ELEMENTS TO THE DEVICE
	float* deviceElA;
	float* deviceElB;
	float* deviceElC;

	cudaMalloc((void**) &deviceElA, sizeof(elementsA));
	cudaMalloc((void**) &deviceElB, sizeof(elementsB));
	cudaMalloc((void**) &deviceElC, sizeof(elementsC) * sizeof(float));

	cudaMemcpy(deviceElA, elementsA, sizeof(elementsA), cudaMemcpyHostToDevice);
	cudaMemcpy(deviceElB, elementsB, sizeof(elementsB), cudaMemcpyHostToDevice);

	// DISPATCH TO THE KERNEL
	addKernel <<<1, 100>>>(deviceElA, deviceElB, deviceElC);

	// COPY THE VALUES BACK
	cudaMemcpy(elementsC, deviceElC, sizeof(deviceElC), cudaMemcpyDeviceToHost);

	// ITERATE THROUGH THE RESULTS
	for(int i = 0; i < 100; i++){
		cout<<i<<":  "<<elementsA[i]<<" + "<<elementsB[i]<<" = "<<elementsC[i]<<endl<<endl;
		cout.flush();
	}

	cudaFree(deviceElA);
	cudaFree(deviceElB);
	cudaFree(deviceElC);

	return 0;
}

The whole thing compiles without complaint, and executes without error, except for returning garbage values, example below:

0:  2 + 5 = 7

1:  1 + 6 = 3.27853e-39

2:  1 + 2 = -1.55064

3:  7 + 8 = 3.47529e-39

4:  7 + 2 = 7.00649e-45

5:  8 + 4 = -2.91408e-05

6:  2 + 3 = 0

7:  9 + 0 = 0

8:  9 + 4 = 1.4013e-45

9:  4 + 7 = -2.91491e-05

.... (cut out the rest for sake of it all being the same)

Many thanks to anyone who can slap me across the back of the head and point out what I’m doing wrong… because I’m pretty confused and frustrated…

cudaMemcpy(elementsC, deviceElC, sizeof(deviceElC), cudaMemcpyDeviceToHost);

Here, you’re calculating the size of a pointer to memory, and not the size of the array it points to.

It’s better to calculate the size of an array as numberOfElements * sizeof(element).

N.
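For what it's worth, here is a minimal host-only sketch of the difference, assuming nothing beyond standard C++ (no CUDA calls are made). `bytesFor`, `Sizes` and `measure` are names made up for illustration; `elementsC` and `deviceElC` mirror the variables in the question, with the device pointer replaced by an ordinary host pointer.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t N = 100;  // element count, as in the question

// Hypothetical helper: size in bytes of `count` floats.
std::size_t bytesFor(std::size_t count) {
    return count * sizeof(float);
}

// The three candidate sizes, side by side.
struct Sizes {
    std::size_t ofArray;    // sizeof applied to the array itself
    std::size_t ofPointer;  // sizeof applied to a pointer to the data
    std::size_t fromCount;  // numberOfElements * sizeof(element)
};

Sizes measure() {
    float elementsC[N];            // real array: sizeof sees all N elements
    float* deviceElC = elementsC;  // host stand-in for the device pointer
    Sizes s = { sizeof(elementsC), sizeof(deviceElC), bytesFor(N) };
    // For a true array, sizeof agrees with count * sizeof(element).
    assert(s.ofArray == s.fromCount);
    return s;
}
```

With the byte count taken from the element count, the copy back in the question would presumably become `cudaMemcpy(elementsC, deviceElC, 100 * sizeof(float), cudaMemcpyDeviceToHost);`. Using `sizeof(elementsC)` would also work there, since `elementsC` is a real array rather than a pointer.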

got it!

sizeof(deviceElC) = 4 (the size of the pointer on a 32-bit build), so you are copying just 4 bytes, i.e. a single float

deviceElC is a pointer, not an array
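To see why only the first element came back right, the short copy can be mimicked on the host with `std::memcpy`. This is only a sketch: `copyBack` and `simulate` are hypothetical names, plain host arrays stand in for device memory, and the pointer is 4 bytes only on a 32-bit build (on a 64-bit build it is typically 8, so two floats would survive).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the device-to-host cudaMemcpy: copies `bytes`
// bytes from src to dst. `capacity` is the number of floats dst can hold.
void copyBack(float* dst, const float* src, std::size_t capacity, std::size_t bytes) {
    assert(bytes <= capacity * sizeof(float));  // never write past dst
    std::memcpy(dst, src, bytes);
}

// Runs one simulated copy and returns how many elements of the host array
// ended up holding the computed results.
std::size_t simulate(bool useBuggySize) {
    const std::size_t n = 100;
    float results[n];  // plays the role of the memory deviceElC points to
    float host[n];     // plays the role of elementsC
    for (std::size_t i = 0; i < n; ++i) {
        results[i] = static_cast<float>(i + 1);  // pretend kernel output
        host[i] = -1.0f;                         // sentinel: "never written"
    }

    float* devicePtr = results;
    // Buggy: size of the pointer. Fixed: element count times element size.
    const std::size_t bytes = useBuggySize ? sizeof(devicePtr) : n * sizeof(float);
    copyBack(host, devicePtr, n, bytes);

    std::size_t copied = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (host[i] == results[i]) ++copied;
    }
    return copied;
}
```

Calling `simulate(true)` reports `sizeof(float*) / sizeof(float)` elements copied, while `simulate(false)` reports all 100. In the real program the untouched elements of `elementsC` were uninitialized stack memory rather than a sentinel, which would explain the garbage values in the output.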

Oh ffs… I knew this was going to be pointer related… I’m so abhorrently bad with pointer management it’s not funny. -_-;

Thanks, sorry for the dumb question. lol