MyFirstCuda

Hi. I'm an amateur CUDA programmer. I wrote a program that adds two vectors, but it doesn't work correctly and gives me an incorrect result. I have uploaded my file.
I'm using Visual Studio 2008. I don't have a GPU, so I'm using the DebugSimulator. Please tell me where the problem is. I also have trouble with the concept of blocks and grids and why they matter. Please help me. Thanks :-)
sample.cu (2.57 KB)

Please answer me :unsure:

You are launching the kernel on a grid with 1 block of size 1x1, so you will get only the sum of the first elements.
You have to split your vector (size 10 in your example) across several blocks. The block size should also preferably be a multiple of 32 (the warp size).
For example, with vector size = 1024:
VecAdd<<<16, 64>>>, where 16 is the number of blocks and 64 is the size of each block (1024 = 16 * 64).
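
To make the block/grid arithmetic above concrete, here is a minimal host-only sketch. It assumes a block size of 64 and rounds the block count up so that vectors whose length is not a multiple of 64 are still covered. The helper name blocksFor is hypothetical and not part of the original program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): number of blocks needed to
// cover n elements with threadsPerBlock threads each, rounding up.
static unsigned int blocksFor(unsigned int n, unsigned int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const unsigned int threadsPerBlock = 64;        // assumed: a multiple of 32
    const unsigned int sizes[] = {10, 1000, 1024};  // example vector lengths

    for (int k = 0; k < 3; ++k) {
        dim3 block(threadsPerBlock);                      // 64 x 1 x 1 threads
        dim3 grid(blocksFor(sizes[k], threadsPerBlock));  // blocks along x only
        printf("n = %4u -> grid.x = %u, block.x = %u, threads launched = %u\n",
               sizes[k], grid.x, block.x, grid.x * block.x);
    }
    return 0;
}
```

For n = 1024 this gives the 16 blocks of 64 threads mentioned above. For n = 10 it gives 1 block of 64 threads, so more threads are launched than there are elements, and the kernel then needs a bounds check before it touches the arrays.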

I changed my code to:

__global__ void VecAdd(float* A, float* B, float* C)
{
	int i = threadIdx.x;
	C[i] = A[i] + B[i];
}

int main(int argc, char* argv[])
{
	float *a, *b, *c;
	a = new float[1024];
	b = new float[1024];
	c = new float[1024];
	dim3 block(1,1), grid(1,1);
	float fa[1024], fb[1024], fc[1024];

	for(int j=0;j<1024;j++)
	{
		a[j]=b[j]=1;
		c[j] = 0;
	}

	cudaMalloc((void **)&a,1024*sizeof(float));
	cudaMalloc((void **)&b,1024*sizeof(float));
	cudaMalloc((void **)&c,1024*sizeof(float));

	cudaMemcpy(a,fa,1024*sizeof(float),cudaMemcpyHostToDevice);
	cudaMemcpy(b,fb,1024*sizeof(float),cudaMemcpyHostToDevice);
	cudaMemcpy(c,fc,1024*sizeof(float),cudaMemcpyHostToDevice);

	VecAdd<<<16, 64>>>(a,b,c);

	cudaMemcpy(c,fc,1024*sizeof(float),cudaMemcpyDeviceToHost);

	for(int j=0;j<1024;j++)
		printf("\n%f",fc[j]);

	getch();
	return 0;
}

My vector size is 1024, but it still doesn't give me a correct result: all the results are 0. I'm confused.

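Your memory allocation is a complete mess. You allocate a, b and c twice: first on the host with new, then again with cudaMalloc on the same pointers, so the host arrays and the values you stored in them are lost. The device allocations should go to separate pointers. Beyond that, your cudaMemcpy calls look wrong. cudaMemcpy takes the destination first and the source second, so the final device-to-host copy has its pointers reversed (it should copy from c into fc), and the host-to-device copies read from fa, fb and fc, which are never initialized.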
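
As an illustration of the allocation and copy points above, here is a minimal round-trip sketch with no kernel at all: one host input buffer, one host output buffer and one separate device buffer. The names h_in, h_out and d_buf and the helper ok are hypothetical, chosen only to keep host and device pointers apart. It assumes a working CUDA runtime.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a CUDA error and signal failure.
static bool ok(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        printf("%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int N = 1024;
    const size_t bytes = N * sizeof(float);

    // Host buffers: ordinary CPU memory, filled before any copy.
    float* h_in  = new float[N];
    float* h_out = new float[N];
    for (int j = 0; j < N; ++j) {
        h_in[j]  = 1.0f;
        h_out[j] = 0.0f;
    }

    // Device buffer: its own pointer, so h_in is never overwritten.
    float* d_buf = NULL;
    bool good = ok(cudaMalloc((void**)&d_buf, bytes), "cudaMalloc");

    // Argument order is cudaMemcpy(destination, source, byteCount, direction).
    good = good && ok(cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice),
                      "copy host -> device");
    good = good && ok(cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost),
                      "copy device -> host");

    if (good)
        printf("h_out[0] = %f, h_out[%d] = %f\n", h_out[0], N - 1, h_out[N - 1]);

    cudaFree(d_buf);   // freeing a null pointer is a no-op
    delete[] h_in;
    delete[] h_out;
    return good ? 0 : 1;
}
```

If both copies succeed, h_out holds the values that were written into h_in, which confirms that the pointers and copy directions are paired correctly before a kernel is added.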

Also, your kernel sets the index to threadIdx.x and ignores which block the thread is in, so you will only add the first 64 elements, and you will repeat that on the same 64 elements 16 times. The index has to take the block into account as well: blockIdx.x * blockDim.x + threadIdx.x.
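
Putting the replies together, one possible corrected version might look like the sketch below. It is not the original poster's file: the h_/d_ pointer names, the extra length parameter n and the bounds check are assumptions added for illustration, and the block size of 64 is taken from the earlier reply. It assumes a toolkit and device (or emulation mode) on which kernel launches work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element. The global index combines the block
// index with the thread index inside the block.
__global__ void VecAdd(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard for sizes that are not a multiple of the block size
        C[i] = A[i] + B[i];
}

int main()
{
    const int N = 1024;
    const size_t bytes = N * sizeof(float);

    // Host arrays (h_ prefix) hold the input data and receive the result.
    float* h_a = new float[N];
    float* h_b = new float[N];
    float* h_c = new float[N];
    for (int j = 0; j < N; ++j) {
        h_a[j] = 1.0f;
        h_b[j] = 1.0f;
        h_c[j] = 0.0f;
    }

    // Device arrays (d_ prefix) are separate pointers.
    float *d_a = NULL, *d_b = NULL, *d_c = NULL;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);

    // Destination first, source second.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    const int threadsPerBlock = 64;
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // 16 for N = 1024
    VecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, N);

    cudaError_t err = cudaGetLastError();   // catches launch failures
    if (err != cudaSuccess)
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));

    // Copy the result back; this call also waits for the kernel to finish.
    err = cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        printf("copy back failed: %s\n", cudaGetErrorString(err));

    int wrong = 0;
    for (int j = 0; j < N; ++j)
        if (h_c[j] != h_a[j] + h_b[j])
            ++wrong;
    printf("h_c[0] = %f, mismatches = %d\n", h_c[0], wrong);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return wrong == 0 ? 0 : 1;
}
```

With 16 blocks of 64 threads, the global index runs from 0 to 1023, so every element is written exactly once instead of the first 64 elements being written 16 times.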