MyFirstCuda

kasra515 · February 9, 2010, 11:05pm

Hi. I'm an amateur CUDA programmer. I wrote a program that adds two vectors, but it doesn't work correctly and gives me an incorrect result. I have uploaded my file.
I'm using Visual Studio 2008. I don't have a GPU, so I'm using DebugSimulator. Please tell me where the problem is. I also have trouble with the concept of blocks and grids and why they matter. Please help me. Thanks :-)
sample.cu (2.57 KB)

kasra515 · February 10, 2010, 9:09am

Could someone please answer?

You7878 · February 10, 2010, 9:45am

You are launching the kernel on a grid with 1 block of size 1x1, so you will get only the sum of the first elements.
You have to split your vector (size 10 in your example) across several blocks. The block size should also be a multiple of 32 (the warp size).
For example, with vector size = 1024:
VecAdd<<<dim3(16, 1), dim3(64, 1)>>>, where 16 is the number of blocks and 64 is the size of each block (1024 = 16 * 64).
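To make the 1024 = 16 * 64 arithmetic concrete, here is a minimal host-side sketch of how the grid size could be derived from the vector size. The helper name `blocksFor` and the extra sizes 10 and 1000 are only illustrative assumptions, not something from the attached file.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: smallest number of blocks that covers n elements.
static unsigned int blocksFor(unsigned int n, unsigned int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
}

int main()
{
    const unsigned int threadsPerBlock = 64;  // a multiple of the 32-thread warp size
    const unsigned int sizes[] = {10, 1024, 1000};

    for (unsigned int n : sizes) {
        dim3 block(threadsPerBlock, 1);
        dim3 grid(blocksFor(n, threadsPerBlock), 1);
        printf("n=%u -> grid.x=%u, block.x=%u, threads launched=%u\n",
               n, grid.x, block.x, grid.x * block.x);
    }
    return 0;
}
```

For n = 1024 this gives 16 blocks of 64 threads. For n = 10 or n = 1000 the launch creates more threads than elements, so a kernel launched this way would need a guard such as `if (i < n)` before touching memory.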

kasra515 · February 10, 2010, 12:04pm

I changed my code to:

__global__ void VecAdd(float* A, float* B, float* C)
{
	int i = threadIdx.x;
	C[i] = A[i] + B[i];
}

int main(int argc, char* argv[])
{
	float *a, *b, *c;
	a = new float[1024]; b = new float[1024]; c = new float[1024];
	dim3 block(1,1), grid(1,1);
	float fa[1024], fb[1024], fc[1024];

	for(int j=0;j<1024;j++)
	{
		a[j]=b[j]=1;
		c[j] = 0;
	}

	cudaMalloc((void **)&a,1024*sizeof(float));
	cudaMalloc((void **)&b,1024*sizeof(float));
	cudaMalloc((void **)&c,1024*sizeof(float));

	cudaMemcpy(a,fa,1024*sizeof(float),cudaMemcpyHostToDevice);
	cudaMemcpy(b,fb,1024*sizeof(float),cudaMemcpyHostToDevice);
	cudaMemcpy(c,fc,1024*sizeof(float),cudaMemcpyHostToDevice);

	VecAdd<<<16, 64>>>(a,b,c);

	cudaMemcpy(c,fc,1024*sizeof(float),cudaMemcpyDeviceToHost);

	for(int j=0;j<1024;j++)
		printf("\n%f",fc[j]);

	getch();
	return 0;
}

My vector size is 1024, but it still doesn't give me a correct result: all results are 0. I'm confused.

avidday · February 10, 2010, 12:13pm

Your memory allocation is a complete mess. You are double allocating a, b, c and losing their contents in the process: the cudaMalloc calls overwrite the pointers returned by new. The device memory allocations should be made to separate pointers. Further to that, your cudaMemcpy calls look wrong: the host-to-device copies read from fa, fb and fc, which are never initialized, and the final device-to-host copy has the source and destination pointers reversed.
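For reference, a minimal sketch of how the host side could look with separate host and device pointers, keeping in mind that cudaMemcpy takes its arguments as (destination, source, byte count, direction). The `h_`/`d_` prefixes, the helper `addOnDevice`, and the extra `n` parameter with a global index in the kernel are assumptions made for this sketch, not part of the posted code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same addition as in the thread, but with a global index and a length guard
// (both are additions for this sketch, not part of the posted kernel).
__global__ void VecAdd(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}

// Hypothetical helper: adds two host vectors on the device.
// Returns true if every CUDA call succeeded.
bool addOnDevice(const float* h_a, const float* h_b, float* h_c, int n)
{
    const size_t bytes = n * sizeof(float);
    float *d_a = NULL, *d_b = NULL, *d_c = NULL;  // device pointers, distinct from the host ones

    bool ok = cudaMalloc((void**)&d_a, bytes) == cudaSuccess
           && cudaMalloc((void**)&d_b, bytes) == cudaSuccess
           && cudaMalloc((void**)&d_c, bytes) == cudaSuccess;

    // Inputs go host -> device: destination is the device pointer.
    ok = ok && cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice) == cudaSuccess
            && cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threads = 64;
        const int blocks = (n + threads - 1) / threads;  // 16 blocks for n = 1024
        VecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
        // The result goes device -> host: destination is the host pointer.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);  // cudaFree(NULL) is a no-op
    return ok;
}

int main()
{
    const int n = 1024;
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int j = 0; j < n; j++) { h_a[j] = h_b[j] = 1.0f; h_c[j] = 0.0f; }

    if (!addOnDevice(h_a, h_b, h_c, n)) {
        printf("A CUDA call failed\n");
    } else {
        int wrong = 0;
        for (int j = 0; j < n; j++)
            if (h_c[j] != 2.0f) wrong++;
        printf("%d of %d elements differ from 2.0\n", wrong, n);
    }

    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}
```

The host arrays are filled before they are copied, and nothing reuses a host pointer for a device allocation, so the data set up in the loop is what actually reaches the kernel.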

anthonyfmorse · February 11, 2010, 1:15pm

Also, your kernel sets the index to threadIdx.x and ignores which block the thread is in, so you will only add the first 64 elements, and you will repeat that on the same 64 elements 16 times. The index needs to include the block, e.g. blockIdx.x * blockDim.x + threadIdx.x.
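A small sketch that makes this visible by counting how many distinct elements get written under each indexing scheme, using the same 16 x 64 launch. The kernel `MarkTouched` and the helper `countTouched` are hypothetical names used only for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Marks the element each thread would process. With useBlockIndex == false the
// index is the one from the posted kernel; with true it is the global index.
__global__ void MarkTouched(int* touched, bool useBlockIndex)
{
    int i = useBlockIndex ? blockIdx.x * blockDim.x + threadIdx.x
                          : threadIdx.x;
    touched[i] = 1;
}

// Hypothetical helper: launches 16 blocks of 64 threads over 1024 elements and
// returns how many distinct elements were written (-1 on a CUDA error).
int countTouched(bool useBlockIndex)
{
    const int n = 1024;
    static int h_touched[n];
    int* d_touched = NULL;

    if (cudaMalloc((void**)&d_touched, n * sizeof(int)) != cudaSuccess)
        return -1;
    cudaMemset(d_touched, 0, n * sizeof(int));

    MarkTouched<<<16, 64>>>(d_touched, useBlockIndex);

    cudaError_t err = cudaMemcpy(h_touched, d_touched, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_touched);
    if (err != cudaSuccess)
        return -1;

    int count = 0;
    for (int j = 0; j < n; j++)
        count += h_touched[j];
    return count;
}

int main()
{
    printf("threadIdx.x only: %d of 1024 elements written\n", countTouched(false));
    printf("blockIdx.x * blockDim.x + threadIdx.x: %d of 1024 elements written\n",
           countTouched(true));
    return 0;
}
```

On a working device the first line should report 64 and the second 1024: with threadIdx.x alone, all 16 blocks write to the same first 64 slots.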
