Garbage values in squares: squares of integers from 512 onwards are garbage

Hi,

Below is the code and output (see the attached screenshot) of a program that prints the squares of the first 1000 integers (0 to 999). The desired output is the squares of 0 to 999, but the program produces correct squares only for 0 to 511; after that the results are garbage values (-1163005939). I am building the program with the Visual C++ compiler on Windows XP and running it in emulation mode. I have declared two blocks with 512 threads per block, as can be seen in the following code.

Could you please tell me why I am getting wrong results for integers above 511?

Thanks in advance,

Deepak

__global__ void Squar(unsigned int *p)
{
	unsigned int i=threadIdx.x;
	p[i]=i*i;
}

int main()
{
	unsigned int i,*h,*q;
	const unsigned int p=10000;
	size_t size=p*sizeof(unsigned int);
	h=(unsigned int *)malloc(size);
	cudaMalloc((void**)&q,size);
	cudaMemcpy(q,h,size,cudaMemcpyHostToDevice);
	Squar<<<2,512>>>(q);
	cudaMemcpy(h,q,size,cudaMemcpyDeviceToHost);
	for(i=0;i<1000;i++)
	{
		printf("\n%ld",h[i]);
	}
	getch();
	free(q);
	free(h);
	return 0;
}

Your thread indexing is wrong: no thread ever writes to element 512 or beyond, so those elements keep whatever was in the buffer. You want:

unsigned int i= blockIdx.x * blockDim.x + threadIdx.x;

Each block has 512 threads numbered 0 to 511 by threadIdx.x.

The second block has the same numbering as the first. threadIdx gives the thread's number within its block, not within the whole kernel launch.

If you want a unique index for each thread across the whole kernel launch, use a numbering like:

int kerneltid=threadIdx.x+blockIdx.x*blockDim.x;

p[kerneltid]=kerneltid*kerneltid;

In your example you’ll get threads with kerneltid going from 0 to 1023, which covers the 1000 elements you print and stays inside your 10000-element buffer.
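Putting the two replies together, here is a minimal sketch of how the whole program might look with the grid-wide index. The names Square, squaresOnDevice, n and threadsPerBlock are made up for this illustration and are not from the original post. It also assumes you only want 1000 elements, so the kernel takes the element count and skips the 24 extra threads (2 * 512 = 1024). Two other things differ from the posted code: the device pointer is released with cudaFree rather than free, and the values are printed with %u because they are unsigned int. As a side note, -1163005939 is 0xBAADF00D read as a signed 32-bit integer, which looks like a fill pattern for uninitialized heap memory on Windows rather than anything the kernel computed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread squares the element at its grid-wide index.
__global__ void Square(unsigned int *p, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // the grid may have more threads than elements
        p[i] = i * i;
}

// Hypothetical helper: fills h[0..n-1] with squares computed on the device.
// Returns 0 on success, 1 on any CUDA error.
int squaresOnDevice(unsigned int *h, unsigned int n)
{
    const unsigned int threadsPerBlock = 512;
    // Round up so every element gets a thread (2 blocks for n = 1000).
    const unsigned int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    const size_t size = n * sizeof(unsigned int);

    unsigned int *d = NULL;
    if (cudaMalloc((void **)&d, size) != cudaSuccess)
        return 1;

    Square<<<blocks, threadsPerBlock>>>(d, n);

    // The copy waits for the kernel to finish and reports any pending error.
    cudaError_t err = cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d);            // device memory is released with cudaFree, not free
    return err == cudaSuccess ? 0 : 1;
}

int main()
{
    const unsigned int n = 1000;
    unsigned int *h = (unsigned int *)malloc(n * sizeof(unsigned int));
    if (h == NULL)
        return 1;

    int status = squaresOnDevice(h, n);
    if (status == 0) {
        for (unsigned int i = 0; i < n; i++)
            printf("%u\n", h[i]);
    }

    free(h);
    return status;
}
```

There is no host-to-device copy here because the kernel overwrites every element it is responsible for, so the initial contents of the device buffer do not matter.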