Confused about number of threads, blocks, grids... My first CUDA app

When you’ve got an array full of data:

int array[100000];

and a kernel which takes this array as an argument

__global__ void myKernel(int* theArray, int* results)
{
    unsigned int i = threadIdx.x;
    results[i] = theArray[i] * theArray[i];
}

How do you set up the launch parameters (grid and block sizes) so that the kernel runs once for each element in “array”, like the sequential version below?

void sequential(int* theArray, int* results)
{
    for (unsigned int i = 0; i < 100000; i++)
        results[i] = theArray[i] * theArray[i];
}

The way to think about it is that the explicit for loop in your last code box is now done implicitly by the hardware: instead of one thread stepping through every i, each GPU thread executes the loop body once for its own i.
You effectively launch 100 000 threads, each working on its assigned piece of data, as your second code box shows.

There is no single way to set up the grid and thread block parameters. All you need to ensure is that, in the end, you launch at least 100 000 threads in total (threads per block times number of blocks).
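To make the "many possible configurations" point concrete, here is a small host-only sketch that computes, for a few block sizes, the smallest number of blocks that covers 100 000 elements using ceiling division. The helper name blocksFor is hypothetical, and the block sizes listed are just examples (multiples of the 32-thread warp size are the usual choice).

```cuda
// Sketch: several equally valid 1-D launch configurations for n = 100000.
// Host-only code; compiles with nvcc as a .cu file.
#include <cassert>
#include <cstdio>

// Hypothetical helper. Ceiling division: the smallest number of blocks
// whose threads together cover all n elements.
unsigned int blocksFor(unsigned int n, unsigned int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const unsigned int n = 100000;
    const unsigned int sizes[] = {128, 256, 512, 1024};
    for (unsigned int t : sizes) {
        unsigned int b = blocksFor(n, t);
        // b blocks are enough, and b - 1 blocks would not be.
        assert(b * t >= n && (b - 1) * t < n);
        std::printf("%4u threads/block x %3u blocks = %6u threads (%u idle)\n",
                    t, b, b * t, b * t - n);
    }
    return 0;
}
```

With 256 threads per block it reports 391 blocks and 100 096 threads, 96 of which have no element to process.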

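You could, for example, use thread blocks of 256 threads and a one-dimensional grid of 391 thread blocks (100 000 / 256 rounded up, giving 100 096 threads). As I've said, this is one of many possible configurations. Note that once you use more than one block, threadIdx.x alone is not enough, because it restarts at 0 in every block. Each thread should compute its global index as blockIdx.x * blockDim.x + threadIdx.x and check it against 100 000, so the 96 surplus threads in the last block do nothing.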
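A minimal end-to-end sketch of that 256 × 391 launch, assuming a CUDA-capable device. Besides the global index and bounds check, it shows a step the question's code skips: a kernel cannot read a host array directly, so the data is copied to device memory with cudaMalloc/cudaMemcpy first. The names squareKernel, squareOnDevice and the CHECK macro are hypothetical, chosen for illustration.

```cuda
// Sketch: square n ints on the GPU using a 1-D grid of 1-D blocks.
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-checking macro: abort on any CUDA runtime error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                         cudaGetErrorString(err), __FILE__, __LINE__);  \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

__global__ void squareKernel(const int* theArray, int* results, unsigned int n)
{
    // Global index: which element this thread owns across the whole grid.
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // surplus threads in the last block skip the work
        results[i] = theArray[i] * theArray[i];
}

// Hypothetical helper: copy input to the device, launch, copy results back.
void squareOnDevice(const int* hostIn, int* hostOut, unsigned int n,
                    unsigned int threadsPerBlock)
{
    int* dIn = nullptr;
    int* dOut = nullptr;
    size_t bytes = n * sizeof(int);
    CHECK(cudaMalloc(&dIn, bytes));
    CHECK(cudaMalloc(&dOut, bytes));
    CHECK(cudaMemcpy(dIn, hostIn, bytes, cudaMemcpyHostToDevice));

    // Ceiling division: 391 blocks for n = 100000 and 256 threads per block.
    unsigned int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    assert(blocks * threadsPerBlock >= n);
    squareKernel<<<blocks, threadsPerBlock>>>(dIn, dOut, n);
    CHECK(cudaGetLastError());  // reports launch-configuration errors

    // This blocking copy also waits for the kernel to finish.
    CHECK(cudaMemcpy(hostOut, dOut, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaFree(dIn));
    CHECK(cudaFree(dOut));
}

int main()
{
    const unsigned int n = 100000;
    std::vector<int> in(n), out(n);
    for (unsigned int i = 0; i < n; ++i)
        in[i] = static_cast<int>(i % 1000);  // small values so squares fit in int

    squareOnDevice(in.data(), out.data(), n, 256);

    // Compare against the sequential loop from the question.
    for (unsigned int i = 0; i < n; ++i) {
        if (out[i] != in[i] * in[i]) {
            std::printf("mismatch at %u\n", i);
            return 1;
        }
    }
    std::printf("all %u results match the sequential loop\n", n);
    return 0;
}
```

Passing n into the kernel, rather than hard-coding 100000, lets the same code work for any array size and block size.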

Ok, this is starting to make sense to me. Thanks for the help!