How can I launch a kernel to provide one thread per pixel in an image?

I have a ray tracer that I am working on, and I was wondering how I could launch my kernel so that each pixel in the output image gets its own thread. The next logical question would be: inside of the kernel, how can I tell which pixel (as in the X and Y coordinates) I am generating a color for?

I appreciate any and all help provided, and thank you in advance.

You might take a look at the image processing examples in the CUDA SDK.
Also have a look at the Stack Overflow question "Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)".

Below is an example from my code, where each GPU thread handles exactly one pixel.

The host code goes like this:

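dim3 dimBlock = dim3(16, 16);
// Round up so that partial blocks at the right and bottom edges are still launched.
int xBlocks = src.width / dimBlock.x + ((src.width % dimBlock.x) == 0 ? 0 : 1);   // blocks along x cover the width
int yBlocks = src.height / dimBlock.y + ((src.height % dimBlock.y) == 0 ? 0 : 1); // blocks along y cover the height
dim3 dimGrid = dim3(xBlocks, yBlocks);
texKernel<<<dimGrid, dimBlock>>>(srcImgBuf, dstImgBuf, imgDimension, bitDepth);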
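
The block counts in the host code are a ceiling division of an image dimension by the matching block dimension. Here is a minimal sketch of that arithmetic as a reusable helper. The names `GridSize`, `divUp` and `gridFor` are hypothetical and not part of CUDA; `GridSize` is a plain struct standing in for `dim3` so the snippet compiles without the CUDA headers.

```cpp
// Hypothetical stand-in for CUDA's dim3 (assumption: only x and y are needed).
struct GridSize { int x; int y; };

// Ceiling division: how many blocks of size `block` are needed to cover `n` pixels.
inline int divUp(int n, int block) { return (n + block - 1) / block; }

// Blocks along x are sized from the width, blocks along y from the height.
inline GridSize gridFor(int width, int height, int blockX, int blockY) {
    return GridSize{ divUp(width, blockX), divUp(height, blockY) };
}
```

For a 1024x768 image with 16x16 blocks, `gridFor(1024, 768, 16, 16)` gives 64 blocks along x and 48 along y. In real host code the two values would be passed to `dim3`.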

Kernel:

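__global__
// Parameter types are omitted; dim must expose width and height.
void texKernel(srcImgBuf, dstImgBuf, dim, int iplDepth)
{
    // Global pixel coordinates of this thread.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // The grid is rounded up to whole blocks, so skip threads that fall outside the image.
    if (y < dim.height && x < dim.width)
    {
        // do something for pixel (x,y)
    }
}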
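
The kernel leaves the per-pixel work open. For a ray tracer, the usual next step is to turn (x, y) into an offset into the output buffer. Here is a minimal sketch, assuming a row-major buffer with one element per pixel and no padding between rows. The names `HD` and `pixelIndex` are hypothetical. The macro only exists so that the same helper compiles both under nvcc (callable from a kernel) and under a plain C++ compiler.

```cpp
// Expands to __host__ __device__ under nvcc, and to nothing otherwise.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Row-major offset of pixel (x, y) in a buffer that is `width` pixels wide.
// Assumption: rows are tightly packed (pitch == width) and the image is small
// enough that the offset fits in an int.
HD inline int pixelIndex(int x, int y, int width) {
    return y * width + x;
}
```

Inside the bounds check, the kernel could then write its result to something like `dstImgBuf[pixelIndex(x, y, dim.width)]`, provided the destination buffer really is laid out this way.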

Awesome, thank you! The link was very informative, thanks for that as well. For some reason, when I tried rendering a 1024x768 image, only the left 768x768 square was drawn. To fix that, I set the kernel launch to cover the next power-of-two dimensions and rely on bounds checking to prevent drawing outside of the image.
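
A symptom like this, where only the left 768 columns of a 1024-wide image are written, is what one would expect if the number of blocks along x were sized from the image height instead of the width. Here is a rough host-side check of that explanation, assuming 16x16 blocks. `divUp` and `coveredColumns` are hypothetical helper names, not CUDA functions.

```cpp
#include <algorithm>

// Ceiling division: how many blocks of size `block` are needed to cover `n` pixels.
inline int divUp(int n, int block) { return (n + block - 1) / block; }

// Number of image columns that get a thread, given the block count along x.
inline int coveredColumns(int xBlocks, int blockX, int width) {
    return std::min(xBlocks * blockX, width);
}
```

With the x block count taken from the height, `coveredColumns(divUp(768, 16), 16, 1024)` is 768. With the x block count taken from the width, `coveredColumns(divUp(1024, 16), 16, 1024)` is 1024. Padding the launch up to a power of two hides the problem because the grid is then at least 1024 threads wide in both directions.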