domain operator and optimal grid/block sizes

Hi,

I am new to CUDA. I have some limited experience with Brook, an open-source library for GPGPU computation, and I managed to get a 35x speedup using Brook on an NVIDIA 8800GT card. Now I want to play with CUDA a little and see if it can do better.

There is a domain() operator in Brook, which allows one to select a subset of a stream (a texture) and apply a kernel function only to the selected elements, for example:

a_gpu_kernel(ins.domain(int2(2,2),int2(4,4)),outs.domain(int2(2,2),int2(4,4)));

will run the kernel only for the elements between indices (2,2) and (4,4).
I am wondering if there is a similar operator in CUDA.
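
To make the question concrete, here is a minimal sketch of how I imagine the same thing could be emulated by hand if CUDA has no such operator: pass the corners of the region to the kernel, offset each thread's index by the lower corner, and skip threads that fall outside. The kernel name scale_subregion, the "multiply by 2" body, the 8x8 array, and treating the region as half-open (upper corner excluded) are all my own assumptions for illustration, and I use a plain global-memory array instead of a texture.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element inside the sub-rectangle
// [x0, x1) x [y0, y1) of a row-major array that is `width` elements wide.
__global__ void scale_subregion(const float *in, float *out, int width,
                                int x0, int y0, int x1, int y1)
{
    // Shift the thread index by the lower corner of the region.
    int x = x0 + blockIdx.x * blockDim.x + threadIdx.x;
    int y = y0 + blockIdx.y * blockDim.y + threadIdx.y;

    // Threads that land outside the region do nothing.
    if (x < x1 && y < y1)
        out[y * width + x] = 2.0f * in[y * width + x];
}

int main()
{
    const int width = 8, height = 8, n = width * height;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, n * sizeof(float));  // untouched elements stay 0

    // Assumption: the region is half-open, so x and y take the values 2 and 3.
    const int x0 = 2, y0 = 2, x1 = 4, y1 = 4;
    dim3 block(16, 16);
    dim3 grid((x1 - x0 + block.x - 1) / block.x,   // round up so the whole
              (y1 - y0 + block.y - 1) / block.y);  // region is covered
    scale_subregion<<<grid, block>>>(d_in, d_out, width, x0, y0, x1, y1);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) printf("%.0f ", h_out[y * width + x]);
        printf("\n");
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

If there is a built-in or more idiomatic way to restrict a kernel to a sub-region, I would prefer that over this manual offset and bounds check.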

Also, my kernel function is a very simple finite difference operator. It involves a few algebraic operations per pixel, and the output pixels are all independent of each other. If I translate it to CUDA, what grid size and block size should I supply? Should I use my array size (a 2D texture) as the grid size and set the block size to dim3(1,1,1)?
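
For reference, this is a rough sketch of the alternative launch configuration I am considering: one thread per pixel, grouped into 16x16 blocks, with the grid rounded up to cover the whole array. My understanding so far is that threads are scheduled in warps of 32, so a dim3(1,1,1) block would leave most of each warp idle, but the 16x16 figure is a guess and not a tuned value. The kernel name diff_x, the central difference in x, the 100x60 array size, and the use of a plain global-memory array instead of a texture are assumptions made only for this illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: central difference in x, one thread per output pixel.
__global__ void diff_x(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // The grid is rounded up, so some threads fall outside the array.
    if (x >= width || y >= height) return;

    // Border columns lack a neighbour on one side; write 0 there.
    float d = 0.0f;
    if (x > 0 && x < width - 1)
        d = 0.5f * (in[y * width + x + 1] - in[y * width + x - 1]);
    out[y * width + x] = d;
}

int main()
{
    const int width = 100, height = 60, n = width * height;
    std::vector<float> h_in(n), h_out(n);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            h_in[y * width + x] = static_cast<float>(x);  // simple ramp in x

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // 16 x 16 = 256 threads per block (an untuned guess).
    dim3 block(16, 16);
    // Round up so that width and height need not be multiples of 16.
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    diff_x<<<grid, block>>>(d_in, d_out, width, height);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out at (x=50, y=30): %f\n", h_out[30 * width + 50]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

I would like to know whether a layout like this is the right direction, or whether the grid-equals-array-size approach with single-thread blocks is acceptable for such a simple kernel.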

Thank you.