A good general starting point:
(1) Each thread is responsible for producing one output element
(2) Choose between 128 and 256 threads per thread block (a multiple of 32, the warp size)
(3) Make a 1D grid that comprises enough blocks that the total number of threads covers all output elements
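
As a minimal sketch of the three steps above: the kernel name `scale`, the helper `blocks_for`, the problem size, and the use of managed memory are all assumptions chosen for illustration, not anything specific to your code. The points to note are the rounded-up division for the block count and the bounds check in the kernel, since the last block usually extends past the end of the data.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example kernel: each thread produces one output element.
__global__ void scale(const float *in, float *out, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global 1D thread index
    if (i < n) {                                    // last block may overshoot n
        out[i] = factor * in[i];
    }
}

// Hypothetical helper: number of blocks so that blocks * threads >= n
// (integer division rounded up).
static int blocks_for(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

int main()
{
    const int n = 1000000;      // number of output elements (assumed)
    const int threads = 256;    // multiple of 32, within the 128..256 range
    const int blocks = blocks_for(n, threads);

    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    scale<<<blocks, threads>>>(in, out, n, 2.0f);

    cudaError_t err = cudaGetLastError();            // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("blocks = %d, out[0] = %f, out[n-1] = %f\n",
           blocks, out[0], out[n - 1]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Changing `threads` to 128 only changes the block count; the kernel itself does not need to change because it derives its index from `blockDim.x`.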
This basic recipe has been covered in these forums multiple times. Of course, numerous variants and modifications are possible, depending on the details of the processing. For example, 2D grids may be more naturally suited to processing 2D images, where each thread block produces one tile of pixels in the image.
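
For the 2D image case, a sketch along the same lines might look like the following. The `invert` kernel, the `grid_for` helper, the 16x16 block shape (256 threads), and the 1920x1080 image size are assumptions made for illustration only; the image is assumed to be 8-bit grayscale stored row by row without padding.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: invert an 8-bit grayscale image, one pixel per thread.
// Each thread block covers one blockDim.x by blockDim.y tile of pixels.
__global__ void invert(const unsigned char *in, unsigned char *out,
                       int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row
    if (x < width && y < height) {                  // edge tiles may overshoot
        int idx = y * width + x;                    // row-major, no padding
        out[idx] = 255 - in[idx];
    }
}

// Hypothetical helper: 2D grid with enough tiles to cover the whole image
// (rounded-up division in each dimension).
static dim3 grid_for(int width, int height, dim3 block)
{
    return dim3((width + block.x - 1) / block.x,
                (height + block.y - 1) / block.y);
}

int main()
{
    const int width = 1920, height = 1080;  // assumed image size
    const dim3 block(16, 16);               // 256 threads per block
    const dim3 grid = grid_for(width, height, block);

    unsigned char *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, width * height);
    cudaMallocManaged(&out, width * height);
    for (int i = 0; i < width * height; ++i) in[i] = 10;

    invert<<<grid, block>>>(in, out, width, height);

    cudaError_t err = cudaGetLastError();            // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("grid = %u x %u blocks, out[0] = %d\n",
           grid.x, grid.y, (int)out[0]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The per-thread work is still one output element; only the indexing changes, so that neighboring threads in a block map to neighboring pixels in a tile.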