Kernel design problem: performance difference depending on the number of kernel launches

For example, I have an image of size 2400 x 2400. I want to divide it into sub-images of, say, 60 x 60. Each pixel is independent. Each sub-image requires linear interpolation.

Suppose I write a kernel that processes one sub-image and launch it 1600 times, compared with launching a kernel that processes, say, 40 sub-images at a time. Does anyone know if there is a significant performance difference between the two? Will launching the kernel add significant overhead?

Thanks for any advice.

How long does the actual processing of a sub-image take? The overhead of launching a kernel is tens of microseconds (depending on how many blocks you have), so 1600 launches add up to tens of milliseconds of overhead in total. If the processing time per sub-image is very short, you would be better off grouping them into fewer kernel launches.
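
If you want to measure this rather than guess, here is a rough sketch that times the same work split into 1600, 40, and 1 launches using CUDA events. The kernel name processSubImages, the helper timeMs, the one-block-per-sub-image mapping, the 256-thread block size, and the placeholder processPixel (which stands in for your interpolation) are all my assumptions, not anything from your code. Error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int IMG = 2400;                 // image width and height, from the question
constexpr int SUB = 60;                   // sub-image width and height, from the question
constexpr int SUBS_PER_SIDE = IMG / SUB;  // 40 per side, 1600 sub-images in total

// Hypothetical per-pixel work; a placeholder for the real interpolation.
__device__ float processPixel(float v) { return 0.5f * v; }

// One block per sub-image. The launch covers gridDim.x consecutive
// sub-images, starting at sub-image index `first`.
__global__ void processSubImages(const float* in, float* out, int first)
{
    int sub = first + blockIdx.x;
    int x0 = (sub % SUBS_PER_SIDE) * SUB;  // top-left corner of this sub-image
    int y0 = (sub / SUBS_PER_SIDE) * SUB;

    // The threads of the block stride over the 3600 pixels of the sub-image.
    for (int i = threadIdx.x; i < SUB * SUB; i += blockDim.x) {
        int x = x0 + i % SUB;
        int y = y0 + i / SUB;
        out[y * IMG + x] = processPixel(in[y * IMG + x]);
    }
}

// Times the whole image processed as `launches` equal-sized kernel launches.
// `launches` must divide 1600 evenly.
float timeMs(const float* in, float* out, int launches)
{
    int perLaunch = SUBS_PER_SIDE * SUBS_PER_SIDE / launches;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int k = 0; k < launches; ++k)
        processSubImages<<<perLaunch, 256>>>(in, out, k * perLaunch);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    size_t bytes = size_t(IMG) * IMG * sizeof(float);
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);
    cudaMemset(in, 0, bytes);

    timeMs(in, out, 1);  // warm-up run, result discarded

    printf("1600 launches: %.3f ms\n", timeMs(in, out, 1600));
    printf("  40 launches: %.3f ms\n", timeMs(in, out, 40));
    printf("   1 launch:   %.3f ms\n", timeMs(in, out, 1));

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Since the pixels are independent, nothing stops you from going all the way to a single launch with 1600 blocks. The numbers will depend on your GPU and on how expensive the real per-pixel work is, so swap in your interpolation before drawing conclusions.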