For example, I have a 2400 x 2400 image that I want to divide into sub-images of, say, 60 x 60 pixels. Each pixel is independent, and each sub-image requires linear interpolation.

Suppose I write a kernel that processes one sub-image and launch it 1600 times (once per sub-image), versus a kernel that processes, say, 40 sub-images per launch, so that only 40 launches are needed. Does anyone know whether there is a significant performance difference between the two? Does launching a kernel carry significant overhead?
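
To make the numbers concrete, here is a small host-side C++ sketch of the launch arithmetic. It is not GPU code, and the `Tiling` struct and the function names are made up purely for illustration; it only assumes the tile size divides the image size evenly.

```cpp
#include <cassert>

// Hypothetical description of the tiling in my question.
struct Tiling {
    int image_side;  // image width and height in pixels, e.g. 2400
    int tile_side;   // sub-image width and height in pixels, e.g. 60
};

// Total number of sub-images, assuming the tile size divides the image size.
int tile_count(const Tiling& t) {
    assert(t.tile_side > 0 && t.image_side % t.tile_side == 0);
    int per_side = t.image_side / t.tile_side;
    return per_side * per_side;
}

// Number of kernel launches needed when each launch handles
// `tiles_per_launch` sub-images (rounded up).
int launch_count(const Tiling& t, int tiles_per_launch) {
    assert(tiles_per_launch > 0);
    int tiles = tile_count(t);
    return (tiles + tiles_per_launch - 1) / tiles_per_launch;
}
```

With `Tiling{2400, 60}`, `launch_count(t, 1)` gives 1600 launches and `launch_count(t, 40)` gives 40, so the comparison is 1600 launches versus 40 launches for the same total amount of work.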

Thanks for any advice.