How to call a kernel that needs no parallelization

My kernel computes the bounding rect of a gray image (the rect that surrounds all the pixels whose gray level is 255), and the rect is then extended for the next calculation. Because the bounding rect data is stored in device memory, I wrote another kernel to do the extension. That kernel is very simple and needs no parallelization, so I launch it with the grid size and block size both set to 1. I'm not sure whether this is a good choice. Can anyone suggest a better way to execute a kernel that needs no parallelization?

Below is my code for the bounding rect calculation and the rect extension:

extern "C" __global__ void get_mask_rect(const UCHAR alpha, int w,int h,int astep,int * LTRB)
int tx,ty;
ty + tx] == 255)

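extern "C" __global__ void extent_roi(const int * LTRB,int w,int h,const int WSZ,int * roi)
{
    // Grow the bounding rect (left, top, right, bottom) by the window size.
    int l = LTRB[0] - WSZ;
    int r = LTRB[2] + WSZ;
    int t = LTRB[1] - WSZ;
    int b = LTRB[3] + WSZ;
    // Extra border proportional to the grown rect's width plus height.
    int BW = ((r-l) + (b-t))/16;
    // Apply the border and clamp to the image bounds.
    l = max(0,l-BW);
    r = min(w-1,r+BW);
    t = max(0,t-BW);
    b = min(h-1,b+BW);
    // Store the ROI as x, y, width, height.
    roi[0] = l;
    roi[1] = t;
    roi[2] = r-l+1;
    roi[3] = b-t+1;
}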

// the code to call kernel

Yeah, this is generally how I run serial operations on the device.

I can't really think of a better way either, other than copying everything back to the host, which would likely be far less efficient.
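
For reference, here is a minimal sketch of what such a single-thread launch might look like. It is not the original poster's calling code, which isn't shown. The names extend_rect, extent_roi_single, imin and imax, and the image size, window size and rect values are all made up for illustration. The rect is uploaded from the host only to stand in for the output of get_mask_rect, and the result is copied back only so it can be printed; in the real pipeline both buffers would stay on the device.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helpers, usable on both host and device.
__host__ __device__ inline int imax(int a, int b) { return a > b ? a : b; }
__host__ __device__ inline int imin(int a, int b) { return a < b ? a : b; }

// Same arithmetic as extent_roi in the question, factored out so it can
// also be checked on the CPU. LTRB = left, top, right, bottom.
__host__ __device__ inline void extend_rect(const int *LTRB, int w, int h,
                                            int WSZ, int *roi)
{
    int l = LTRB[0] - WSZ;
    int r = LTRB[2] + WSZ;
    int t = LTRB[1] - WSZ;
    int b = LTRB[3] + WSZ;
    int BW = ((r - l) + (b - t)) / 16;
    l = imax(0, l - BW);
    r = imin(w - 1, r + BW);
    t = imax(0, t - BW);
    b = imin(h - 1, b + BW);
    roi[0] = l;
    roi[1] = t;
    roi[2] = r - l + 1;
    roi[3] = b - t + 1;
}

// Serial kernel: only thread 0 of block 0 does the work, so the result is
// written once even if the kernel is launched with a larger configuration.
__global__ void extent_roi_single(const int *LTRB, int w, int h,
                                  int WSZ, int *roi)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        extend_rect(LTRB, w, h, WSZ, roi);
}

int main()
{
    // Made-up bounding rect standing in for the output of get_mask_rect.
    int h_ltrb[4] = {100, 80, 200, 160};
    int *d_ltrb = nullptr, *d_roi = nullptr;
    cudaMalloc(&d_ltrb, 4 * sizeof(int));
    cudaMalloc(&d_roi, 4 * sizeof(int));
    cudaMemcpy(d_ltrb, h_ltrb, 4 * sizeof(int), cudaMemcpyHostToDevice);

    // One block, one thread: the input and output never leave the device.
    extent_roi_single<<<1, 1>>>(d_ltrb, 640, 480, 8, d_roi);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Copy back only to inspect the result (this call also synchronizes).
    int h_roi[4];
    cudaMemcpy(h_roi, d_roi, 4 * sizeof(int), cudaMemcpyDeviceToHost);
    assert(h_roi[2] > 0 && h_roi[3] > 0);
    printf("roi = %d %d %d %d\n", h_roi[0], h_roi[1], h_roi[2], h_roi[3]);

    cudaFree(d_ltrb);
    cudaFree(d_roi);
    return 0;
}
```

With these made-up inputs the printed ROI should be 79 59 143 123. The thread guard isn't needed for a <<<1, 1>>> launch, but it keeps the kernel correct if someone later launches it with more threads.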