Combination of Task Parallelism and Data Parallelism.

I am using a Quadro FX 880 card. In my image segmentation code, I have divided the image into 4 parts (i.e. if there are 4000 pixels, each part has 1000 pixels). I have 8 kernels in my code: the first 4 are to be executed in parallel, and the next 4 are also to be executed in parallel, but only after the first 4 have finished.

Is this possible if I use the same command queue for all 8 kernels, issue one clEnqueueNDRangeKernel command for each of the first four kernels, and pass the out-of-order property (CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) when creating the command queue? And if it is possible, how do I execute the next four kernels in parallel once the first four are done? Can I call clWaitForEvents after enqueueing the first four kernels and then enqueue the next four? Will this guarantee that the first four kernels are executed in parallel, and that the next four are executed after them, also in parallel?
I think clEnqueueTask would make my code slow, since I have about 1000 pixels per kernel and clEnqueueTask runs the kernel with both global_work_size and local_work_size fixed at 1.
I am not sure whether all of this can be done, or which parts are right or wrong, so I just need a confirmation. If it cannot be done this way, please suggest an alternative.