Question: overlapping kernel computation with host computation in a loop

Hi

I would like to know how to run a kernel function and a host function simultaneously to reduce the elapsed time.

Example:

for (;;)
{
    // computation on the host (elapsed time: 400 us)

    // computation in the kernel (elapsed time: 400 us)
}

Run one after the other, the two steps take about 800 us per cycle (one loop iteration).

How can I make one cycle finish in under 500 us?
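
To illustrate what I am hoping for, here is a minimal sketch, assuming CUDA and assuming the host step and the kernel step of one cycle do not depend on each other's results. It relies on the fact that a kernel launch returns control to the host right away. The names deviceStep and hostStep, the buffer size, and the arithmetic inside them are hypothetical placeholders, not my real code, and I am not sure this is the right approach.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the ~400 us of GPU work.
__global__ void deviceStep(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * 2.0f + 1.0f;
    }
}

// Hypothetical stand-in for the ~400 us of CPU work.
static void hostStep(float *data, int n)
{
    for (int i = 0; i < n; ++i) {
        data[i] += 1.0f;
    }
}

int main()
{
    const int n = 1 << 16;          // assumed problem size
    float *h_data = new float[n](); // host buffer, zero-initialized
    float *d_data = nullptr;        // device buffer

    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    for (int cycle = 0; cycle < 10; ++cycle) {
        // The launch is asynchronous: it queues the kernel and returns.
        deviceStep<<<(n + 255) / 256, 256>>>(d_data, n);

        // The host computation runs while the GPU executes the kernel.
        hostStep(h_data, n);

        // Wait for the kernel before starting the next cycle.
        cudaDeviceSynchronize();
    }

    cudaError_t err = cudaGetLastError();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    delete[] h_data;
    return 0;
}
```

If this works the way I expect, one cycle would cost roughly the longer of the two steps rather than their sum. If the host step needs the kernel's output from the same cycle, this simple ordering would not apply.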

Thanks.