Please help me build the following workflow.

I have two kernels. foo1() should run as multiple instances. Once all instances of foo1() have finished, foo2() should run as a single instance. foo2() decides whether the work is done; if it is not, multiple instances of foo1() should run again, and the cycle repeats.

All of this should happen without interacting with the host, since I guess the overhead of going back to the host on every iteration would be dramatic.
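
To make the intended control flow concrete, here is a minimal sketch of the loop, assuming CUDA and its cooperative groups API (that choice is an assumption; the same loop could be written for another API). The bodies of foo1() and foo2() and the names workflow, run_workflow, done, and rounds are hypothetical placeholders. A single persistent kernel runs the whole cycle, with a grid-wide barrier standing in for "wait for all instances of foo1()".

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <cstdio>

namespace cg = cooperative_groups;

// Hypothetical stand-in for foo1(): each thread increments its share of the data.
__device__ void foo1(int *data, unsigned n, unsigned long long rank,
                     unsigned long long size) {
    for (unsigned long long i = rank; i < n; i += size) data[i] += 1;
}

// Hypothetical stand-in for foo2(): stop once the first element reaches 5.
__device__ bool foo2(const int *data) { return data[0] >= 5; }

__global__ void workflow(int *data, unsigned n, volatile int *done, int *rounds) {
    cg::grid_group grid = cg::this_grid();
    while (true) {
        foo1(data, n, grid.thread_rank(), grid.size());
        grid.sync();                      // wait for every foo1() instance
        if (grid.thread_rank() == 0) {    // single instance of foo2()
            *rounds += 1;
            *done = foo2(data) ? 1 : 0;
        }
        grid.sync();                      // make the decision visible to all threads
        if (*done) break;
    }
}

// Launches the loop once; returns the number of rounds, or -1 on failure.
int run_workflow(unsigned n) {
    int device = 0, supported = 0, smCount = 0, blocksPerSm = 0;
    const int threads = 128;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, device);
    if (!supported) return -1;            // grid.sync() needs a cooperative launch

    // A cooperative grid must be fully resident, so size it from occupancy.
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, workflow, threads, 0);
    if (blocksPerSm < 1) return -1;

    int *data = nullptr, *done = nullptr, *rounds = nullptr;
    cudaMalloc(&data, n * sizeof(int));
    cudaMalloc(&done, sizeof(int));
    cudaMalloc(&rounds, sizeof(int));
    cudaMemset(data, 0, n * sizeof(int));
    cudaMemset(done, 0, sizeof(int));
    cudaMemset(rounds, 0, sizeof(int));

    void *args[] = {&data, &n, &done, &rounds};
    int result = -1;
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void *)workflow, dim3(blocksPerSm * smCount), dim3(threads), args, 0, 0);
    if (err == cudaSuccess && cudaDeviceSynchronize() == cudaSuccess)
        cudaMemcpy(&result, rounds, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(data);
    cudaFree(done);
    cudaFree(rounds);
    return result;
}

int main() {
    printf("rounds: %d\n", run_workflow(1024));
    return 0;
}
```

The host is involved only in the single launch and the final read-back. The trade-off in this sketch is that the grid is limited to the number of blocks that can be resident on the device at once, so foo1() has to loop over its data rather than rely on one thread per element.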