Sync threads

I have a while loop executed by each thread.
After some number of iterations, a thread exits the loop once the while condition is no longer met.
Other threads may keep running the loop.

I would like to use the threads that exited the loop to help those that are still doing the loop.

For instance, suppose 999 threads have finished their task and exited the loop, and two threads are still running the loop.
I’d like each of the remaining two threads to use the 999 idle threads to do some of its work.

Is there a way of freeing the 999 threads, i.e., terminating them, and letting each of the remaining two threads launch new threads to help finish the work?

You won’t be able to terminate threads and launch new ones this way. However, it should be possible to redistribute work on the fly from some threads to others. This is essentially load balancing, and a simple way to do it is to have a work queue that all threads pull work from. All threads keep working until the queue is empty. Atomics can help with this. I’m sure there are many possible approaches; this is just one idea.
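A minimal sketch of the work-queue idea, under some assumptions: the "queue" is just a global counter over a fixed list of independent items, and each thread claims the next unclaimed item with `atomicAdd` until the counter passes the end. The names `process_item`, `drain_queue`, and `run_work_queue`, the uneven dummy workload, and the 4x128 launch configuration are all made up for illustration; your real loop body would replace `process_item`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-item work: item i costs (i % 64) iterations, so the load is uneven.
__device__ unsigned int process_item(int i) {
    unsigned int acc = 0;
    for (int k = 0; k < i % 64; ++k) acc += 1;
    return acc;
}

// Every thread repeatedly claims the next unprocessed item index.
// A thread that finishes a cheap item immediately claims another one,
// so no thread sits idle while items remain in the queue.
__global__ void drain_queue(int *next_item, int num_items, unsigned int *out) {
    while (true) {
        int i = atomicAdd(next_item, 1);  // returns the old value: our claimed index
        if (i >= num_items) break;        // queue is empty
        out[i] = process_item(i);
    }
}

// Host helper: returns the number of wrong results (0 on success), or -1 on a CUDA error.
int run_work_queue(int num_items) {
    int *d_next = nullptr;
    unsigned int *d_out = nullptr;
    cudaMalloc(&d_next, sizeof(int));
    cudaMalloc(&d_out, num_items * sizeof(unsigned int));
    cudaMemset(d_next, 0, sizeof(int));  // queue head starts at item 0

    drain_queue<<<4, 128>>>(d_next, num_items, d_out);

    std::vector<unsigned int> h_out(num_items);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out,
                                 num_items * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_next);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < num_items; ++i)
        if (h_out[i] != static_cast<unsigned int>(i % 64)) ++mismatches;
    return mismatches;
}

int main() {
    printf("mismatches: %d\n", run_work_queue(10000));
    return 0;
}
```

This only covers the case where the work items are known up front and independent of each other. If running threads need to add new items to the queue while others are consuming it, you would need a proper concurrent queue, which is more involved.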

Can you suggest a source with some code examples that do this (for learning)?

Actually, here is an idea I am thinking of trying:
I will launch blocks with one thread each from the host. That thread will run a loop sequentially, and at each stage of the loop it will launch secondary kernels made up of child threads (this is what is referred to as CUDA dynamic parallelism). So the child threads do the parallel work for each block, while the main thread of each block does the decision making and invokes functions as secondary kernels.
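Here is a rough sketch of what I have in mind, not tested beyond a toy case. The names `parent`, `child_stage`, and `run_stages` and the "add 1 per stage" workload are placeholders I made up. It assumes a device that supports dynamic parallelism and a build with relocatable device code, something like `nvcc -rdc=true file.cu -lcudadevrt`. It relies on child kernels launched by the same thread into the same stream running in launch order. One caveat, as far as I know: recent CUDA versions no longer allow the parent to call `cudaDeviceSynchronize()` in device code, so the parent thread cannot read a child's results before deciding on the next stage. In this sketch the parent only queues the stages.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Child kernel: the parallel part of one stage, applied to one block's segment.
__global__ void child_stage(int *segment, int len) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) segment[i] += 1;  // placeholder work
}

// Parent kernel: launched with one thread per block. Each parent thread owns
// one segment of the data and queues its stages one after another.
__global__ void parent(int *data, int seg_len, int num_stages) {
    int *segment = data + blockIdx.x * seg_len;
    for (int s = 0; s < num_stages; ++s) {
        // Launches from the same thread into the same (default) stream
        // execute in order, so stage s+1 starts after stage s finishes.
        child_stage<<<(seg_len + 127) / 128, 128>>>(segment, seg_len);
    }
}

// Host helper: returns the number of wrong elements (0 on success), or -1 on a CUDA error.
int run_stages(int num_blocks, int seg_len, int num_stages) {
    int n = num_blocks * seg_len;
    int *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    parent<<<num_blocks, 1>>>(d_data, seg_len, num_stages);

    // The parent grid is not complete until all of its child grids are,
    // so this copy sees the results of every stage.
    std::vector<int> h_data(n);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_data[i] != num_stages) ++mismatches;
    return mismatches;
}

int main() {
    printf("mismatches: %d\n", run_stages(4, 1000, 3));
    return 0;
}
```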

I guess there are other ways, but I haven’t had time to think more about it.