Non-recursive backtracking in CUDA

I want to port a backtracking algorithm, written in a non-recursive manner, to CUDA.

The best way to get a well load-balanced implementation is a master/slave approach, where the master controls all the slaves.
Each slave has its own search loop and enumerates a certain part of the search tree. After enumerating all branches of its subtree,
the slave tells the master that it has finished its work and asks the master for more.
At that point, one of the busy slaves has to split its task and send part of it to the idle slave, so the search tree of the old task shrinks.
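
To make the scheme concrete, here is a minimal CPU-side sketch in C++ of what I mean. It is not my real code: N-Queens is only a stand-in problem, the names `Task`, `safe`, `split`, `enumerate` and `solve` are made up for illustration, and the "slaves" are simulated by a single sequential loop rather than by real processes or threads. A task is a partial placement (the root of a subtree), `enumerate` is the non-recursive search loop a slave would run, and `solve` plays the master, splitting a task whenever the pool holds fewer tasks than there are workers.

```cpp
#include <deque>
#include <vector>

// A task is a partial placement: cols[r] is the queen's column in row r.
// It identifies the root of one subtree of the search tree.
using Task = std::vector<int>;

// True if a queen can be placed in column c of the next free row.
static bool safe(const Task& cols, int c) {
    const int r = (int)cols.size();
    for (int i = 0; i < r; ++i) {
        const int d = cols[i] - c;
        if (d == 0 || d == r - i || d == i - r) return false;
    }
    return true;
}

// Split a task into its child subtrees (one per safe column in the next row).
static std::vector<Task> split(const Task& t, int n) {
    std::vector<Task> children;
    for (int c = 0; c < n; ++c) {
        if (safe(t, c)) {
            Task child = t;
            child.push_back(c);
            children.push_back(child);
        }
    }
    return children;
}

// Non-recursive backtracking: count the solutions in the subtree below `cols`.
static long long enumerate(Task cols, int n) {
    const int base = (int)cols.size();
    if (base == n) return 1;              // the task is already a full solution
    long long count = 0;
    int row = base;
    std::vector<int> next(n, 0);          // next[r]: next column to try in row r
    while (row >= base) {
        if (next[row] >= n) {             // row exhausted: backtrack one level
            next[row] = 0;
            --row;
            if (row >= base) cols.pop_back();
            continue;
        }
        const int c = next[row]++;
        if (!safe(cols, c)) continue;
        if (row == n - 1) { ++count; continue; }   // last row: solution found
        cols.push_back(c);                // descend into the next row
        ++row;
    }
    return count;
}

// Master loop (workers simulated sequentially): when fewer tasks remain
// than there are workers, split the current task instead of enumerating it.
long long solve(int n, int workers) {
    std::deque<Task> pool{Task{}};
    long long total = 0;
    while (!pool.empty()) {
        Task t = pool.front();
        pool.pop_front();
        if ((int)pool.size() + 1 < workers && (int)t.size() < n) {
            for (const Task& child : split(t, n)) pool.push_back(child);
            continue;
        }
        total += enumerate(t, n);
    }
    return total;
}
```

For example, `solve(8, 4)` should count the 92 solutions of the 8-queens problem, the same as `enumerate(Task{}, 8)` run on the whole tree by a single worker. In this sketch the splitting happens in the master before a task is handed out; in the scheme described above it is a busy slave that gives up part of its remaining subtree, which is the part I do not know how to express in CUDA.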

This kind of paradigm can be implemented in PVM or MPI, but is it possible in CUDA?

Thanks.