How to loop over a CUDA kernel

Hello!

I am using CuPy and have written some functions that run over very large arrays on the GPU. They are working super fast, with great GPU thread utilisation.

My program is calculating a strategy in an adversarial game, so after running the functions on the GPU I then need to switch players and run them again. Every time I switch players and re-run, the strategy gets better and converges some more. Ideally I therefore need to repeat these steps many times, either for a set number of iterations or for a set amount of time.

My problem is that, having written the fast GPU functions, I'm now stuck on how to keep running them again and again until convergence. At present I just have a Python loop for this, and that's super slow.
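Roughly, my current loop has the structure sketched below. The function name, array sizes, and convergence check are just placeholders standing in for my real code:

```python
import cupy as cp

# Placeholder for my real CuPy-based GPU functions that update
# the strategy arrays for one player (the real versions do much
# heavier work on very large arrays).
def update_strategy(strategy, player):
    return strategy * 0.9 + player * 0.1

strategy = cp.zeros(10_000_000, dtype=cp.float32)
num_iterations = 10_000
tolerance = 1e-6

# This is the slow part: a plain Python loop driving the GPU work.
for iteration in range(num_iterations):
    previous = strategy
    for player in (0, 1):              # switch players each pass
        strategy = update_strategy(strategy, player)
    # the convergence check forces a device-to-host sync every iteration
    if float(cp.max(cp.abs(strategy - previous))) < tolerance:
        break
```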

Any help would be really appreciated :-)

A function in C/C++ that calls the GPU multiple times would probably help.
Some pseudocode would also help.
I have not used CuPy, but I think you'll need some kind of Python-C++ interface rather than Python-CUDA.
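For example, the shape of that idea might look like the sketch below. This is only a sketch under assumptions: it supposes you compile your kernels and the whole iteration loop into a shared library (here called `libstrategy.so`) exposing a C entry point (here called `run_iterations`); both names are made up for illustration, and the C++ side is only outlined in comments.

```python
import ctypes
import numpy as np

# Hypothetical shared library built from your CUDA C/C++ code.
# The C/C++ side would own the device buffers and do the looping, e.g.:
#
#   extern "C" void run_iterations(float* host_strategy, size_t n,
#                                  int num_iterations) {
#       // cudaMalloc / cudaMemcpy the strategy onto the device, then
#       // for (int i = 0; i < num_iterations; ++i)
#       //     launch the kernels for both players;
#       // cudaMemcpy the converged strategy back to host_strategy
#   }
#
lib = ctypes.CDLL("./libstrategy.so")
lib.run_iterations.argtypes = [
    ctypes.POINTER(ctypes.c_float),  # strategy array (host pointer)
    ctypes.c_size_t,                 # number of elements
    ctypes.c_int,                    # number of iterations to run
]
lib.run_iterations.restype = None

strategy = np.zeros(10_000_000, dtype=np.float32)
lib.run_iterations(
    strategy.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
    strategy.size,
    10_000,
)
```

The point of pushing the loop into the compiled side is that the per-iteration kernel launches no longer pay the Python-level overhead, and the data stays on the device between iterations.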