I am creating a CPU/GPU evolutionary algorithm library. At the moment I have my fitness function evaluating on the GPU, but I would like to implement a multi-deme style EA, with the CPU and GPU each evolving their own independent population and migration happening periodically between them.
So my question is: if I need a kernel that requires continuous invocation, is it best to loop inside the CUDA kernel or in the host C code?
My intuition tells me that looping in the host code:
Could incur overhead on every iteration due to kernel launch costs
Could prevent asynchronous CPU/GPU execution
Conversely, looping on the GPU:
Would exceed the 5-second kernel watchdog limit
Come to think of it, looping on the GPU would also prevent migration, since the kernel would need to finish before cudaMemcpy could be called, right?
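To make the host-side option concrete, here is a minimal sketch of a generation loop with periodic migration. All names (`evolve_step`, `MAX_GENS`, `MIGRATE_EVERY`, `K`) are hypothetical placeholders, not from any particular library:

```cuda
// Hypothetical per-generation kernel: each thread advances one individual.
__global__ void evolve_step(float *pop, float *fitness, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // ... selection / crossover / mutation for individual i ...
    }
}

void run_gpu_deme(float *d_pop, float *d_fit, float *h_migrants, int n) {
    for (int gen = 0; gen < MAX_GENS; ++gen) {   // loop lives on the host
        evolve_step<<<(n + 255) / 256, 256>>>(d_pop, d_fit, n);
        if (gen % MIGRATE_EVERY == 0) {
            // cudaMemcpy implicitly waits for the kernel to finish,
            // so migration naturally happens between generations.
            cudaMemcpy(h_migrants, d_pop, K * sizeof(float),
                       cudaMemcpyDeviceToHost);
            // ... exchange migrants with the CPU deme ...
            cudaMemcpy(d_pop, h_migrants, K * sizeof(float),
                       cudaMemcpyHostToDevice);
        }
    }
}
```

This structure pays a kernel-launch cost per generation, but it sidesteps the watchdog limit and gives the host a natural synchronization point for migration.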
I actually did some work on simulated annealing using CUDA, and the limitations/conditions are pretty similar for both. Your assessment is pretty much correct: I had to go back to the CPU to evaluate whether or not to perform another iteration, and that was a fairly expensive operation that significantly limited the speedup I could get. Looping on the GPU was tricky, and the only way to get it working 100% correctly would have been inter-block communication, which I was not willing to play with. So looping on the CPU is probably the best way to go. I would consider trying zero-copy, because it isn't limited by cudaMemcpyAsync overlap or anything like that; it will actually do reads/writes across PCIe while the kernel is running.
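For reference, zero-copy means mapping pinned host memory into the device address space so a running kernel can read and write it directly over PCIe. A minimal sketch, assuming a device with `canMapHostMemory` support (`long_running_kernel`, `h_buf`, `K` are illustrative names):

```cuda
// Must be set before the CUDA context is created.
cudaSetDeviceFlags(cudaDeviceMapHost);

float *h_buf, *d_buf;
// Pinned, mapped host allocation visible to both CPU and GPU.
cudaHostAlloc(&h_buf, K * sizeof(float), cudaHostAllocMapped);
// Device-side alias for the same memory.
cudaHostGetDevicePointer(&d_buf, h_buf, 0);

// While this kernel runs, its accesses to d_buf go over PCIe straight
// to host memory, so the CPU can inspect or update h_buf concurrently,
// e.g. to stage migrants without waiting for the kernel to finish.
long_running_kernel<<<grid, block>>>(d_buf, K);
```

Note that every device access to a mapped buffer crosses PCIe, so this is best reserved for small, occasionally-touched data such as migration buffers or control flags, not the main population.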
(I am also keenly aware of the limitations imposed by CUDA in its current state for this kind of algorithm, and solving these problems is something we would like to do very much.)