I started learning CUDA recently, and I am writing a genetic algorithm in Python using Numba CUDA. The problem is that, according to the island model, some arrays/chromosomes migrate between subpopulations. In other words, can CUDA transfer elements from one block to another if I assign each subpopulation to its own block?
Since CUDA blocks can both write and read global memory, data transfer is not a problem.
The problem is synchronizing the transfer between two blocks, whose execution order is unknown and controlled by the scheduler, while still maintaining high performance.
Most programs will execute kernels, do some brief analysis and execute kernels again with the new data.
Like robosmith mentioned, you’re going to be using global memory to share data between blocks.
Most of the time, this sort of thing is synchronized by a new kernel launch. Each time data written by one block needs to be read by another, that's the end of your kernel launch. You write the data, end the launch, and launch again.
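The relaunch pattern can be sketched on the CPU with plain NumPy standing in for device arrays. Everything here is illustrative (the function names and ring-migration topology are my assumptions, not Numba API); in the real version `evolve_islands` would be one kernel launch per generation and `migrate` would run at the launch boundary, where all blocks are guaranteed finished:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_ISLANDS = 4      # one CUDA block per island in the real kernel
POP_PER_ISLAND = 8   # chromosomes per island
GENES = 5

# "Global memory" view: every island's population lives in one big array.
pop = rng.random((NUM_ISLANDS, POP_PER_ISLAND, GENES))

def evolve_islands(pop):
    """Stand-in for the per-block evolution kernel (mutation only here)."""
    return pop + rng.normal(0.0, 0.01, size=pop.shape)

def migrate(pop):
    """Stand-in for the migration step between launches: each island sends
    its first chromosome to the next island (hypothetical ring topology)."""
    migrants = pop[:, 0, :].copy()
    pop[:, 0, :] = np.roll(migrants, shift=1, axis=0)
    return pop

for generation in range(10):
    pop = evolve_islands(pop)  # kernel launch: islands evolve independently
    pop = migrate(pop)         # launch boundary: safe to move data across islands

print(pop.shape)  # (4, 8, 5)
```

The point of the structure is that cross-block reads only ever happen after a launch boundary, so no intra-kernel synchronization between blocks is needed.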
CUDA added grid-wide synchronization with cooperative kernel launches, which lets you make sure every block in the grid has reached the same point, similar to the block- and warp-level syncs. Combined with thread fences, which ensure all writes are also visible, this can probably get you what you want in a single kernel launch. Check out the samples ending in CG under 6_Advanced, like reductionMultiBlockCG.
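The semantics of a grid-wide sync can be mimicked on the CPU with a `threading.Barrier`, with threads standing in for blocks and the barrier standing in for the grid sync (in Numba the real thing is `numba.cuda.cg.this_grid().sync()` inside a cooperatively launched kernel). This is only an analogy to show the two-phase write-sync-read pattern, not CUDA code:

```python
import threading
import numpy as np

NUM_BLOCKS = 4
partial = np.zeros(NUM_BLOCKS)
result = np.zeros(NUM_BLOCKS)

# The barrier plays the role of grid.sync(): no thread passes until all arrive.
barrier = threading.Barrier(NUM_BLOCKS)

def block(i):
    partial[i] = i + 1          # phase 1: each "block" writes its share
    barrier.wait()              # grid.sync(): every block's write is now done
    result[i] = partial.sum()   # phase 2: safe to read all other blocks' data

threads = [threading.Thread(target=block, args=(i,)) for i in range(NUM_BLOCKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(result)  # every "block" saw the full sum: [10. 10. 10. 10.]
```

Without the barrier, a fast thread could read `partial` before a slow one has written its slot, which is exactly the cross-block race the grid sync (plus memory fences on the GPU) prevents.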