If I want to have, in each multiprocessor, some threads that calculate and one thread that organizes and exchanges data with other MPs, how could I do that? If I do it naively with branching depending on the thread number, will thread divergence catch me? Does thread divergence apply only within one warp, or will it result in all branches being executed by every thread for the rest of the kernel runtime?


I don’t think it is possible. What I do is use the number of blocks as a constant and create a global array of that size, say split_value[BLOCK_NUMBER]. Then, in each block, after the work has been done, one thread saves the required value in split_value[MY_BLOCK_VALUE]. Afterwards I call a second global function with 1 block and BLOCK_NUMBER threads, and work with those values there.
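To make the two-kernel idea concrete, here is a minimal sketch. The kernel names, THREADS_PER_BLOCK, the `total` variable and the placeholder "work" (every thread contributes 1.0f) are my own assumptions for illustration, and blockIdx.x plays the role of MY_BLOCK_VALUE. Error checking is left out.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_NUMBER 8          // number of blocks, fixed at compile time
#define THREADS_PER_BLOCK 64    // assumption, any size that fits works

__device__ float split_value[BLOCK_NUMBER];  // one slot per block
__device__ float total;                      // filled by the second kernel

// First kernel: all threads work, then one thread per block saves the result.
__global__ void work_kernel()
{
    __shared__ float partial[THREADS_PER_BLOCK];
    partial[threadIdx.x] = 1.0f;   // placeholder for the real per-thread work
    __syncthreads();

    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < THREADS_PER_BLOCK; ++i) sum += partial[i];
        split_value[blockIdx.x] = sum;   // blockIdx.x is MY_BLOCK_VALUE
    }
}

// Second kernel: 1 block, BLOCK_NUMBER threads, combines the per-block values.
__global__ void combine_kernel()
{
    __shared__ float s[BLOCK_NUMBER];
    s[threadIdx.x] = split_value[threadIdx.x];
    __syncthreads();

    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < BLOCK_NUMBER; ++i) sum += s[i];
        total = sum;
    }
}

int main()
{
    work_kernel<<<BLOCK_NUMBER, THREADS_PER_BLOCK>>>();
    // Kernels in the same stream run in order, so split_value is complete here.
    combine_kernel<<<1, BLOCK_NUMBER>>>();

    float h_total = 0.0f;
    cudaMemcpyFromSymbol(&h_total, total, sizeof(float));
    printf("total = %f\n", h_total);
    return 0;
}
```

The kernel boundary acts as the synchronization point between blocks, which is why no extra fence or flag is needed in this version.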

You can “exchange” data between blocks via the __threadfence() function. It stalls the calling thread until its writes to global memory have become visible to all blocks. Note that it is a memory fence and not a barrier: the other blocks do not wait for it. The Programming Guide has a nice example with a summation. So all threads do the work, then at some point save the data to global memory and call __threadfence(). There is no way to “send” data like you would do in MPI. The easiest way is to divide the work into multiple kernel calls.
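A simplified sketch of that summation pattern follows. It is modeled on the Programming Guide example but is not a copy of it: the names sum_kernel and blocks_done are mine, and to keep it short only thread 0 of each block does the summing, which you would not do in real code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned int blocks_done = 0;   // hypothetical counter name

__global__ void sum_kernel(const float* in, int n,
                           volatile float* partial, float* out)
{
    __shared__ bool is_last_block;

    if (threadIdx.x == 0) {
        // Serial sum of this block's slice (kept simple on purpose).
        int chunk = (n + gridDim.x - 1) / gridDim.x;
        int begin = blockIdx.x * chunk;
        int end   = min(begin + chunk, n);
        float s = 0.0f;
        for (int i = begin; i < end; ++i) s += in[i];

        partial[blockIdx.x] = s;
        __threadfence();   // make the partial sum visible before taking a ticket

        // atomicInc returns the old value; the last block gets gridDim.x - 1.
        unsigned int ticket = atomicInc(&blocks_done, gridDim.x);
        is_last_block = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    // Only the block that finished last combines the partial sums.
    if (is_last_block && threadIdx.x == 0) {
        float total = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b) total += partial[b];
        *out = total;
        blocks_done = 0;   // reset so the kernel can be launched again
    }
}

int main()
{
    const int n = 1024, blocks = 4, threads = 32;
    float h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_partial, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    sum_kernel<<<blocks, threads>>>(d_in, n, d_partial, d_out);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_out);

    cudaFree(d_in); cudaFree(d_partial); cudaFree(d_out);
    return 0;
}
```

The partial array is declared volatile so that the last block really reads the values from memory rather than from a stale cached copy.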

What I mean is: if I use the “one thread in each block does X, the rest do Y” approach, will all of the code actually be executed by all threads for the runtime of the whole kernel because of thread divergence?

Lock-step execution is at warp level. If you can ensure that there is no divergence within any single warp, you don’t have thread divergence at all.
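As a small sketch of what "no divergence within one warp" can look like: branch on the warp index instead of the thread index, so all 32 threads of a warp take the same path. The kernel name role_kernel and the role codes 1 and 2 are made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void role_kernel(int* role)
{
    int tid     = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = threadIdx.x / warpSize;   // same value for a whole warp

    if (warp_id == 0) {
        role[tid] = 1;   // "management" path, taken by all of warp 0
    } else {
        role[tid] = 2;   // "compute" path, taken by all other warps
    }
    // By contrast, `if (threadIdx.x == 0)` would split warp 0 into two paths,
    // and only that one warp would execute both of them one after the other.
}

int main()
{
    const int blocks = 2, threads = 128, n = blocks * threads;
    int* d_role;
    cudaMalloc(&d_role, n * sizeof(int));

    role_kernel<<<blocks, threads>>>(d_role);

    int h_role[n];
    cudaMemcpy(h_role, d_role, n * sizeof(int), cudaMemcpyDeviceToHost);

    int managers = 0;
    for (int i = 0; i < n; ++i) managers += (h_role[i] == 1);
    printf("threads on the management path: %d of %d\n", managers, n);

    cudaFree(d_role);
    return 0;
}
```

The price is that a full warp is spent on the management role even if one thread's worth of work would do.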

So, the answer to your question is that you can’t. If the size of the warp is 32, then maybe you could use blocks of 1024 threads (if your architecture allows it) and have each of the first 32 threads administer one group of 32 threads inside the same block, but it sounds messy.

I have a Monte Carlo code in which, for each pair of particles, I calculate the distance and then apply the condition if(r<rcut) {calculate energy}. A branch like this can be used, but with some penalty in execution time.
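Roughly, such a kernel could look like the sketch below. This is not my actual code: the kernel name pair_energy, the one-thread-per-particle layout and the Lennard-Jones form in reduced units are assumptions chosen just to show where the branch sits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per particle i; loops over all other particles j.
__global__ void pair_energy(const float3* pos, int n, float rcut, float* energy)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float e = 0.0f;
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pos[i].x - pos[j].x;
        float dy = pos[i].y - pos[j].y;
        float dz = pos[i].z - pos[j].z;
        float r2 = dx * dx + dy * dy + dz * dz;

        // Threads of a warp may disagree here; the warp then runs the body
        // with the threads that failed the test masked off.
        if (r2 < rcut * rcut) {
            float inv6 = 1.0f / (r2 * r2 * r2);
            e += 4.0f * (inv6 * inv6 - inv6);   // assumed Lennard-Jones energy
        }
    }
    energy[i] = e;
}

int main()
{
    const int n = 3;
    float3 h_pos[n] = { make_float3(0.0f, 0.0f, 0.0f),
                        make_float3(1.0f, 0.0f, 0.0f),
                        make_float3(5.0f, 0.0f, 0.0f) };
    float3* d_pos;
    float*  d_energy;
    cudaMalloc(&d_pos, n * sizeof(float3));
    cudaMalloc(&d_energy, n * sizeof(float));
    cudaMemcpy(d_pos, h_pos, n * sizeof(float3), cudaMemcpyHostToDevice);

    pair_energy<<<1, 32>>>(d_pos, n, 2.5f, d_energy);

    float h_energy[n];
    cudaMemcpy(h_energy, d_energy, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("E[%d] = %f\n", i, h_energy[i]);

    cudaFree(d_pos); cudaFree(d_energy);
    return 0;
}
```

The divergence is local to the if body, so the threads of the warp run together again as soon as the branch ends.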

Can you give more details about the algorithm you are trying to implement?

I now see that such a management thread wouldn’t increase performance.

I want to program an evolutionary algorithm: each thread should simulate one world in which one candidate’s fitness is evaluated. Then the system looks at which program has the best fitness and promotes it.
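One possible shape for this, as a rough sketch only: one kernel in which each thread evaluates one candidate, followed by a small single-block kernel that picks the best one. The toy fitness function, the names evaluate_kernel and best_kernel, and the fixed population size N_CANDIDATES are all placeholders I made up.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N_CANDIDATES 256   // assumed population size, a power of two

// Placeholder fitness: stands in for simulating one world per candidate.
__device__ float evaluate(int candidate)
{
    float x = (float)candidate;
    return -(x - 100.0f) * (x - 100.0f);   // best candidate is number 100
}

// One thread per candidate.
__global__ void evaluate_kernel(float* fitness, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) fitness[i] = evaluate(i);
}

// One block of N_CANDIDATES threads: shared-memory argmax reduction.
__global__ void best_kernel(const float* fitness, int* best)
{
    __shared__ float val[N_CANDIDATES];
    __shared__ int   idx[N_CANDIDATES];
    int t = threadIdx.x;
    val[t] = fitness[t];
    idx[t] = t;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s && val[t + s] > val[t]) {
            val[t] = val[t + s];
            idx[t] = idx[t + s];
        }
        __syncthreads();
    }
    if (t == 0) *best = idx[0];
}

int main()
{
    float* d_fitness;
    int*   d_best;
    cudaMalloc(&d_fitness, N_CANDIDATES * sizeof(float));
    cudaMalloc(&d_best, sizeof(int));

    evaluate_kernel<<<N_CANDIDATES / 64, 64>>>(d_fitness, N_CANDIDATES);
    best_kernel<<<1, N_CANDIDATES>>>(d_fitness, d_best);

    int h_best = -1;
    cudaMemcpy(&h_best, d_best, sizeof(int), cudaMemcpyDeviceToHost);
    printf("best candidate: %d\n", h_best);

    cudaFree(d_fitness); cudaFree(d_best);
    return 0;
}
```

Splitting evaluation and selection into two kernels avoids the need for a management thread inside each block.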