If I want to have, in each multiprocessor, some threads that do the calculation and one thread that organizes and exchanges data with the other MPs, how could I do that? If I do it naively, with branching depending on the thread number, will thread divergence catch me? Does thread divergence apply only within one warp, or will every thread end up executing all branches for the rest of the kernel's runtime?
I don’t think it is possible. What I do is use the number of blocks as a constant and create a global array of that size, say split_value[BLOCK_NUMBER]. Then, in each block, after the work has been done, one thread saves the required value in split_value[MY_BLOCK_VALUE]. Afterwards, launch a global function with 1 block and BLOCK_NUMBER threads, and work with that.
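A minimal sketch of that two-kernel pattern, using a per-block sum as the stand-in "work" (the names split_value / BLOCK_NUMBER are from the post; THREADS_PER_BLOCK and the sum itself are only illustrative):

```cuda
#define BLOCK_NUMBER 64
#define THREADS_PER_BLOCK 256

__device__ float split_value[BLOCK_NUMBER];   // one slot per block

// First launch: BLOCK_NUMBER blocks, each reduces its slice of `in`.
__global__ void work_kernel(const float *in)
{
    __shared__ float partial[THREADS_PER_BLOCK];
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // in-block tree reduction
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // one thread per block publishes its block's result
    if (tid == 0) split_value[blockIdx.x] = partial[0];
}

// Second launch: 1 block with BLOCK_NUMBER threads combines the per-block values.
__global__ void combine_kernel(float *out)
{
    __shared__ float partial[BLOCK_NUMBER];
    int tid = threadIdx.x;
    partial[tid] = split_value[tid];
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) *out = partial[0];
}
```

On the host you would launch `work_kernel<<<BLOCK_NUMBER, THREADS_PER_BLOCK>>>(d_in);` followed by `combine_kernel<<<1, BLOCK_NUMBER>>>(d_out);` — the kernel boundary itself is the device-wide synchronization point between the two phases.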
You can “exchange” data between blocks via the __threadfence() function, which stalls the calling thread until its writes to global memory become visible to all other blocks. The Programming Guide has a nice example with summation: all threads do the work, then at some point save the data to global memory and call __threadfence(). There is no way to “send” data like you would in MPI; the easiest way is to divide the work into multiple kernel calls.
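A sketch of the single-kernel summation pattern along the lines of the Programming Guide's threadfence example: every block writes its partial sum to global memory, fences, and the last block to finish (detected with an atomic counter) combines all partials. GRID_SIZE / BLOCK_SIZE and the input layout are assumptions for illustration.

```cuda
#define GRID_SIZE 64            // assumed launch configuration
#define BLOCK_SIZE 256          // input assumed to hold GRID_SIZE*BLOCK_SIZE floats

__device__ unsigned int count = 0;          // how many blocks have finished
__device__ float partial_sums[GRID_SIZE];   // one partial result per block

// helper: tree reduction within one block; result valid in thread 0
__device__ float block_reduce(float v)
{
    __shared__ float sdata[BLOCK_SIZE];
    sdata[threadIdx.x] = v;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    return sdata[0];
}

__global__ void sum_kernel(const float *in, float *out)
{
    __shared__ bool is_last_block;

    float block_sum = block_reduce(in[blockIdx.x * blockDim.x + threadIdx.x]);

    if (threadIdx.x == 0) {
        partial_sums[blockIdx.x] = block_sum;
        __threadfence();   // publish the partial sum BEFORE signalling completion
        unsigned int done = atomicInc(&count, gridDim.x);
        is_last_block = (done == gridDim.x - 1);
    }
    __syncthreads();

    if (is_last_block) {               // uniform for the whole block
        float v = 0.0f;
        for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
            v += partial_sums[i];
        v = block_reduce(v);
        if (threadIdx.x == 0) { *out = v; count = 0; }  // reset for next launch
    }
}
```

The __threadfence() call is what guarantees that, by the time the last block sees the counter reach gridDim.x - 1, every block's write to partial_sums is visible to it.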
So, the answer to your question is that you can’t. Since the warp size is 32, you could perhaps work with blocks of 1024 threads (if your architecture allows that) and use the first 32 threads to administrate the other groups of 32 threads inside the same block, but it sounds messy.
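On the divergence part of the question: divergence only happens *within* a warp, so if you branch on the warp index rather than the raw thread index, every thread of a given warp takes the same path and there is no serialization. A hedged sketch of that idea (the roles and names are purely illustrative):

```cuda
#define WARP_SIZE 32

__global__ void specialized_kernel(float *data)
{
    int warp_id = threadIdx.x / WARP_SIZE;   // same value for all 32 lanes of a warp

    if (warp_id == 0) {
        // "organizer" warp: the whole warp takes this branch together,
        // e.g. staging data in shared memory for the other warps
    } else {
        // "worker" warps: also uniform per warp, so no divergence penalty
        // ... compute on data ...
    }

    // NOTE: any __syncthreads() must be reached by ALL threads of the block,
    // so keep block-wide barriers outside the divergent regions.
    __syncthreads();
}
```

The branches are still *scheduled* on the same SM, of course; what this avoids is the per-warp serialization of both paths, not the sharing of the SM's resources.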