I was just wondering if anyone has encountered a programming application or algorithm where they really wanted/needed some synchronization between blocks/multiprocessors. In my current application, I basically have a single block per multiprocessor on my device. However, sometimes I need the outputs of one of these blocks to provide the inputs to another block. In my application, I am basically running until all the blocks converge to a stable state before exiting the kernel, which means my blocks will evaluate their inputs an undetermined number of times.
Because of the dependency between blocks, initially I just used multiple kernel calls (first the lowest tier of inputs executes on the device and the kernel exits; the next tier up then uses those outputs as its inputs, and so on). However, since I am running to convergence, this can mean a lot of kernel-call overhead. So instead, I have created a work-queue-like structure: when block A finishes its output, it schedules block B, which uses block A's outputs as its inputs. This works using atomic primitives, so long as you don't have more blocks than multiprocessors.
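Since I don't have the original code, here is a minimal sketch of what such an atomic work queue might look like. Everything here (Task, ready[], process_task, the chain of successors) is illustrative rather than the poster's actual implementation, and it assumes (a) no more blocks are launched than there are multiprocessors, so all blocks are co-resident and can safely spin, and (b) the total number of task executions is known in advance (max_tasks), so every claimed slot is eventually filled; detecting convergence/quiescence dynamically would need a more careful termination protocol than shown here.

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical task record: each task knows which task consumes its output.
struct Task {
    int id;         // which piece of work this is
    int successor;  // index of the task that consumes our output, or -1
};

__device__ int d_head;  // next queue slot a block will claim
__device__ int d_tail;  // next free slot for publishing new work

__device__ void process_task(const Task& t, float* data)
{
    // Application-specific work; placeholder: count how often this task ran.
    if (threadIdx.x == 0) data[t.id] += 1.0f;
}

// Persistent-block scheduler: launch no more blocks than multiprocessors.
__global__ void scheduler(const Task* tasks, int* queue, int* ready,
                          float* data, int max_tasks)
{
    __shared__ int my_slot;
    while (true) {
        if (threadIdx.x == 0)
            my_slot = atomicAdd(&d_head, 1);    // claim the next slot
        __syncthreads();
        if (my_slot >= max_tasks)
            return;                             // nothing left to claim

        // Spin until a producer has published a task into our slot.
        if (threadIdx.x == 0)
            while (atomicAdd(&ready[my_slot], 0) == 0) { /* busy-wait */ }
        __syncthreads();

        Task t = tasks[queue[my_slot]];
        process_task(t, data);
        __threadfence();                        // outputs visible device-wide

        // Block A finished: schedule block B's work, if any.
        if (threadIdx.x == 0 && t.successor >= 0) {
            int slot = atomicAdd(&d_tail, 1);   // reserve a queue slot
            queue[slot] = t.successor;
            __threadfence();                    // entry visible before flag
            atomicExch(&ready[slot], 1);        // publish: slot consumable
        }
    }
}

int main()
{
    const int max_tasks = 4;                    // chain: 0 -> 1 -> 2 -> 3
    Task h_tasks[max_tasks] = {{0, 1}, {1, 2}, {2, 3}, {3, -1}};

    Task* tasks;  int* queue;  int* ready;  float* data;
    cudaMalloc(&tasks, sizeof(h_tasks));
    cudaMalloc(&queue, max_tasks * sizeof(int));
    cudaMalloc(&ready, max_tasks * sizeof(int));
    cudaMalloc(&data,  max_tasks * sizeof(float));
    cudaMemcpy(tasks, h_tasks, sizeof(h_tasks), cudaMemcpyHostToDevice);
    cudaMemset(ready, 0, max_tasks * sizeof(int));
    cudaMemset(data,  0, max_tasks * sizeof(float));

    // Seed the queue with task 0 and mark its slot ready.
    int zero = 0, one = 1;
    cudaMemcpy(queue, &zero, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(ready, &one,  sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_head, &zero, sizeof(int));
    cudaMemcpyToSymbol(d_tail, &one,  sizeof(int)); // one task published

    scheduler<<<2, 64>>>(tasks, queue, ready, data, max_tasks);
    cudaDeviceSynchronize();

    float h_data[max_tasks];
    cudaMemcpy(h_data, data, sizeof(h_data), cudaMemcpyDeviceToHost);
    for (int i = 0; i < max_tasks; ++i)
        printf("task %d ran %.0f time(s)\n", i, h_data[i]);
    return 0;
}
```

The two __threadfence() calls are the important part: the first makes a finished task's outputs visible device-wide before its successor is published, and the second makes sure the queue entry is written before the ready flag that consumers spin on.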
I guess I am interested in whether anyone else can think of some simple applications/algorithms that need to run an undetermined number of times to convergence. Basically, I would like to see if this kind of scheduling queue using atomic primitives can be equally beneficial to some application other than my own. Any ideas would be greatly appreciated! Thanks!
In my opinion you are looking at the problem from the wrong side. In chapter 4 of the programming guide you can read that the architecture of CUDA-capable devices is basically SIMD (single instruction, multiple data). This means that for optimum performance all the launched threads must execute roughly the same code, but on different data.
So try splitting your problem in another way: partition your input data so that each thread processes only one part of it. You can place a loop in the kernel code that re-iterates the processing until you reach the convergence you want. In this manner you launch one thread per block of input data, and that thread operates on its data until it reaches convergence. A minimal sketch of this pattern follows.
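As a concrete (and deliberately trivial) illustration of that suggestion, here is a sketch where each thread owns one element and loops locally until its own value stabilizes. The update rule and the EPS tolerance are placeholders I made up for illustration; a real solver whose elements depend on their neighbors would need grid-wide iteration between updates rather than this purely independent per-thread loop.

```
// Each thread iterates its own element to convergence inside the kernel,
// so the host launches the kernel only once instead of once per sweep.
__global__ void iterate_to_convergence(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const float EPS = 1e-6f;                 // convergence tolerance (assumed)
    float x = data[i];
    float change;
    do {
        float next = 0.5f * (x + 1.0f / x);  // placeholder update rule
        change = fabsf(next - x);
        x = next;
    } while (change > EPS);                  // loop until this element is stable
    data[i] = x;
}
```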
I hope you understand me; I'm not a native English speaker and I'm also a CUDA newbie.