I have an application that will launch 100 thread blocks with 12 threads in each block.
The main computation in each thread block is a reduction, followed by a final reduction over the single result produced by each block.
So, at the first level there is a reduction over an array of size 12 (one per thread block), and at the second level there is a reduction over an array of size 100.
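To make the two levels concrete, here is a minimal CPU-side sketch in C++ of the computation I have in mind. It is only a reference for the data flow, not GPU code. It assumes the reduction operator is a sum and the data are floats, neither of which is fixed above, and the function names `reduce_per_block` and `reduce_final` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sizes taken from the description above.
constexpr std::size_t kNumBlocks = 100;
constexpr std::size_t kThreadsPerBlock = 12;

// Level 1: reduce each block's 12 values to a single partial result.
// Assumption: the reduction operator is a sum over floats.
std::vector<float> reduce_per_block(const std::vector<float>& data) {
    assert(data.size() == kNumBlocks * kThreadsPerBlock);
    std::vector<float> partial(kNumBlocks);
    for (std::size_t b = 0; b < kNumBlocks; ++b) {
        auto first = data.begin() + b * kThreadsPerBlock;
        partial[b] = std::accumulate(first, first + kThreadsPerBlock, 0.0f);
    }
    return partial;
}

// Level 2: reduce the 100 partial results to one final value.
float reduce_final(const std::vector<float>& partial) {
    return std::accumulate(partial.begin(), partial.end(), 0.0f);
}
```

Calling `reduce_final(reduce_per_block(data))` on a vector of 100 × 12 = 1,200 values gives the final result, so the whole problem reads only 1,200 inputs.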
I doubt this application would be a good match for a GPU. Any comments or suggestions are very much appreciated.