I have an application that will have 100 thread blocks, and 12 threads in each thread block.
The main computation in each thread block is reduction, and there will be a final reduction over the single result of each thread block.
So, at the first level there will be a reduction over an array of size 12 (per thread block), and at the second level there will be a reduction over an array of size 100.
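Roughly, the structure I have in mind looks like this (just a simplified sketch; the names blockReduce, d_in and d_partial are placeholders, and I'm assuming the 100 x 12 inputs sit in one contiguous array):

```cuda
#include <cuda_runtime.h>

// First level: each of the 100 blocks reduces its 12 elements to one value.
__global__ void blockReduce(const float *in, float *partial, int n)
{
    __shared__ float s[16];                   // next power of two above 12
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;  // blockDim.x == 12 here
    s[tid] = (i < n) ? in[i] : 0.0f;
    if (tid < 16 - blockDim.x)                // zero-pad slots 12..15
        s[tid + blockDim.x] = 0.0f;
    __syncthreads();

    // Power-of-two tree reduction over the padded 16 slots.
    for (int stride = 16 / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = s[0];           // one result per block
}

// Launch: blockReduce<<<100, 12>>>(d_in, d_partial, 1200);
```

The 100 per-block results would then be reduced by a second, single-block launch of the same pattern, or simply copied back and summed on the host.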
I doubt this application would be a good match for a GPU. Any comments or suggestions are very much appreciated.
Not enough threads in a block.
A block of 12 threads still consumes the same hardware resources as a block of 32 threads, since the hardware schedules threads in warps of 32. It can actually be worse, because a multiprocessor can hold at most 8 blocks, so small blocks also cap the number of resident threads.
If I were you, I would try reorganising the work into blocks of 64 or even 128 threads.
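For example, something along these lines (just a sketch, assuming the 100 x 12 = 1200 inputs are in one contiguous array; reduce128, d_in and d_partial are placeholder names):

```cuda
#include <cuda_runtime.h>

#define BLOCK 128   // a multiple of the 32-thread warp size

__global__ void reduce128(const float *in, float *partial, int n)
{
    __shared__ float s[BLOCK];
    int tid = threadIdx.x;

    // Each thread accumulates several elements so the grid can stay small.
    float sum = 0.0f;
    for (int i = blockIdx.x * BLOCK + tid; i < n; i += gridDim.x * BLOCK)
        sum += in[i];
    s[tid] = sum;
    __syncthreads();

    // Standard power-of-two shared-memory tree reduction.
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = s[0];   // one value per block, reduced again afterwards
}

// Launch (hypothetical): reduce128<<<10, BLOCK>>>(d_in, d_partial, 1200);
```

With 10 blocks of 128 threads every warp is full, and the 10 partial results are cheap to finish off in a second pass or on the host.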
Since I have 100 blocks, a maximum of 8 blocks will be assigned to each SM at any given time. So, there should be enough warps to hide memory latency.
The point is, we don't have the best utilization of the hardware. But does this mean the GPU will fail for this application compared to the CPU?
Not necessarily. It does, however, mean that you will be sacrificing an awful lot of potential performance from the GPU by doing so.
Current CUDA-capable GPUs schedule threads in groups ("warps" in CUDA speak) of 32 threads. Ideally you want the threads per block to be a multiple of 32 for scheduling efficiency reasons. Otherwise, many processor cores are likely to sit idle and the ability to hide global memory latency during kernel execution will be greatly reduced.
How much computation is needed to reduce the arrays? If it’s just some multiplication/addition and removing empty entries, then the CPU could probably do it in the time it takes to send the data over the PCIe bus in the first place!
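To put rough numbers on it (order-of-magnitude figures, they vary by system): 100 x 12 single-precision values is only 4800 bytes, so the transfer itself is negligible, but each host-to-device copy and each kernel launch typically carries several microseconds of fixed overhead, while a single CPU core can sum 1200 floats in well under a microsecond. Unless far more work is done per element than a plain reduction, the PCIe round trip and launch overhead will dominate the whole operation.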