I have an application that will have 100 thread blocks, and 12 threads in each thread block.
The main computation in each thread block is reduction, and there will be a final reduction over the single result of each thread block.
So, at the first level there will be a reduction over an array of size 12 (per thread block), and at the second level there will be a reduction over an array of size 100.
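Roughly, the structure I have in mind looks like this (just a simplified sketch; the names blockReduce, d_in and d_partial are placeholders, and I'm assuming the 100 x 12 inputs sit in one contiguous array):

```cuda
#include <cuda_runtime.h>

// First level: each of the 100 blocks reduces its 12 elements to one value.
__global__ void blockReduce(const float *in, float *partial, int n)
{
    __shared__ float s[16];                   // next power of two above 12
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;  // blockDim.x == 12 here
    s[tid] = (i < n) ? in[i] : 0.0f;
    if (tid < 16 - blockDim.x)                // zero-pad slots 12..15
        s[tid + blockDim.x] = 0.0f;
    __syncthreads();

    // Power-of-two tree reduction over the padded 16 slots.
    for (int stride = 16 / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = s[0];           // one result per block
}

// Launch: blockReduce<<<100, 12>>>(d_in, d_partial, 1200);
```

The 100 per-block results would then be reduced by a second, single-block launch of the same pattern, or simply copied back and summed on the host.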
I doubt this application would be a good match for a GPU. Any comments or suggestions are very much appreciated.
Not enough threads in a block.
A block of 12 threads still consumes the same hardware resources as a block of 32 threads, since the hardware schedules threads in warps of 32. It can actually be worse, because a multiprocessor can hold at most 8 blocks, so small blocks also cap the number of resident threads.
If I were you, I would try reorganising the work into blocks of 64 or even 128 threads.
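For example, something along these lines (just a sketch, assuming the 100 x 12 = 1200 inputs are in one contiguous array; reduce128, d_in and d_partial are placeholder names):

```cuda
#include <cuda_runtime.h>

#define BLOCK 128   // a multiple of the 32-thread warp size

__global__ void reduce128(const float *in, float *partial, int n)
{
    __shared__ float s[BLOCK];
    int tid = threadIdx.x;

    // Each thread accumulates several elements so the grid can stay small.
    float sum = 0.0f;
    for (int i = blockIdx.x * BLOCK + tid; i < n; i += gridDim.x * BLOCK)
        sum += in[i];
    s[tid] = sum;
    __syncthreads();

    // Standard power-of-two shared-memory tree reduction.
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = s[0];   // one value per block, reduced again afterwards
}

// Launch (hypothetical): reduce128<<<10, BLOCK>>>(d_in, d_partial, 1200);
```

With 10 blocks of 128 threads every warp is full, and the 10 partial results are cheap to finish off in a second pass or on the host.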
Since I have 100 blocks, a maximum of 8 blocks will be assigned to each SM at any given time. So, there should be enough warps to hide memory latency.
The point is, we don't have the best utilization of the hardware. But does this mean the GPU will fail for this application compared to the CPU?
Not necessarily. It does, however, mean that you will be sacrificing an awful lot of potential performance from the GPU by doing so.
Current CUDA-capable GPUs schedule threads in groups ("warps" in CUDA speak) of 32 threads. Ideally you want the threads per block to be a multiple of 32 for scheduling efficiency reasons. Otherwise, many processor cores are likely to sit idle and the ability to hide global memory latency during kernel execution will be greatly reduced.
How much computation is needed to reduce the arrays? If it’s just some multiplication/addition and removing empty entries, then the CPU could probably do it in the time it takes to send the data over the PCIe bus in the first place!
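To put rough numbers on it (order-of-magnitude figures, they vary by system): 100 x 12 single-precision values is only 4800 bytes, so the transfer itself is negligible, but each host-to-device copy and each kernel launch typically carries several microseconds of fixed overhead, while a single CPU core can sum 1200 floats in well under a microsecond. Unless far more work is done per element than a plain reduction, the PCIe round trip and launch overhead will dominate the whole operation.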