Arithmetic intensity - How low do you go

Christen (CSRD, 26(3):205-210)
hints that the ratio of calculation performed to data moved should
be 4 - 64 FLOP per data item. I have a problem where the serial algorithm
has a ratio of less than 1, and the data items are integers rather than floats.

Do people use Arithmetic intensity?
What values are you seeing?
Does the range 4 to 64 FLOP/TDE make sense to you?
How low can the ratio be and it still make sense to use a GPU?
Does integer v. float make any difference?

All comments and data welcome

The arithmetic intensity argument is a pretty simple one, and its basis is pretty straightforward:

If you have a hypothetical device with 100 GB/s of memory bandwidth, then it can move at most 25 Gfloats/s (at 4 bytes per single precision float). If the device had 625 GFLOP/s of peak single precision arithmetic throughput (that is pretty roughly a GT200 type device, FWIW), then any code which achieved both peak arithmetic and peak memory throughput would have to be performing 25 FLOP per float moved. If the code performs fewer than that, it will be memory bandwidth bound. The argument holds for integers as well (although the peak throughput numbers might be different for different integer operations compared to floating point).
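To make that arithmetic concrete, here is a minimal sketch in Python. It takes the bandwidth figure as 100 gigabytes per second and assumes 4-byte elements; the function names `balance_point` and `limiting_resource` are made up for illustration and are not from any library.

```python
def balance_point(peak_ops_per_s, bandwidth_bytes_per_s, bytes_per_element=4):
    """Operations per element needed to saturate arithmetic and memory together."""
    elements_per_s = bandwidth_bytes_per_s / bytes_per_element
    return peak_ops_per_s / elements_per_s


def limiting_resource(ops_per_element, peak_ops_per_s, bandwidth_bytes_per_s,
                      bytes_per_element=4):
    """Return "memory" if the code sits below the balance point, else "compute"."""
    threshold = balance_point(peak_ops_per_s, bandwidth_bytes_per_s,
                              bytes_per_element)
    return "memory" if ops_per_element < threshold else "compute"


# Hypothetical device from the post: 625 GFLOP/s peak, 100 GB/s bandwidth.
PEAK = 625e9
BANDWIDTH = 100e9

print(balance_point(PEAK, BANDWIDTH))         # 25.0 FLOP per float moved
print(limiting_resource(1, PEAK, BANDWIDTH))  # memory
print(limiting_resource(64, PEAK, BANDWIDTH)) # compute
```

For integer work the same function applies, but you would plug in whatever peak integer throughput the device actually has for the operations in question.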

Is having a low arithmetic intensity a bad thing? It probably depends. If an optimally implemented operation is memory bandwidth bound, it means the potential speed up compared to another type of device will be (roughly) bounded by the ratio of the two devices' memory bandwidth. In most cases that ratio is still about 5-10 times in favour of the GPU. There are an awful lot of problems where a 5-10 times speed up is something worth expending considerable time and effort to achieve. In my work, sparse matrix operations fall into that category. But there are probably also a lot of cases where it might not be enough to justify the GPU. Only you can answer the question for whatever your particular application is.
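A rough way to see that bound is a roofline-style estimate: attainable throughput is the smaller of the arithmetic peak and intensity times bandwidth. The sketch below assumes that simple model. The GPU figures are the hypothetical ones from earlier in the thread, and the CPU figures (50 GFLOP/s, 20 GB/s) are invented purely for illustration, not measurements of any real part.

```python
def attainable_ops_per_s(ops_per_byte, peak_ops_per_s, bandwidth_bytes_per_s):
    """Simple roofline: capped by arithmetic peak or by memory traffic."""
    return min(peak_ops_per_s, ops_per_byte * bandwidth_bytes_per_s)


def speedup_bound(ops_per_byte, fast_device, slow_device):
    """Upper bound on speed up of fast_device over slow_device at this intensity."""
    fast = attainable_ops_per_s(ops_per_byte, *fast_device)
    slow = attainable_ops_per_s(ops_per_byte, *slow_device)
    return fast / slow


# (peak ops/s, bandwidth bytes/s); both devices are hypothetical.
GPU = (625e9, 100e9)
CPU = (50e9, 20e9)  # assumed numbers, for illustration only

# One operation per 4-byte element: both devices are bandwidth bound,
# so the bound is just the bandwidth ratio.
print(speedup_bound(0.25, GPU, CPU))   # 5.0

# Very high intensity: both devices are compute bound,
# so the bound is the ratio of arithmetic peaks.
print(speedup_bound(100.0, GPU, CPU))  # 12.5
```

With these made-up numbers, a low-intensity code still gets the full bandwidth ratio, which is the point being made above.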

It depends on whether the input data resides on the host or the device.

I usually analyze using the same metric, but expressed as FLOP per byte instead. But AI isn't always everything (unless you're trying to hide a specific bandwidth bottleneck), since GPUs have huge bandwidth advantages as well.

In other words, problems with very low AI can be worthwhile depending on the big picture of what you are doing. For example:

Reduction sum: I wrote code that could perform a reduction sum at around 150 GB/s on my GTX480, which is way faster than what I could ever do on my CPU.

My problem was 4096^2 elements * 4 bytes/element = 0.0671 GB, and 0.0671 GB / 150 GB/s = about 0.45 ms to process.

However, transferring that amount of data from host to device took me roughly 11.1 ms, which was enough time for the CPU to process the data anyway. Hence it was only very worthwhile (~20x faster) if my data already resided in GPU RAM as part of a processing chain.
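The numbers in this example can be checked with a few lines of Python. This is just the arithmetic from the post, taking 1 GB as 10^9 bytes; the helper name `time_ms` is made up, and the implied transfer bandwidth is derived from the quoted 11.1 ms rather than measured separately.

```python
def time_ms(n_bytes, bandwidth_bytes_per_s):
    """Time in milliseconds to stream n_bytes at the given bandwidth."""
    return n_bytes / bandwidth_bytes_per_s * 1e3


n_elements = 4096 ** 2
bytes_per_element = 4
total_bytes = n_elements * bytes_per_element    # 67,108,864 bytes, about 0.0671 GB

kernel_ms = time_ms(total_bytes, 150e9)         # reduction at 150 GB/s
transfer_ms = 11.1                              # host-to-device time quoted in the post

# Bandwidth implied by the quoted transfer time, in bytes per second.
implied_transfer_bw = total_bytes / (transfer_ms * 1e-3)

print(f"kernel:   {kernel_ms:.3f} ms")          # about 0.447 ms
print(f"transfer: {transfer_ms:.1f} ms")
print(f"implied transfer bandwidth: {implied_transfer_bw / 1e9:.2f} GB/s")
print(f"transfer / kernel: {transfer_ms / kernel_ms:.1f}x")
print(f"total if data starts on host: {kernel_ms + transfer_ms:.2f} ms")
```

The transfer takes roughly 25 times longer than the reduction itself, so the kernel's speed only pays off when the data is already sitting on the device.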