16-bit int multiplication using SIMD / mixed precision

I am writing code where each thread has to evaluate simple integer formulas such as unsigned short result = x * y + z - v * w;

All of the input variables, as well as the result, fit in a 16-bit integer. I have found that there are SIMD instructions for integers which include addition and subtraction but lack multiplication. I have also read about mixed precision, where you can do a dot product of multiple lower-precision integers.

Is it somehow possible to do 2 multiplications simultaneously, like you can with a SIMD addition? This would mean I can calculate 2 results with the formula at the same time, thus increasing performance.

I will be running the code on a V100.

You may want to study this table.

32-bit integer arithmetic instructions (which is what you will get without doing anything special) will likely be the fastest path for general 16-bit integer add/subtract/multiply arithmetic.

- The SIMD instructions won't be faster, and they bring noticeable code complexity.
- 32-bit float has the same throughput as 32-bit int (it could be an option for 16-bit work, for example if your code is integer bound or exhibits significant integer pipe pressure).
- Tensor Cores on the V100 offer no suitable path (16-bit float support only).

If you can get down to 8-bit arithmetic, there is a dp4a instruction that runs at 4x the FP32 rate (when counting ops/s rather than instructions/s). But this doesn't help much with general arithmetic.
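For illustration, here is a minimal sketch of what using dp4a looks like via the __dp4a intrinsic (the kernel and array names are made up); each operand packs four signed 8-bit values into a 32-bit word, and one instruction computes the 4-element dot product plus an accumulator:

```cpp
// Minimal sketch (hypothetical names): one __dp4a per thread computes a
// dot product of four packed signed bytes and adds it to an accumulator.
__global__ void dp4a_demo(const int *a, const int *b, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // a[i] and b[i] each hold four signed 8-bit lanes packed into 32 bits.
        out[i] = __dp4a(a[i], b[i], 0);   // (a0*b0 + a1*b1 + a2*b2 + a3*b3) + 0
    }
}
```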


While the sample computation requires just two 32-bit IMAD instructions, depending on context this code will incur overhead for 16-bit to 32-bit zero extension operations. You would want to take a closer look at the generated code with cuobjdump --dump-sass. If you find lots of conversion overhead, consider using 32-bit integers instead.

C++ expression evaluation semantics require that all variables of integer types narrower than int get widened to int first. Obviously compilers can deviate from that under the “as-if” optimization rule (generated code behaves as if it is following the abstract execution model exactly) as long as the hardware offers instructions operating on narrow integer types.

For the most part GPUs do not offer such instructions. In general, all integer data wants to be int (or int32_t) unless there is a very good reason for it to be something else (that is, a narrower, wider, or unsigned integer type).
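To make the point concrete, here is an illustrative sketch (kernel and array names are made up) of two variants one might compare by compiling with nvcc and inspecting the SASS with cuobjdump --dump-sass; the unsigned short version may show extra widening/truncation instructions around the two IMADs:

```cpp
// Hypothetical comparison kernels. Only the stored result needs 16 bits;
// the arithmetic itself can live entirely in 32-bit int registers.
__global__ void formula_u16(const unsigned short *x, const unsigned short *y,
                            const unsigned short *z, const unsigned short *v,
                            const unsigned short *w, unsigned short *r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        r[i] = x[i] * y[i] + z[i] - v[i] * w[i];  // operands promoted to int, result truncated
}

__global__ void formula_i32(const int *x, const int *y, const int *z,
                            const int *v, const int *w, unsigned short *r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        r[i] = (unsigned short)(x[i] * y[i] + z[i] - v[i] * w[i]);  // plain 32-bit IMAD/IADD
}
```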


Thank you for all the provided information!

I have two follow-up questions, if you don't mind.

If SIMD instructions bring that much complexity, what would be a good situation to use them?

I could use 16-bit floating point instead for some of the calculations. As of right now everything is calculated using integers, which might indeed put pressure on the integer pipe. Given that the V100 can execute int and float arithmetic simultaneously, could doing part of the calculations in floating point and the rest in integers improve performance? (Of course, calculations that depend on each other would not benefit from this.)

Existing integer SIMD intrinsics in CUDA can be useful for processing byte-size data in (1) simple image processing tasks and (2) processing of genomics data (e.g. Smith-Waterman). I am not aware of other use cases, which doesn't mean they don't exist. The way the SIMD intrinsics help improve performance for these use cases is by maximizing memory throughput and minimizing dynamic instruction count.
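As an illustrative sketch (the kernel and array names are hypothetical), the byte-wise SIMD intrinsics let one 32-bit register carry four 8-bit lanes, e.g. a saturating add over packed pixels:

```cpp
// Illustrative sketch (hypothetical names): __vaddus4 performs a per-byte
// unsigned saturating add, so each thread processes four 8-bit values
// packed into one 32-bit word.
__global__ void add_pixels(const unsigned int *a, const unsigned int *b,
                           unsigned int *out, int n_words)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_words)
        out[i] = __vaddus4(a[i], b[i]);
}
```

This is where the benefit shows up: four elements are loaded, processed, and stored per 32-bit word and per instruction.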

The two-way FP16 intrinsics mostly help extract the maximum number of FLOPS from the hardware. Personally, I remain highly critical of the use of FP16 for any kind of general-purpose computation, but there are many use cases where it can be useful for storage, e.g. for the processing of sensor data from physical processes which often have limited resolution based on AD conversion. There are use cases for FP16 computation in AI, but the introduction of the BFLOAT16 format tells me that it may be marginal even there.
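For completeness, a minimal sketch of the two-way FP16 path (kernel and array names are made up); a __half2 packs two FP16 values, so one __hfma2 produces two fused multiply-add results per instruction:

```cpp
#include <cuda_fp16.h>

// Illustrative sketch (hypothetical names): packed half2 FMA, two FP16
// results per instruction.
__global__ void fma_half2(const __half2 *x, const __half2 *y,
                          const __half2 *z, __half2 *out, int n_pairs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pairs)
        out[i] = __hfma2(x[i], y[i], z[i]);   // lanes: (x.lo*y.lo + z.lo, x.hi*y.hi + z.hi)
}
```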


It might be possible. I did an optimization on a particular integer-only code, doing some range analysis and moving some data from integer to floating point, and got a significant speedup. So I know it's possible in some cases. However, in my case it was a GPU that predated Volta, which did not have native 32-bit integer multiply. Strangely enough, it was indexing calculations that I moved, so even the cost of integer->float->integer conversion was amortized and still yielded a benefit. You can imagine that index calculations will often involve a multiply and also may be inherently easier to do range analysis on. Since Volta and beyond have "full rate" integer multiply, it might be harder to get an attractive benefit this way. I believe NVIDIA's marketing materials more or less make this exact point regarding integer arithmetic on Volta. So I wouldn't put this at the top of your list without more evidence.

People often approach performance this way - trying to apply specific pieces of knowledge to see what will happen. I do it too. It’s not a horrible way to learn. You run experiments, then do your best to explain the results. However without knowing what your code is limited by, this could all be very academic, or irrelevant. And taking on a large code refactoring exercise using this sort of mentality might not be the best use of your time. For a large amount of expended effort, I’d want to have some assurance or likelihood of a payoff at the other end.

My advice when teaching CUDA is that you should have a few (possibly just two, but maybe as many as 10 or so) basic paradigms understood so that you “tend” to write performant code “naturally”. For everything else, you leave that to profiler guided performance analysis and optimization. Make aggressive use of library implementations where possible.

I can't think of any of the top 10 or so paradigms - except possibly the suggestion by njuffa about using 16-bit packed data to make more efficient use of memory - that would apply to what we're talking about here.

My suggestion would be to write your code in a way that seems natural, understandable, and maintainable to you, and then let the profiler guide you.

Going after compute-boundedness (mostly what's being discussed here) without solid evidence is very often misguided, in my experience. Many people think the arithmetic in their code is important, when actually it is the way they use memory that matters most.

Obviously I can’t speak to your specific case, YMMV, take with a grain of salt, ignore if annoying.


I concur with this observation. The way I like to phrase it, somewhat flippantly, is that “FLOPS are too cheap to meter”. Rough guiding principles like this can be helpful, but it is best to use the CUDA profiler to actually identify specific bottlenecks.