Faster __vsubus4() implementation

The Kepler SIMD video instructions, and by extension the platform-independent SIMD-in-a-word device intrinsics supported by CUDA, are useful tools in protein sequencing. See, for example:

Yongchao Liu, Adrianto Wirawan and Bertil Schmidt. “CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions”. BMC Bioinformatics, 2013, 14:117 (online at http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-117)
Hanyu Jiang and Narayan Ganesan. “CUDAMPF: a multi-tiered parallel framework for accelerating protein sequence search in HMMER on CUDA-enabled GPU”. BMC Bioinformatics, 2016, 17:106. (online at http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0946-4)

The operations typically used by such software are saturated SIMD addition, saturated SIMD subtraction, and SIMD maximum / minimum. While Kepler supports these operations in hardware, they are emulated on other architectures.
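To illustrate how such emulation can work without hardware support, here is a host-side C sketch of a per-byte unsigned maximum (the semantics of __vmaxu4()), built from a per-byte comparison mask. The helper names are my own, and the comparison function is a hand-written stand-in for the device intrinsic __vcmpgtu4(), not code from the CUDA headers:

```c
#include <assert.h>
#include <stdint.h>

/* Per-byte mask, 0xff where the byte of a is greater than the byte
   of b (host stand-in for the device intrinsic __vcmpgtu4()). */
static uint32_t cmpgtu4 (uint32_t a, uint32_t b)
{
    uint32_t t = 0;
    for (int i = 0; i < 4; i++) {
        if (((a >> (8*i)) & 0xff) > ((b >> (8*i)) & 0xff))
            t |= 0xffu << (8*i);
    }
    return t;
}

/* Per-byte unsigned maximum via mask select, i.e. the semantics of
   __vmaxu4(): pick a's byte where a > b, else b's byte. */
static uint32_t maxu4 (uint32_t a, uint32_t b)
{
    uint32_t t = cmpgtu4 (a, b);
    return (a & t) | (b & ~t);
}
```

Whether a given toolkit version emulates __vmaxu4() exactly this way is beside the point; the sketch just shows the mask-select idiom that makes byte-wise operations cheap on hardware without SIMD video instructions.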

In most cases this emulation is quite competitive, especially since the introduction of the LOP3 instruction with Maxwell. However, I noticed that the CUDA 7.5 implementation of __vsubus4() can be improved upon by replacing it with the following:

__device__ uint32_t my_vsubus4 (uint32_t a, uint32_t b)
{
    uint32_t r, t;
    t = __vcmpgtu4 (a, b);   // per-byte mask: 0xff where a > b
    r = (a & t) - (b & t);   // no borrow crosses byte boundaries
    return r;
}
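To see why this works: __vcmpgtu4() yields a per-byte mask of 0xff where a's byte exceeds b's, so within every masked byte the difference is strictly positive and no borrow propagates into the next byte lane, while unmasked bytes compute 0 - 0. A quick host-side C check (with a hand-written stand-in for __vcmpgtu4(), since that intrinsic exists only in device code) might look like this:

```c
#include <assert.h>
#include <stdint.h>

/* Host stand-in for the device intrinsic __vcmpgtu4(): per-byte
   mask, 0xff where the byte of a is greater than the byte of b. */
static uint32_t host_vcmpgtu4 (uint32_t a, uint32_t b)
{
    uint32_t t = 0;
    for (int i = 0; i < 4; i++) {
        if (((a >> (8*i)) & 0xff) > ((b >> (8*i)) & 0xff))
            t |= 0xffu << (8*i);
    }
    return t;
}

/* Same bit trick as my_vsubus4() above, run on the host. */
static uint32_t host_vsubus4 (uint32_t a, uint32_t b)
{
    uint32_t t = host_vcmpgtu4 (a, b);
    return (a & t) - (b & t);
}

/* Per-byte scalar reference: max(a - b, 0) in each lane. */
static uint32_t ref_vsubus4 (uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        int d = (int)((a >> (8*i)) & 0xff) - (int)((b >> (8*i)) & 0xff);
        if (d < 0) d = 0;
        r |= (uint32_t)d << (8*i);
    }
    return r;
}
```

Comparing host_vsubus4() against ref_vsubus4() over a few million random inputs is an easy way to convince yourself of the lane-independence argument before benchmarking the device version.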

On non-Kepler architectures, this saves three instructions compared with the CUDA 7.5 implementation. I have filed an RFE (bug 1784493) for incorporation of this improvement into future CUDA versions.

Bug 457860 back in 2014 documents another common inefficiency, likely fixed by now.

There are actually two implementations of the video instructions in CUDA… one in the toolkit (in device_functions.h, expanding the CUDA C video intrinsic inline into non-intrinsic code which sm_1x and sm_2x can handle) and one in ptxas (converting sm_3x and sm_5x PTX video instructions into sm_5x SASS).
The problem was that the SASS code generated for sm_5x to emulate the PTX video instructions was often much slower than the C header code (used for sm_2x) to emulate the C intrinsic with non-video PTX code.

For example, for VABSDIFF4, sm_50 converts the VABSDIFF4 PTX instruction into 108 lines of SASS code. Copying the C intrinsic code from the sm_20 branch in the device_functions.h header compiles to PTX which compiles to just 14 SASS instructions and executes proportionally faster.
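As an illustration of how compact such an emulation can be (this is not necessarily the code in device_functions.h), the per-byte absolute difference can be built from two saturating subtractions: in each byte lane one of max(a-b, 0) and max(b-a, 0) is zero and the other is |a-b|, so OR-ing them gives the result. A host-side C sketch, with helper names of my own:

```c
#include <assert.h>
#include <stdint.h>

/* Per-byte unsigned saturating subtract, scalar host reference
   (the semantics of __vsubus4()). */
static uint32_t sat_sub4 (uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        int d = (int)((a >> (8*i)) & 0xff) - (int)((b >> (8*i)) & 0xff);
        r |= (uint32_t)(d < 0 ? 0 : d) << (8*i);
    }
    return r;
}

/* Per-byte absolute difference via two saturating subtracts: in each
   lane one operand of the OR is zero, the other is |a - b|. */
static uint32_t absdiff4 (uint32_t a, uint32_t b)
{
    return sat_sub4 (a, b) | sat_sub4 (b, a);
}
```

On the device this would be __vsubus4(a, b) | __vsubus4(b, a); whether that beats the toolkit's own __vabsdiffu4() code on a given architecture is something one would have to measure.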

So if you’re using any video instructions in current code, you might experiment with manually copying code from the video intrinsic header file instead of letting CUDA do it, just to compare the performance of the two versions. Repeated caveat: SASS generation has likely improved since this discrepancy was reported two years ago, but it’s still interesting to compare the two versions.

CUDA 8 drops sm_1x and sm_2x, so I wonder if this emulation header has been stripped and simplified. If so, check out the CUDA 7.5 version.

By the way, the device_functions.h header is an excellent way to study inline PTX. The dozens of intrinsic emulations are really well written and documented line by line. And just as interesting, the expanded IEEE functions provide an entire floating point education, with tons of examples of denormal, NaN, and infinity handling, plus rounding modes, flush to zero, etc.

Norbert, I bet much of this code is yours. Kudos to the author(s) regardless!

CUDA’s SIMD-in-a-word intrinsics, including the CUDA-level emulation for non-Kepler platforms, were pretty much all my work, correct. The code may have changed after I left NVIDIA, of course. Side remark: sm_1x support is already gone in 7.5 (I think 7.0 as well, but I never used that version :-)

Most of the CUDA-level emulation code is identical between sm_2x and sm_5x, for the most part relying on the compiler to generate LOP3, funnel shifts etc as appropriate. Best I can tell, it is doing a good job. I can’t expect it to find algorithmic optimizations that I myself overlooked when I wrote the code :-) When assessing the competitiveness of SIMD intrinsic emulation one should keep in mind that Kepler’s SIMD video instructions have 1/4 throughput.

It should be noted that each of the SIMD intrinsics exposes exactly one flavor of one video SIMD instruction. But if you look at the instructions themselves, they are much more complicated than what is exposed by the intrinsics: the instructions often support four or five specialized variants, and PTX-level emulation has to support all of that complexity post-Kepler. That makes efficient emulation hard, and as a consequence, PTX-level emulation of the Kepler SIMD instructions tends to be slower than the emulation built into the intrinsics. That does not mean there is no room for improvement in PTX-level emulation; I have not looked into the details of that.

TL;DR: For performance reasons, programmers should use the SIMD device intrinsics wherever possible, rather than the SIMD instructions in inline PTX. That is why I created these intrinsics in the first place.

[Later:] For the record: Using the CUDA 7.5 toolchain and compiling for sm_50, I find that __vabsdiffu4() is in fact mapped to a native SASS instruction:

VABSDIFF4.U8.U8 R0, R0, R5, RZ;

For sm_20, the intrinsic compiles to an 11-instruction emulation sequence:

LOP.XOR R0, ~R3, c[0x0][0x20];
LOP32I.AND R2, R0, 0xfefefefe;
LOP.OR R0, ~R3, c[0x0][0x20];
SHR.U32 R2, R2, 0x1;
ISUB R0, R0, R2;
LOP.XOR R2, R3, c[0x0][0x20];
PRMT R0, R0, 0xba98, RZ;
LOP.AND R0, R0, R2;
LOP.XOR R2, R0, c[0x0][0x24];
LOP.XOR R0, R0, c[0x0][0x20];
ISUB R0, R2, R0;

Could you double-check on your end whether you are seeing something different? In any event, whatever performance bug may have existed in the past with regard to that intrinsic, it seems fixed now.

I don’t think CUDA 8 drops sm_2x, unless by “drop” you mean “deprecate”, but I may be misinterpreting that comment in any number of ways.