The Kepler SIMD video instructions and, by extension, the platform-independent SIMD-in-a-word device intrinsics supported by CUDA are useful tools in protein sequence analysis. See for example:
Yongchao Liu, Adrianto Wirawan and Bertil Schmidt. “CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions”. BMC Bioinformatics, 2013, 14:117. (online at http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-117)
Hanyu Jiang and Narayan Ganesan. “CUDAMPF: a multi-tiered parallel framework for accelerating protein sequence search in HMMER on CUDA-enabled GPU”. BMC Bioinformatics, 2016, 17:106. (online at http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0946-4)
The operations typically used by such software are saturated SIMD addition, saturated SIMD subtraction, and SIMD maximum / minimum. While Kepler supports these operations in hardware, they are emulated on other architectures.
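For readers who have not used these operations, here is a minimal host-side sketch of their per-byte semantics for the unsigned four-lane variants (the counterparts of the CUDA intrinsics __vaddus4, __vsubus4, __vmaxu4 and __vminu4). The ref_* names and the map_lanes helper are made up for illustration; this is a plain C++ reference model, not how the hardware or the CUDA headers implement the operations.

```cpp
#include <cstdint>

// Apply a per-byte operation to each of the four unsigned 8-bit lanes
// packed into a 32-bit word (hypothetical helper, for illustration only).
template <typename Op>
static uint32_t map_lanes(uint32_t a, uint32_t b, Op op)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t x = (a >> (8 * i)) & 0xFFu;
        uint32_t y = (b >> (8 * i)) & 0xFFu;
        r |= (op(x, y) & 0xFFu) << (8 * i);
    }
    return r;
}

// Saturated addition: each lane is min(x + y, 255).
uint32_t ref_vaddus4(uint32_t a, uint32_t b)
{
    return map_lanes(a, b, [](uint32_t x, uint32_t y) -> uint32_t {
        return (x + y > 0xFFu) ? 0xFFu : x + y;
    });
}

// Saturated subtraction: each lane is max(x - y, 0).
uint32_t ref_vsubus4(uint32_t a, uint32_t b)
{
    return map_lanes(a, b, [](uint32_t x, uint32_t y) -> uint32_t {
        return (x > y) ? x - y : 0u;
    });
}

// Per-lane unsigned maximum.
uint32_t ref_vmaxu4(uint32_t a, uint32_t b)
{
    return map_lanes(a, b, [](uint32_t x, uint32_t y) -> uint32_t {
        return (x > y) ? x : y;
    });
}

// Per-lane unsigned minimum.
uint32_t ref_vminu4(uint32_t a, uint32_t b)
{
    return map_lanes(a, b, [](uint32_t x, uint32_t y) -> uint32_t {
        return (x < y) ? x : y;
    });
}
```

In a dynamic-programming alignment kernel, saturation keeps packed scores from wrapping around when a lane overflows or underflows, which is why these particular operations show up so often.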
In most cases this emulation is quite competitive, especially since the introduction of the LOP3 instruction with Maxwell. However, I noticed that the CUDA 7.5 implementation of __vsubus4() can be improved upon by replacing it with the following:
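#include <stdint.h>

// Byte-wise unsigned saturated subtraction: each byte of r is max(a - b, 0)
__device__ uint32_t my_vsubus4 (uint32_t a, uint32_t b)
{
    uint32_t r, t;
    t = __vcmpgtu4 (a, b);  // per-byte mask: 0xff where a > b, else 0x00
    r = (a & t) - (b & t);  // a - b where a > b, 0 - 0 elsewhere; no borrow crosses bytes
    return r;
}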
On non-Kepler architectures this saves three instructions compared with the CUDA 7.5 implementation. I have filed an RFE (bug 1784493) for incorporation of this improvement into future CUDA versions.
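The masked-subtract approach can be sanity-checked on the host without a GPU. The sketch below is only an illustration: emu_vcmpgtu4 is a hypothetical stand-in that assumes the per-byte semantics of __vcmpgtu4 (0xff where a > b unsigned, 0x00 otherwise), and bytewise_vsubus4, masked_vsubus4 and count_mismatches are made-up names. The trick works because, after masking, every byte lane holds either a > b or 0 and 0, so no lane ever borrows from its neighbor in the full 32-bit subtraction.

```cpp
#include <cstdint>

// Host stand-in for __vcmpgtu4 (assumed semantics): each result byte is
// 0xFF where the byte of a is greater than the byte of b, else 0x00.
uint32_t emu_vcmpgtu4(uint32_t a, uint32_t b)
{
    uint32_t m = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t x = (a >> (8 * i)) & 0xFFu;
        uint32_t y = (b >> (8 * i)) & 0xFFu;
        if (x > y) m |= 0xFFu << (8 * i);
    }
    return m;
}

// The masked-subtract formulation of saturated subtraction.
uint32_t masked_vsubus4(uint32_t a, uint32_t b)
{
    uint32_t t = emu_vcmpgtu4(a, b);
    return (a & t) - (b & t);
}

// Straightforward byte-by-byte reference: max(x - y, 0) per lane.
uint32_t bytewise_vsubus4(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t x = (a >> (8 * i)) & 0xFFu;
        uint32_t y = (b >> (8 * i)) & 0xFFu;
        uint32_t d = (x > y) ? x - y : 0u;
        r |= d << (8 * i);
    }
    return r;
}

// Compare both versions on n pseudo-random operand pairs generated by a
// simple linear congruential generator; returns the number of mismatches.
uint64_t count_mismatches(uint32_t n)
{
    uint32_t s = 12345u;
    uint64_t bad = 0;
    for (uint32_t i = 0; i < n; ++i) {
        s = s * 1664525u + 1013904223u;
        uint32_t a = s;
        s = s * 1664525u + 1013904223u;
        uint32_t b = s;
        if (masked_vsubus4(a, b) != bytewise_vsubus4(a, b)) ++bad;
    }
    return bad;
}
```

A random sample like this is a smoke test, not a proof. On the device, the same comparison could be run against __vsubus4() itself over whatever operand set one considers sufficient.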