Missing or broken MMX intrinsics

I’m working on a project to implement portable versions of SIMD intrinsics, or use the native versions if available. One of the byproducts of this is that I’m slowly creating tests for each intrinsic. So far I’ve only finished MMX, but I’ve been having some trouble with PGI 16.10 with tests that work in gcc, clang, icc, and msvc.

Edit 2017-04-17: I made some progress on SSE over the weekend, and I’ve run into a few problems with SSE, too. I’ve added them to the list.

A few functions don’t seem to be implemented in PGI:

  • _mm_cvtm64_si64
  • _mm_cvtsi64_m64
  • _mm_cvtsi64_ss
  • _mm_cvttss_si64
  • _m_to_int64

_mm_cvtsi64x_ss and _mm_cvttss_si64x exist, and AFAICT are equivalent to _mm_cvtsi64_ss and _mm_cvttss_si64, but aren’t actually part of Intel’s API. MSVC and gcc have them, but clang and ICC don’t; all 4 have the Intel version.

Additionally, several functions don’t seem to be functioning as expected:

  • _mm_slli_pi16
  • _mm_slli_pi32
  • _mm_srli_pi16
  • _mm_srli_pi32
  • _mm_srli_si64
  • _mm_srai_pi16
  • _mm_srai_pi32
  • _mm_cmpge_ss
  • _mm_cmpgt_ss
  • _mm_cmpnge_ss
  • _mm_cmpngt_ss
  • _mm_cmpunord_ss

The MMX functions all seem to have the same problem (I get 0 instead of the expected result), and they’re all similar functions which shift an __m64 by an int. Here is a quick test for _mm_slli_pi16:

#include <mmintrin.h>
#include <assert.h>
#include <stdlib.h>

int main(void) {
  const struct {
    __m64 a;
    int count;
    __m64 r;
  } test_vec[8] = {
    { _mm_set_pi16(0xcb19, 0x18d8, 0xfae6, 0xe8c4),
      6,
      _mm_set_pi16(0xc640, 0x3600, 0xb980, 0x3100) },
    { _mm_set_pi16(0x196a, 0x908b, 0x0f94, 0x8616),
      10,
      _mm_set_pi16(0xa800, 0x2c00, 0x5000, 0x5800) },
    { _mm_set_pi16(0x4bbc, 0xee58, 0x256e, 0x2b3b),
      9,
      _mm_set_pi16(0x7800, 0xb000, 0xdc00, 0x7600) },
    { _mm_set_pi16(0x2ee0, 0x70cc, 0x748a, 0xca52),
      13,
      _mm_set_pi16(0x0000, 0x8000, 0x4000, 0x4000) },
    { _mm_set_pi16(0x1228, 0xf799, 0x97ef, 0x93f5),
      13,
      _mm_set_pi16(0x0000, 0x2000, 0xe000, 0xa000) },
    { _mm_set_pi16(0xf6cf, 0x4f5d, 0x1d02, 0x60d4),
      8,
      _mm_set_pi16(0xcf00, 0x5d00, 0x0200, 0xd400) },
    { _mm_set_pi16(0xe7e2, 0x7b04, 0x6f9f, 0xb061),
      1,
      _mm_set_pi16(0xcfc4, 0xf608, 0xdf3e, 0x60c2) },
    { _mm_set_pi16(0x895d, 0x43b6, 0x097c, 0xee32),
      5,
      _mm_set_pi16(0x2ba0, 0x76c0, 0x2f80, 0xc640) }
  };

  for (size_t i = 0 ; i < (sizeof(test_vec) / sizeof(test_vec[0])); i++) {
    __m64 r = _mm_slli_pi16(test_vec[i].a, test_vec[i].count);
    _mm_empty();

    const short* rp = (const short*) &r;
    const short* xp = (const short*) &(test_vec[i].r);
    assert(rp[0] == xp[0]);
    assert(rp[1] == xp[1]);
    assert(rp[2] == xp[2]);
    assert(rp[3] == xp[3]);
  }

  return EXIT_SUCCESS;
}

There are tests for the other intrinsics in my project (see test/test-mmx.c); if you need stand-alone versions, I can provide them.

To see what’s going on, I looked at the assembly generated for something like

#include <mmintrin.h>

__m64 foo(__m64 a, int count) {
  return _mm_slli_pi16(a, count);
}

For clang (gcc and icc are pretty similar), I get:

movd %edi, %mm0
movdq2q %xmm0, %mm1
psllw %mm0, %mm1
movq2dq %mm1, %xmm0
retq

But for PGI:

mov %esi,-0xc(%rsp)
mov %rdi,-0x8(%rsp)
movq -0x8(%rsp),%mm0
psllw -0xc(%rsp),%mm0
movq %mm0,-0x8(%rsp)
mov -0x8(%rsp),%rax
retq

I haven’t looked at the SSE functions closely yet, but if you want I can try to put together some stand-alone test cases, or you can just grab a copy of SIMDe and run the tests yourself.

I have replicated your behavior and we have logged the issue as TPR 24170.

dave

Another missing function, this time in SSE2: _mm_cvtsi64_si128 (_mm_cvtsi64x_si128 exists, though).

Also, _mm_min_epi8 (from SSE 4.1) is missing.

I have added this additional information to TPR 24170.
dave

Thanks Dave. Sorry to keep bumping this topic; if there is a better way to report issues please let me know.

Anyways, I’m running into a new issue now. I’m trying to add support for _mm_shuffle_pi16 and _mm_shuffle_ps, but the compiler gets stuck (both locally and on Travis CI). FWIW, GCC, clang, and Intel are okay with the code.

I haven’t had much success putting together a minimal test case. Is there something I can do to save intermediate files (like --save-temps with GCC) so I can at least try to reproduce it with a single file?

Another missing intrinsic from SSE: _mm_undefined_ps.

FWIW it was missing in GCC until 4.9, too.

Edit: there are actually a bunch of these undefined intrinsics, and AFAICT none of them are available in pgcc.

I have added your information to the TPR. I believe we are aware of what is missing, and it is a matter of doing the work to implement them. But user reports about the missing routines help raise their visibility and priority.

dave

FWIW, the missing intrinsics seem to be mostly functions which are only available on 64-bit.

Anyways, I found a few more functions which generate incorrect results (with 17.4 community edition). _mm_mulhi_epu16 generates the wrong instruction, and _mm_cvtsd_f64 returns the second element instead of the first. The fixes are trivial:

--- emmintrin-orig.h	2017-05-07 20:32:21.726746806 -0700
+++ emmintrin.h	2017-05-08 19:05:51.769286452 -0700
@@ -758,7 +758,7 @@
 
   __u.__v = __A;
 
-  return __u.__a[1];
+  return __u.__a[0];
 }
 
 /* Create the vector [Y Z].  */
@@ -2360,7 +2360,7 @@
 ATTRIBUTE __m128i
 _mm_mulhi_epu16(__m128i __A, __m128i __B)
 {
-  __asm__("PMULHW %1, %0" : "=x"(__A) : "x"(__B), "0"(__A));
+  __asm__("PMULHUW %1, %0" : "=x"(__A) : "x"(__B), "0"(__A));
   return __A;
 }

_mm_mul_su32 also doesn’t work as expected. Here is a quick test case which works with gcc and clang but not pgcc:

#include <emmintrin.h>
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <inttypes.h>

int main(void) {
  __m64 a = _mm_set_pi32(INT_MAX, INT_MAX);
  __m64 b = _mm_set_pi32(INT_MAX, INT_MAX);
  __m64 r = _mm_mul_su32(a, b);
  uint64_t *rp = (uint64_t*) &r;

  assert(*rp == UINT64_C(0x3fffffff00000001));

  return 0;
}

I’m going to stop posting these here regularly and instead do one post per ISA extension once I’ve finished implementing it. I’ve also started tracking these on GitHub in simd-everywhere/simde, issue #12 (“PGI C Compiler bugs”), which I will keep up to date.