I’m working on a project that implements portable versions of SIMD intrinsics and uses the native versions where they are available. As a byproduct, I’m slowly writing tests for each intrinsic. So far I’ve only finished MMX, but several tests that pass with gcc, clang, icc, and msvc fail with PGI 16.10.
Edit 2017-04-17: I made some progress on SSE over the weekend and ran into a few problems there, too. I’ve added them to the lists below.
A few functions don’t seem to be implemented in PGI:
- _mm_cvtm64_si64
- _mm_cvtsi64_m64
- _mm_cvtsi64_ss
- _mm_cvttss_si64
- _m_to_int64
The x-suffixed variants _mm_cvtsi64x_ss and _mm_cvttss_si64x do exist, and AFAICT they are equivalent to _mm_cvtsi64_ss and _mm_cvttss_si64, but they aren’t actually part of Intel’s API. MSVC and gcc have them, but clang and ICC don’t; all four have the Intel-named versions.
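For reference, here is a minimal portable sketch of what I understand these two SSE conversions to do. The type and function names (portable_m128, portable_cvtsi64_ss, portable_cvttss_si64) are hypothetical stand-ins for illustration; they are not taken from SIMDe or from any compiler's headers.

```c
#include <stdint.h>

/* Hypothetical stand-in for __m128: four packed single-precision floats. */
typedef struct {
    float f32[4];
} portable_m128;

/* Model of _mm_cvtsi64_ss: convert b to float, store it in lane 0,
 * and pass the upper three lanes of a through unchanged. */
static portable_m128 portable_cvtsi64_ss(portable_m128 a, int64_t b) {
    a.f32[0] = (float) b;
    return a;
}

/* Model of _mm_cvttss_si64: convert lane 0 to a 64-bit integer,
 * truncating toward zero. NaN and out-of-range inputs are NOT handled
 * here: the hardware returns INT64_MIN for them, while the plain C cast
 * below is undefined in those cases. */
static int64_t portable_cvttss_si64(portable_m128 a) {
    return (int64_t) a.f32[0];
}
```

A fallback along these lines only needs to be selected when neither the Intel-named nor the x-suffixed intrinsic is available.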
Additionally, several functions don’t seem to be functioning as expected:
- _mm_slli_pi16
- _mm_slli_pi32
- _mm_srli_pi16
- _mm_srli_pi32
- _mm_srli_si64
- _mm_srai_pi16
- _mm_srai_pi32
- _mm_cmpge_ss
- _mm_cmpgt_ss
- _mm_cmpnge_ss
- _mm_cmpngt_ss
- _mm_cmpunord_ss
The MMX functions all seem to share the same problem: I get 0 instead of the expected result. They are all similar functions that shift an __m64 by an int count. Here is a quick test for _mm_slli_pi16:
#include <mmintrin.h>
#include <assert.h>
#include <stdlib.h>

int main(void) {
  const struct {
    __m64 a;
    int count;
    __m64 r;
  } test_vec[8] = {
    { _mm_set_pi16(0xcb19, 0x18d8, 0xfae6, 0xe8c4),
      6,
      _mm_set_pi16(0xc640, 0x3600, 0xb980, 0x3100) },
    { _mm_set_pi16(0x196a, 0x908b, 0x0f94, 0x8616),
      10,
      _mm_set_pi16(0xa800, 0x2c00, 0x5000, 0x5800) },
    { _mm_set_pi16(0x4bbc, 0xee58, 0x256e, 0x2b3b),
      9,
      _mm_set_pi16(0x7800, 0xb000, 0xdc00, 0x7600) },
    { _mm_set_pi16(0x2ee0, 0x70cc, 0x748a, 0xca52),
      13,
      _mm_set_pi16(0x0000, 0x8000, 0x4000, 0x4000) },
    { _mm_set_pi16(0x1228, 0xf799, 0x97ef, 0x93f5),
      13,
      _mm_set_pi16(0x0000, 0x2000, 0xe000, 0xa000) },
    { _mm_set_pi16(0xf6cf, 0x4f5d, 0x1d02, 0x60d4),
      8,
      _mm_set_pi16(0xcf00, 0x5d00, 0x0200, 0xd400) },
    { _mm_set_pi16(0xe7e2, 0x7b04, 0x6f9f, 0xb061),
      1,
      _mm_set_pi16(0xcfc4, 0xf608, 0xdf3e, 0x60c2) },
    { _mm_set_pi16(0x895d, 0x43b6, 0x097c, 0xee32),
      5,
      _mm_set_pi16(0x2ba0, 0x76c0, 0x2f80, 0xc640) }
  };

  for (size_t i = 0 ; i < (sizeof(test_vec) / sizeof(test_vec[0])); i++) {
    __m64 r = _mm_slli_pi16(test_vec[i].a, test_vec[i].count);
    _mm_empty();

    /* Compare the four 16-bit lanes of the result against the expected value. */
    short* rp = (short*) &r;
    short* xp = (short*) &(test_vec[i].r);
    assert(rp[0] == xp[0]);
    assert(rp[1] == xp[1]);
    assert(rp[2] == xp[2]);
    assert(rp[3] == xp[3]);
  }

  return EXIT_SUCCESS;
}
Tests for the other intrinsics are in my project (see test/test-mmx.c); if you need stand-alone versions, I can provide them.
To see what’s going on, I looked at the assembly generated for something like
#include <mmintrin.h>

__m64 foo(__m64 a, int count) {
  return _mm_slli_pi16(a, count);
}
For clang (gcc and icc are pretty similar), I get:
movd %edi, %mm0
movdq2q %xmm0, %mm1
psllw %mm0, %mm1
movq2dq %mm1, %xmm0
retq
But for PGI:
mov %esi,-0xc(%rsp)
mov %rdi,-0x8(%rsp)
movq -0x8(%rsp),%mm0
psllw -0xc(%rsp),%mm0
movq %mm0,-0x8(%rsp)
mov -0x8(%rsp),%rax
retq
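One possible reading of the PGI output (my interpretation of the listing, not something confirmed by PGI): psllw with a memory operand reads a 64-bit count, but only the 32-bit int was stored at -0xc(%rsp). The upper half of the count would then be the low 32 bits of a, stored right after it at -0x8(%rsp), and any count above 15 zeroes every 16-bit lane, which would match the 0 results. The scalar model below illustrates this; model_psllw and pgi_style_count are hypothetical helper names, and a little-endian layout is assumed.

```c
#include <stdint.h>

/* Scalar model of PSLLW on a 64-bit MMX value: shift each 16-bit lane
 * left by a 64-bit count; counts above 15 clear every lane. */
static uint64_t model_psllw(uint64_t a, uint64_t count) {
    uint64_t r = 0;
    if (count > 15) {
        return 0;
    }
    for (int lane = 0; lane < 4; lane++) {
        uint16_t v = (uint16_t) (a >> (16 * lane));
        v = (uint16_t) ((uint32_t) v << count);
        r |= (uint64_t) v << (16 * lane);
    }
    return r;
}

/* Assumption: the 8 bytes read at -0xc(%rsp) are the 32-bit count
 * followed by the low 32 bits of a (little-endian stack layout). */
static uint64_t pgi_style_count(uint64_t a, int count) {
    return (uint64_t) (uint32_t) count | ((a & 0xffffffffu) << 32);
}
```

With the first test vector (a = 0xcb1918d8fae6e8c4, count = 6), model_psllw(a, 6) gives the expected 0xc6403600b9803100, while model_psllw(a, pgi_style_count(a, 6)) gives 0.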
I haven’t looked at the SSE functions closely yet, but if you want I can try to put together some stand-alone test cases, or you can just grab a copy of SIMDe and run the tests yourself.