NVCC compile errors when using SSE intrinsic functions, with GCC as host compiler

Hello, I’m getting compile errors when compiling this code:

#include <emmintrin.h>
int main()
{
  __m128i x;
  int y;
  y = _mm_extract_epi16(x, 1);
  return 0;
}

The error message is:

a.cu(13): error: identifier "__builtin_ia32_vec_ext_v8hi" is undefined

I’m using the CUDA 5 toolkit on SUSE Linux Enterprise 11 (gcc 4.3), which should be a supported platform. I also tried gcc 4.6, but the same error occurs.

Can this problem be fixed or will I just have to avoid this combination?
Also, is this compile error coming from NVCC itself or from gcc?

I am aware that there were some issues in the past with the use of SSE intrinsic header files in the host portion of CUDA programs (.cu files). I am not sure what the issue was; it may have had to do with certain #ifdefs inside the header files, but that is just speculation. The easy workaround was to put all host code containing SSE intrinsics into separate C or C++ source files.

However, more recently I have not encountered any problems when using xmmintrin.h (SSE) in the host portion of my CUDA programs (across Linux, Windows, and Mac OS X), although I have not tried emmintrin.h (SSE2). You probably need to enable SSE2 by passing the -msse2 command line flag to gcc, which may also take care of the necessary #ifdefs in the intrinsic header files. To do so from the nvcc command line, use -Xcompiler -msse2.
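For example, assuming the source file is named ssetest.cu, the invocation would look like this:

nvcc -Xcompiler -msse2 ssetest.cu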

I tried -msse2, but it didn’t help. Besides, SSE2 is part of the baseline x86-64 instruction set, so it is enabled by default when targeting x86-64. I also tried other intrinsics that do not map to __builtin_* functions, and those compile fine.
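For what it’s worth, you can see that by dumping gcc’s predefined macros on an x86-64 box; __SSE2__ shows up even without any -m switches:

gcc -dM -E -x c /dev/null | grep -i sse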

I have a feeling it’s a problem with the NVIDIA compiler. It shouldn’t be compiling that host code anyway, especially when it uses compiler intrinsics.

If I don’t see any solution soon, I’m going to file a bug report.

Host code must be pre-processed by the CUDA compiler before it is passed to the host compiler. There could be an issue with that pre-processing, or nvcc may not be passing some flag to the host compiler that is required for the successful handling of SSE intrinsics.
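If it helps narrow this down, nvcc’s --dryrun option should list the individual compilation steps, including the exact host compiler command line, without executing them, so you can see which flags do or do not get forwarded to gcc:

nvcc --dryrun -Xcompiler -msse2 ssetest.cu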

Filing a bug report (through the registered developer website) with a self-contained repro program is the best approach to getting this resolved. Thank you for your help.

For now, as a workaround, you can simply move host code containing SSE intrinsics into a separate file that is compiled directly by the host compiler, along the lines of the sketch below.
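Here is a minimal sketch of that split (the file names and the helper function are just placeholders): keep every SSE type and intrinsic inside a .cpp file compiled by gcc, and expose only plain C types to the .cu file.

// sse_part.cpp: compiled directly by the host compiler, so nvcc never sees emmintrin.h
#include <emmintrin.h>

int extract_second_word(const short *p)
{
  __m128i v = _mm_loadu_si128((const __m128i *)p);  // load 8 shorts
  return _mm_extract_epi16(v, 1);                   // element 1, zero-extended
}

// main.cu: no SSE headers here, just a declaration of the helper
int extract_second_word(const short *p);

int main()
{
  short data[8] = {10, 20, 30, 40, 50, 60, 70, 80};
  return extract_second_word(data) == 20 ? 0 : 1;   // expect 20
}

Build and link, for example:

g++ -c -msse2 sse_part.cpp
nvcc main.cu sse_part.o -o ssetest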

FWIW, I was able to compile the above without error (using ‘nvcc ssetest.cpp’) with the CUDA 5.0 toolkit and gcc versions 4.5.1 and 4.6.3 as the host compilers.

tbenson, can you post the code for _mm_extract_epi16() in your emmintrin.h?

Mine is like this:

#ifdef __OPTIMIZE__
extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_extract_epi16 (__m128i const __A, int const __N)
{
  return __builtin_ia32_vec_ext_v8hi ((__v8hi)__A, __N);
}

Then I noticed this builtin does not follow the naming pattern of the others, which use the Intel machine instruction mnemonic. I searched for the symbol __builtin_ia32_vec_ext_v8hi in the CUDA compiler and didn’t find it, but sure enough, I did find __builtin_ia32_pextrw, where pextrw is the actual instruction mnemonic.

I tried #define __builtin_ia32_vec_ext_v8hi __builtin_ia32_pextrw, but that didn’t work because __builtin_ia32_pextrw takes an __m64 argument rather than an __m128i.
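In the meantime I can sidestep the builtin altogether with something like this (just a sketch; it relies on union type punning, which gcc supports, and it obviously won’t generate a pextrw):

#include <emmintrin.h>

/* builtin-free stand-in for _mm_extract_epi16: go through a union instead */
static inline int extract_epi16_fallback(__m128i v, int n)
{
  union { __m128i vec; unsigned short s[8]; } u;
  u.vec = v;
  return u.s[n];  /* _mm_extract_epi16 also zero-extends the element */
}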

Also, can you grep to see if __builtin_ia32_vec_ext_v8hi is defined in the CUDA compiler?

Thank you

Uncle Joe,

My _mm_extract_epi16() function is the same except for a type conversion (although removing it had no effect):

#ifdef __OPTIMIZE__
extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_extract_epi16 (__m128i const __A, int const __N)
{
  return (unsigned short) __builtin_ia32_vec_ext_v8hi ((__v8hi)__A, __N);
}

I do not quite follow the question about the builtin function being defined in the CUDA compiler. I ran strings and objdump on nvcc just to check, but I would expect the builtin functions to be defined by the host compiler to which nvcc delegates the host code.

If you compile with ‘nvcc -Xcompiler=-v ssetest.c’, then it should give verbose output from the host compiler, which will include the path to the compiler executable. In my case, the host compiler is gcc with corresponding executable /usr/libexec/gcc/x86_64-redhat-linux/4.6.2/cc1.
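Something like the following should surface that path directly (gcc writes its -v output to stderr, and your paths will of course differ):

nvcc -Xcompiler=-v ssetest.c 2>&1 | grep cc1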

I can then find the referenced builtin in that executable:

strings /usr/libexec/gcc/x86_64-redhat-linux/4.6.2/cc1 | grep builtin_ia32_vec_ext_v8hi
__builtin_ia32_vec_ext_v8hi

Hope that helps

I’ve filed a bug report with NVIDIA, and a helpful representative was able to reproduce the problem; they say it has been fixed in CUDA 5.5, which will be out soon.

Right, I’m puzzled too as to why nvcc complains about code it isn’t even supposed to compile, but apparently it has stubs for all the GCC builtin functions.