FP16 support for the ARM processor


I would like to know whether FP16 instructions are available on the TX2 CPU (not the GPU). It appears that the TX1 supports some intrinsics for loading, storing, and converting fp16 values (with -march=armv8.1-a+fp16), but there are no fp16 arithmetic instructions available.
With GCC 7, the -march=armv8.2-a+fp16 flag allows simple code using the vmulq_f16 intrinsic to compile, but running it on the TX1 produces an invalid instruction error.

Thanks !


Here is a sample code that generates the runtime error mentioned above.

The below code is compiled using gcc 7.1 with
g++ -march=armv8.2-a+fp16 [file]

If the compiled executable runs without error, then we know that the TX2 indeed has the FP16 SIMD support we’re looking for. Ideally we would be able to judge just by looking at the ARM processor documentation, but (in my experience) it has proven to be less than perfect on this subject, so I’d prefer to have empirical evidence.


#include <algorithm>
#include <cstdlib>
#include <vector>

#include <arm_neon.h>

// Return a pseudo-random half-precision value in [0, 1].
float16_t my_rand() {
  return float16_t( double(rand()) / double(RAND_MAX) );
}

int main(void) {
  srand( 11 );
  const size_t sz = 800;  // multiple of 8, so every iteration handles a full vector

  std::vector<float16_t> flt1(sz);
  std::vector<float16_t> flt2(sz);
  std::vector<float16_t> flt3(sz);
  std::generate( flt1.begin() , flt1.end() , my_rand );
  std::generate( flt2.begin() , flt2.end() , my_rand );

  // Multiply 8 half-precision lanes at a time; vmulq_f16 is the
  // ARMv8.2-A FP16 arithmetic instruction under test.
  for ( size_t i = 0 ; i < sz ; i += 8 ) {
    float16x8_t f1 = vld1q_f16( &(flt1[i]) );
    float16x8_t f2 = vld1q_f16( &(flt2[i]) );
    float16x8_t f3 = vmulq_f16( f1 , f2 );
    vst1q_f16( &(flt3[i]) , f3 );
  }

  return 0;
}

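As an alternative to running the test and waiting for a crash, the kernel can be asked directly. This is only a sketch, assuming Linux on AArch64, where getauxval(AT_HWCAP) reports the FPHP and ASIMDHP capability bits. The helper names (has_simd_fp16_arith, read_hwcap, report_fp16_support) are made up for illustration, and the bit values are copied from the arm64 hwcap header rather than included from it.

```cpp
#include <cstdio>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>  // getauxval, AT_HWCAP
#endif

// Bit values as defined in the arm64 Linux <asm/hwcap.h>. They are repeated
// here (an assumption of this sketch) so it also builds with older headers.
constexpr unsigned long kHwcapFpHp    = 1UL << 9;   // scalar FP16 arithmetic
constexpr unsigned long kHwcapAsimdHp = 1UL << 10;  // SIMD (NEON) FP16 arithmetic

// Hypothetical helpers: test a hwcap word for the FP16 arithmetic bits.
constexpr bool has_scalar_fp16_arith(unsigned long hwcap) {
  return (hwcap & kHwcapFpHp) != 0;
}
constexpr bool has_simd_fp16_arith(unsigned long hwcap) {
  return (hwcap & kHwcapAsimdHp) != 0;
}

// Returns the kernel-reported capability word, or 0 when not built for
// Linux on AArch64 (in which case nothing is reported as supported).
unsigned long read_hwcap() {
#if defined(__linux__) && defined(__aarch64__)
  return getauxval(AT_HWCAP);
#else
  return 0UL;
#endif
}

// Print what the running CPU advertises.
void report_fp16_support() {
  const unsigned long hwcap = read_hwcap();
  std::printf("scalar fp16 arithmetic: %s\n",
              has_scalar_fp16_arith(hwcap) ? "yes" : "no");
  std::printf("SIMD fp16 arithmetic:   %s\n",
              has_simd_fp16_arith(hwcap) ? "yes" : "no");
}
```

Calling report_fp16_support() from main on the board in question should show whether vmulq_f16 can be expected to run. The same information should also appear as "fphp" and "asimdhp" in the Features line of /proc/cpuinfo on kernels that know about these bits.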
This won’t answer your question, but will probably be a start…

The ARM Cortex-A57 implements the ARMv8 architecture, which means that in its native 64-bit mode it won’t run older arm/arm32/aarch32 code; for that it must switch to a compatibility mode. That compatibility mode supports armhf with NEON, so in that mode ARMv7 code (with or without NEON) will run. The standard install of a JTX2 includes user space support for 64-bit ARMv8, but does not include linkers or other libraries for foreign architectures, and in this case arm/arm32/armhf is a foreign architecture. This is similar to a desktop PC whose native mode is 64-bit x86_64: if 32-bit compatibility libraries are installed, then i386 or i686 code can run, and it is considered a foreign architecture. One cannot mix arm/armhf/NEON and aarch64 instructions in the same program, since the CPU can’t be in both modes at the same time.
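To see which of the two modes a given build targets, the compiler's predefined macros can be checked. A minimal sketch, assuming GCC or Clang, which define __aarch64__ for 64-bit ARMv8 builds and __arm__ for 32-bit ARM builds; the TargetIsa enum and the function names are invented here for illustration.

```cpp
// Which instruction set is this translation unit being compiled for?
enum class TargetIsa { AArch64, AArch32, Other };

// Decided entirely at compile time from the predefined macros, so a single
// binary always reports exactly one instruction set.
constexpr TargetIsa target_isa() {
#if defined(__aarch64__)
  return TargetIsa::AArch64;
#elif defined(__arm__)
  return TargetIsa::AArch32;
#else
  return TargetIsa::Other;
#endif
}

// Human-readable label for the enum above.
constexpr const char* isa_name(TargetIsa isa) {
  return isa == TargetIsa::AArch64   ? "AArch64 (64-bit ARMv8)"
         : isa == TargetIsa::AArch32 ? "AArch32 (32-bit ARM, e.g. armhf)"
                                     : "not an ARM target";
}
```

Printing isa_name(target_isa()) from a native build on the Jetson should give the AArch64 label; an armhf cross build would give the AArch32 one and would need the 32-bit loader and libraries to run.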

One thing about gcc 7.x is that it had some ABI changes. I don’t know what the effect is on this particular kernel, but you may end up with some incompatibilities using existing linker/libraries in user space unless you’ve rebuilt those with gcc 7.x as well (I haven’t tried, but I wouldn’t be surprised if it works in most cases and fails in a few other cases).

From a new Anandtech article : http://www.anandtech.com/show/11441/dynamiq-and-arms-new-cpus-cortex-a75-a55/3 , it seems that FP16 will be natively available on future Cortex A75 processors (ARMv8.2), so I guess current ARM processors including the one in the TX2 are not able to perform FP16 operations…
Maybe in the next Jetson TX3 boards ???

@kennypewpew, what was your compile command (e.g., including linking) for that test code?

The compile command is:

g++ -march=armv8.2-a+fp16 [file]

Your gcc needs to be version 7.1 or later in order to find the correct intrinsics and to recognize the armv8.2-a+fp16 architecture.

Thanks for testing!

But does the A57 that’s in the TX2 support armv8.2? It doesn’t look like it on Wikipedia: https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores
It doesn’t help that some newer compiler supports it when the actual hardware we have doesn’t!

(And does Denver-2? Wikipedia doesn’t know which specific armv8 is supported at all.)

So far as I know the A57 does not support ARMv8.2.

Still, I am curious to see the test. @kennypewpew, did you cross compile with a Linaro 7.1 compiler? If cross compiling, was your supporting linker and library environment (sysroot plus files for your specific build) built to match that compiler release? One compiler release example is:

From what I can tell by the reference manual, it should. Apparently not. Reading more closely, it only claims to support half precision SIMD load/store/conversion but not any real operations, which does match my experience so far with another A57.
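Since load/store/conversion is available even without fp16 arithmetic, one workaround is to keep the data in fp16 and do the multiply in fp32. This is only a sketch, not something tested on a TX1/TX2: the function name mul_f16 is made up, and it assumes an AArch64 build with GCC's arm_neon.h. It uses vmulq_f16 only when the compiler defines __ARM_FEATURE_FP16_VECTOR_ARITHMETIC (for example with -march=armv8.2-a+fp16) and otherwise widens to float32 with vcvt_f32_f16.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__aarch64__)
#include <arm_neon.h>

// Elementwise out[i] = a[i] * b[i] on half-precision data.
void mul_f16(const float16_t* a, const float16_t* b, float16_t* out,
             std::size_t n) {
  assert(a != nullptr && b != nullptr && out != nullptr);
  std::size_t i = 0;
#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
  // Native path: 8 half-precision lanes per multiply (needs ARMv8.2-A FP16).
  for (; i + 8 <= n; i += 8) {
    vst1q_f16(out + i, vmulq_f16(vld1q_f16(a + i), vld1q_f16(b + i)));
  }
#else
  // Fallback path: load 4 halves, widen to float32, multiply, narrow back.
  for (; i + 4 <= n; i += 4) {
    const float32x4_t x = vcvt_f32_f16(vld1_f16(a + i));
    const float32x4_t y = vcvt_f32_f16(vld1_f16(b + i));
    vst1_f16(out + i, vcvt_f16_f32(vmulq_f32(x, y)));
  }
#endif
  // Scalar tail for any leftover elements, computed in float.
  for (; i < n; ++i) {
    out[i] = float16_t(float(a[i]) * float(b[i]));
  }
}
#endif  // __aarch64__; on other targets this block compiles to nothing
```

The fallback processes half as many lanes per instruction and adds conversions, so it gives up the throughput that native fp16 arithmetic would bring, but it still keeps the memory footprint of fp16 storage.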

…found the right page in the doc - looks like the A57 does only support ARMv8-A, so no ARMv8.2. Guess we’ll have to get access to an A75.

I compiled this natively on an ARM server with an early revision of the A57 using a freshly compiled gcc-7.1 and its corresponding libraries/headers. Since I do not actually own a Tegra device (the goal of this post is to see if it’s worth investing in one), I can’t help with the cross-compilation aspect. However, I would imagine you could accomplish the same thing by performing the same compilation process onboard the Tegra.