ARM Assembler on Jetson TK1

Hi guys,

I want to start developing applications that make use of the ARM NEON instruction set on the Jetson Tegra K1 board. ARM’s official site mentions armasm as the official assembler (and armclang as the compiler). My problem is that neither of these two programs is installed by default in the Linux4Tegra distro and I can’t find them in any repositories. Does anybody know where I can find and download the ARM Compiler kit?

You can use GCC to compile NEON code, although there are probably other options too. You can also cross-compile it on a PC and then copy the binary to the device before running it.
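
To illustrate the GCC route, here is a rough sketch of NEON assembly embedded in a C file through GCC's inline assembler, so no separate armasm is needed. It assumes a 32-bit ARM (armhf) GCC toolchain; the function name add4_f32 and the compile flags in the comment are illustrative assumptions, not something from this thread. On a non-ARM build it falls back to plain C so the same file still compiles.

```c
/* Hypothetical example: add4_f32 is an illustrative name, not a library call.
 * Assumed native build on the board (flags are an assumption):
 *   gcc -O2 -mcpu=cortex-a15 -mfpu=neon-vfpv4 -mfloat-abi=hard -c add4.c
 */
#include <stddef.h>

/* Computes out[i] = a[i] + b[i] for exactly 4 floats. */
void add4_f32(const float *a, const float *b, float *out)
{
#if defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    /* 32-bit ARM with NEON enabled: hand-written NEON instructions. */
    __asm__ volatile(
        "vld1.32  {q0}, [%[a]]     \n\t"  /* load 4 floats from a   */
        "vld1.32  {q1}, [%[b]]     \n\t"  /* load 4 floats from b   */
        "vadd.f32 q0, q0, q1       \n\t"  /* 4 additions at once    */
        "vst1.32  {q0}, [%[out]]   \n\t"  /* store 4 floats to out  */
        :
        : [a] "r"(a), [b] "r"(b), [out] "r"(out)
        : "q0", "q1", "memory");
#else
    /* Portable fallback so the file also builds on a PC. */
    for (size_t i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

GCC will also assemble standalone .S files, so the same instructions could live in a separate assembly source instead of inline.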

Have you considered using the GPU if you want something fast?

As mentioned above, GCC is the most common way to compile code for Jetson TK1, whether it is C or C++ or Assembly code or a mix. If you don’t have much or any experience in writing & compiling ARM Assembly code, I have an article explaining some of the basics at http://www.shervinemami.info/armAssembly.html. The article was written mostly for the Cortex-A8 or Cortex-A9 based iPhone 3GS/4 but it covers NEON SIMD (the main reason to use Assembly code on ARM) and mentions the few different options you have for writing Assembly code on ARM CPUs, and has links to other tutorials & resources. For older ARM devices you can definitely get better perf by using Assembly code, but on Cortex-A15 in Tegra K1, you will often get the same level of perf using NEON C Intrinsics and it will be much easier to write than Assembly code, so I recommend NEON C Intrinsics over ARM NEON assembly code for Jetson TK1.
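
For comparison, a minimal sketch of the NEON C Intrinsics approach, assuming GCC with arm_neon.h available (the function name add_f32 is made up for illustration). The intrinsics used here (vld1q_f32, vaddq_f32, vst1q_f32) are standard ones from arm_neon.h; the scalar loop handles leftover elements and also serves as the fallback when NEON is not enabled.

```c
/* Hypothetical example: add_f32 is an illustrative name. */
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Computes out[i] = a[i] + b[i] for n floats. */
void add_f32(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process 4 floats per iteration in 128-bit NEON registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail (or the whole array when NEON is unavailable). */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The compiler picks the registers and schedules the loads and stores here, which is the part you would otherwise do by hand in assembly.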

Jetson TK1 uses Cortex-A15, so it is noticeably more advanced & faster than Cortex-A9, but as mentioned above, you will often get even better perf if you use CUDA for GPU acceleration, and CUDA code is much easier to write than NEON code, so I recommend CUDA over NEON (assuming you are doing lengthy operations).

Thanks a lot for the answers, I basically want to do sort of a design space exploration to see what kind of performance I can get out of the different choices that I have available on this board (CUDA, NEON, multicore, etc.). So that’s why I want to write NEON code.

Indeed, I’ve read that hand-written assembly would be much faster than NEON intrinsics, and that’s why I want an assembler. ShervinE, is there any official source for intrinsics being the same as hand-written assembly on A15, or is it just from your personal experience?

It’s just from my experience. Basically, NEON Intrinsics are intended to give the same perf as assembly code in most cases, but realistically, GCC is quite bad at ARM code optimization and so it will typically generate code that isn’t quite as good as hand-written NEON assembly code. On a Cortex-A8 or even Cortex-A9 you will probably notice the difference in perf between NEON Intrinsics and NEON assembly code, but Cortex-A15 is considerably more advanced than Cortex-A8 & Cortex-A9 and it includes hardware-level optimizations inside the CPU, so you often won’t see a difference in perf between NEON Intrinsics and NEON assembly code. There will definitely be times when GCC does a really terrible job with NEON Intrinsics, so if you use Intrinsics and then see bad perf, it is a good idea to look at the generated assembly instructions to figure out if it is doing something awful like processing everything from DRAM instead of NEON registers, or if your code just isn’t suited to NEON or you implemented it badly.
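
As a rough sketch of that kind of check, here is a small intrinsics kernel whose generated assembly is easy to inspect. The function name dot_f32, the file name, and the compile flags in the comment are assumptions for illustration. GCC's -S option writes the assembly to a .s file; in the inner loop you would hope to see the accumulator stay in a q register rather than being stored and reloaded on every iteration.

```c
/* Hypothetical example: dot_f32 and dot.c are illustrative names.
 * To inspect what GCC generated (flags are an assumption):
 *   gcc -O3 -mcpu=cortex-a15 -mfpu=neon-vfpv4 -mfloat-abi=hard -S dot.c
 * then read dot.s and look at the loop body.
 */
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Returns the sum of a[i] * b[i] over n floats. */
float dot_f32(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four partial sums that should live in one NEON register. */
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    /* Horizontal add of the 4 lanes into a single float. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    half = vpadd_f32(half, half);
    sum = vget_lane_f32(half, 0);
#endif
    /* Scalar tail (or the whole array when NEON is unavailable). */
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Note that the NEON path adds the products in a different order than a plain scalar loop, so results can differ slightly for general floating-point inputs.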

By the way, I gave a basic summary of NEON & multi-core vs CUDA differences in OpenCV on Jetson TK1 at the top of http://elinux.org/Jetson/Computer_Vision_Performance.