How can I get better compute speed on the Jetson TK1 using ARM NEON?

For library compatibility reasons I cannot use CUDA and the GPU, so I can only use the CPU, and I want to speed up my code with ARM NEON.
When I compile my code with NEON options such as “-march=armv7-a -mfpu=neon -mfloat-abi=hard -O3”, it does not run any faster; NEON seems to do nothing for the speed. What can I do to get better performance with NEON?

Those options make the NEON hardware available and optimize the code for speed, but your program has to actually use the NEON instruction set before it helps (NEON participation is not automatic). One way is for the libraries your program links against to use NEON; another is to use NEON instructions directly in your program (the former is probably preferable, since NEON code is typically written in assembler).
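To illustrate what "directly using NEON" can look like, here is a minimal sketch using the intrinsics from `arm_neon.h`. The kernel `saxpy` is a hypothetical example of my own, not something taken from Caffe or from the original poster's code, and the scalar fallback is only there so the same file also builds on a machine without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* y[i] = a * x[i] + y[i]  (hypothetical example kernel) */
void saxpy(float a, const float *x, float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    size_t i = 0;
#if HAVE_NEON
    /* Process four floats per iteration in one 128-bit Q register. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);   /* load x[i..i+3] */
        float32x4_t vy = vld1q_f32(y + i);   /* load y[i..i+3] */
        vy = vmlaq_n_f32(vy, vx, a);         /* vy = vy + vx * a */
        vst1q_f32(y + i, vy);                /* store back to y */
    }
#endif
    /* Scalar tail; also the whole loop when NEON is not available. */
    for (; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

On the TK1 this would be built with the same options as in the question (“-march=armv7-a -mfpu=neon -mfloat-abi=hard -O3”); whether it is actually faster than the plain loop depends on the data size and memory traffic, so it should be measured.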

As for not being able to use CUDA and the GPU: was there an error, or is there some unrelated reason? If you are trying to use CUDA with the wrong architecture support (for example, using ssh from a remote host, which can unexpectedly transfer GPU functions to the other computer), this can be fixed. An additional requirement for CUDA is that the user must be a member of group “video”, so a new user other than “ubuntu” or “nvidia” will fail to use CUDA unless added to group “video”.

Thanks a lot for your reply!
The Caffe libraries my code links against could not be compiled with CUDA; the NVCC build failed. After searching the web, my guess is that the CUDA version on my Jetson TK1, 6.5, is too old to support Caffe, and the Jetson TK1 cannot be updated to a newer CUDA version, so I turned to CPU optimization.

The version is an issue for many people, and unfortunately I think 32-bit itself makes it difficult or impossible to support newer CUDA versions.

Keep in mind that the Caffe libraries would themselves need to use NEON for this to help. In some cases Caffe may itself call other libraries; somewhere in the chain of library calls NEON must actually be used for there to be a benefit. Not all computation types benefit from NEON, so in some cases NEON cannot help regardless of the effort put into customization. Here is a short description of NEON:

Could I think of it this way: similarly, how much CUDA and the GPU speed up computation on the Jetson TK1 depends on many factors. Even if my code and the libraries it links against are compiled entirely with NVCC, the performance on the Jetson TK1 does not necessarily improve; it only improves when I use CUDA functions instead of general functions at the time-critical places in my program?

This seems like a broad question…someone else may have a better answer.

CUDA has some similarities to threaded programming, which implies that not only do you need threaded function calls (or a CUDA kernel), you also need data that can take advantage of multiple simultaneous parallel operations. For example, if you are processing an image, CUDA would likely be a benefit, but if you only work on one line of the image at a time, CUDA will actually be slower. You could perhaps work on hundreds of lines of data at the same time without the time increasing, in which case you would get hundreds of times more work done with very little increase in time. With CUDA, processing a small image might be slightly slower, but processing a very large image could take almost no extra time. CUDA scales well, but if the data does not fit the model, CUDA slows things down.

If the data can benefit from CUDA, and if the library you link to uses CUDA, then linking to the CUDA version will speed things up. Much depends on the data.

If you want to optimize your code to take advantage of the NEON instructions, you have 4 possibilities:

  1. Let the compiler vectorize the code for you. In my experience with compilers (gcc and clang on x86/ARM/Power, icc on x86), in most cases this either does not work at all or works only on very simple code, and you can get faster code by using intrinsics.
  2. Write a NEON version of your code using intrinsics (or assembly): reference here
  3. Use an optimized library if one is available (lots of libraries are not well optimized, and you can often beat them by writing your own implementation; this is the case for FFTW, OpenCV, …).
  4. Buy a bSIMD or Arch-R licence from Numscale, which provide, respectively, a high-level abstraction for SIMD instructions and a mathematical library for image/signal processing.
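For the first possibility, a loop has the best chance of being auto-vectorized when it is simple, has a countable trip count, and the compiler can prove the pointers do not overlap. The sketch below (the function name `scale_add` is hypothetical) uses C99 `restrict` for that. One point that may explain why the options in the original question made no difference: GCC's documentation for the ARM `-mfpu` option states that floating-point loops are not auto-vectorized for NEON unless `-funsafe-math-optimizations` (implied by `-ffast-math`) is also given, because ARMv7 NEON does not fully implement IEEE 754 (denormals are flushed to zero). Treat the build line in the comment as an assumption to adapt, not a recommendation, since that flag relaxes floating-point accuracy guarantees.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Possible build line on a 32-bit ARM target with GCC (an assumption,
 * adjust to your toolchain):
 *   gcc -std=c99 -O3 -march=armv7-a -mfpu=neon -mfloat-abi=hard \
 *       -funsafe-math-optimizations -c scale_add.c
 * Recent GCC versions can report vectorized loops with -fopt-info-vec.
 */

/* dst[i] = a[i] + k * b[i]; restrict promises the arrays do not overlap. */
void scale_add(float *restrict dst,
               const float *restrict a,
               const float *restrict b,
               float k, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + k * b[i];
}
```

If the compiler's report shows the loop was not vectorized, or the result is still slow, that is the point at which writing the kernel with intrinsics (the second possibility) becomes worth the effort.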

Another thing to take into account is the hardware implementation of NEON. On the TK1, the speedup I get on a lot of code (matrix operations, image processing, …) is no greater than 2x, which leads me to conclude that the NEON implementation is not fully 128-bit but 2x 64-bit (like AMD's AVX implementation on the new Zen architecture): when a NEON instruction is applied to two 128-bit registers, it is split into two 64-bit instructions applied to the first and second halves of the registers. On the TX1, the same code gives more coherent speedups (nearly 4x).
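A speedup figure like the 2x or 4x above can be checked on your own board with a small timing harness. The following is only a sketch: the kernels `scale2_scalar` and `scale2_simd` and the helper `best_time` are hypothetical names, and the operation (multiply by 2) is chosen for simplicity rather than realism.

```c
#define _POSIX_C_SOURCE 199309L  /* clock_gettime under strict -std=c99 */
#include <assert.h>
#include <stddef.h>
#include <time.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

typedef void (*kernel_fn)(float *dst, const float *src, size_t n);

/* Scalar baseline: dst[i] = 2 * src[i]. */
void scale2_scalar(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];
}

/* Same operation, four floats at a time where NEON is available. */
void scale2_simd(float *dst, const float *src, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4)
        vst1q_f32(dst + i, vmulq_n_f32(vld1q_f32(src + i), 2.0f));
#endif
    for (; i < n; ++i)              /* tail, or full scalar fallback */
        dst[i] = 2.0f * src[i];
}

/* Best-of-reps wall-clock time, in seconds, for one call of fn. */
double best_time(kernel_fn fn, float *dst, const float *src,
                 size_t n, int reps)
{
    assert(fn != NULL && reps > 0);
    double best = -1.0;
    for (int r = 0; r < reps; ++r) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fn(dst, src, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double dt = (double)(t1.tv_sec - t0.tv_sec)
                  + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}
```

The speedup is then `best_time(scale2_scalar, ...) / best_time(scale2_simd, ...)`. Two caveats: at `-O3` the compiler may vectorize the scalar baseline itself, so building that function with `-fno-tree-vectorize` gives a cleaner comparison; and with buffers much larger than the cache the measurement reflects memory bandwidth more than the width of the NEON unit.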


  1. Caffe can work on the TK1. Here is a tutorial:

  2. You can maximize the CPU/GPU frequency to get the best performance.