Speed comparison between CPU NEON and CUDA


I found this article:

If you look at the two graphs on pages 8 and 9, they compare the medianblur operation on CPU, CPU NEON, and CUDA.
These two graphs seem to show that the CPU NEON version is faster than the CUDA version for a medianblur. Am I wrong?

Is the CUDA version actually faster than the CPU NEON version?

It looks like those comparisons were made on a Tegra 4 (probably similar to a Tegra K1 if you remove the K1’s GPU). I’m not sure how one could carry them over to a Tegra K1/X1, since the K1/X1 GPU makes it something of an apples-to-oranges comparison. The NEON instruction set is obviously quite a boost compared to purely CPU-based code.

I have no knowledge of how those graphs were produced, but typically a CUDA speedup relies on a high number of threads, whereas CPU/NEON runs a small number of very fast threads. That means CUDA has the opportunity to scale far better than CPU/NEON. I also suspect that the cache is probably disabled for the GPU operations while it remains active for CPU/NEON, so latency would go up on CUDA versus CPU/NEON, but throughput with a large thread count would potentially be far higher on CUDA.

Thank you for the answer.

Another question, about the CPU NEON registers: are the registers common to all CPU cores, or does each core have its own registers?

In fact, I would like to create a multithreaded program and run NEON instructions on each thread.

So far as I know, each core has its own NEON unit (with its own registers) and the cores are identical, so the instruction sets of each core should match. The most notable difference is that CPU0 is the only core which can handle hardware IRQs. So long as your use of NEON does not occur in a hardware driver, nothing special is required beyond normal threading.
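
To illustrate the "normal threading" point, here is a minimal sketch of running NEON code on several threads. It is only an example under assumptions: the function names `add_range` and `parallel_add` are hypothetical, the workload (adding two float arrays) is chosen for brevity and has nothing to do with the medianblur benchmark, and the NEON path is only compiled when the compiler defines `__ARM_NEON`. On other targets it falls back to scalar code so it still builds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Adds a[i] + b[i] into out[i] for i in [begin, end).
// Uses NEON intrinsics when targeting an ARM core with NEON,
// and plain scalar code otherwise (assumption: __ARM_NEON is the guard).
void add_range(const float* a, const float* b, float* out,
               std::size_t begin, std::size_t end) {
    std::size_t i = begin;
#if defined(__ARM_NEON)
    // Process four floats per iteration in 128-bit NEON registers.
    for (; i + 4 <= end; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    // Scalar tail (or the whole range when NEON is not available).
    for (; i < end; ++i) out[i] = a[i] + b[i];
}

// Hypothetical helper: splits the arrays into contiguous chunks and
// hands each chunk to its own thread. Threads write disjoint ranges,
// so no locking is needed.
std::vector<float> parallel_add(const std::vector<float>& a,
                                const std::vector<float>& b,
                                unsigned num_threads) {
    assert(a.size() == b.size() && num_threads > 0);
    std::vector<float> out(a.size());
    const std::size_t chunk = (a.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(a.size(), t * chunk);
        const std::size_t end = std::min(a.size(), begin + chunk);
        if (begin == end) break;  // no work left for further threads
        workers.emplace_back(add_range, a.data(), b.data(), out.data(),
                             begin, end);
    }
    for (auto& w : workers) w.join();
    return out;
}
```

Calling `parallel_add(a, b, 4)` would use up to four threads. The threads need no NEON-specific coordination: the operating system saves and restores the NEON registers on a context switch, so each thread can treat them as its own.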