Jetson Nano CPU and GPU sample program


I'm new to the Jetson Nano board.
I tried many simple examples of vector addition on the Jetson Nano GPU using CUDA, but I did not see a processing-time difference between the CPU code and the GPU code. Can anyone guide me, or give me a link to basic sample code that shows the difference in processing time between CPU and GPU?

Thanks in advance.


You can find some examples in our CUDA samples.
Ex. /usr/local/cuda-10.0/samples/0_Simple/matrixMulCUBLAS/

There are both a CPU implementation and a GPU implementation.
Although we only show the GPU-based execution time, you can simply add some code to report the CPU time as well.


Also beware that the CPU/GPU synchronization overhead is significant. It’s not as bad as on a desktop GPU (because the Jetson’s CPU and GPU share physical memory), but it’s still substantial. As a rule of thumb, every time you launch a CUDA kernel, if you’re processing fewer than, say, 10,000 elements in that one call, chances are it would have been lower latency to just do the work on the CPU. Those 10,000 elements could be a 100x100 matrix, a single long convolution, or something similar – this is a rule of thumb, not an exact science.

When you say “simple examples of vector addition,” my guess is that the examples don’t actually contain enough work to make the GPU worthwhile; instead the run probably spends most of its time remapping memory and synchronizing rather than computing.