My program has a for-loop in which each iteration carries out an expensive calculation. I used pthreads to divide the loop among 4 threads, so that each thread processes only 1/4 of the iterations. However, I did not see much speedup, and I am not sure why. How can I tell whether the threads are allocated to different cores and running in parallel?
You can check which core a thread is currently running on with the PSR field of ps (the -L option lists each thread on its own line):
ps -L -o pid,lwp,psr,comm
If you want to bind a thread to a given core, you may read this topic.
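As a rough illustration of binding threads to cores, here is a minimal sketch assuming Linux with glibc, where pthread_setaffinity_np and sched_getcpu are available as GNU extensions. The helper names pin_self_to_core, worker and run_pinned, and the MAX_THREADS limit, are made up for this example and are not from the original program. Each thread pins itself to the requested core and then reports the core it actually ended up on.

```c
#define _GNU_SOURCE          /* exposes pthread_setaffinity_np and sched_getcpu */
#include <assert.h>
#include <pthread.h>
#include <sched.h>

#define MAX_THREADS 16       /* arbitrary limit chosen for this sketch */

/* Hypothetical helper: restrict the calling thread to a single core.
 * Returns 0 on success, or an error number from pthread_setaffinity_np. */
static int pin_self_to_core(int core)
{
    cpu_set_t set;

    assert(core >= 0 && core < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Thread body: arg points to an int holding the requested core.
 * On return it holds the core the thread was observed on, or -1 if
 * pinning failed. */
static void *worker(void *arg)
{
    int *core = arg;

    if (pin_self_to_core(*core) != 0) {
        *core = -1;
        return NULL;
    }
    /* The thread's share of the expensive loop would run here. */
    *core = sched_getcpu();
    return NULL;
}

/* Start n threads, pinning thread i to cores[i]; observed[i] receives the
 * core that thread i reported. Returns 0 if every thread was started. */
static int run_pinned(const int cores[], int observed[], int n)
{
    pthread_t tid[MAX_THREADS];
    int started = 0;

    if (n < 1 || n > MAX_THREADS)
        return -1;
    for (int i = 0; i < n; i++) {
        observed[i] = cores[i];
        if (pthread_create(&tid[i], NULL, worker, &observed[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return started == n ? 0 : -1;
}
```

For the 4-thread case in the question, one would call run_pinned with cores {0, 1, 2, 3} and print the observed array; build with gcc -pthread. If the observed cores differ from each other and the speedup is still small, the bottleneck is probably not thread placement.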
For testing I’d also make sure clocks are maxed out:
sudo ~nvidia/jetson_clocks.sh