TX2 always runs on one CPU core, no matter how many threads are launched.

Hi experts,

Here I encountered a strange problem which is mentioned as the Title.

First, I ran the ‘lscpu’ command, and it shows the following:
https://pan.baidu.com/s/1nuDHWAt

It shows that there are 4 CPU cores online.

But when I launch 32 threads in a test program, the system seems to put all of them on only one CPU core, which is very strange. The test program, along with what the ‘top’ command and the System Monitor show, is below:

  • 1. Test program:

    ```
    #include <iostream>
    #include <stdio.h>
    #include <omp.h>
    using namespace std;

    int main(int argc, char *argv[]){
        long result = 0;

        // 32 OpenMP threads share the iterations of this loop
        #pragma omp parallel for num_threads(32)
        for(long i = 0; i < 100000000; ++i){
            printf("This is thread %d\n", omp_get_thread_num());
        }

        cout << "result:" << result << endl;
        return 0;
    }
    ```

    2. What the ‘top’ command and System Monitor show: https://pan.baidu.com/s/1qYRzpX6
    
    Please help me with this problem.
    Thanks!
  • Please, is somebody here?

    Hi oprell,

    Please check with below command.

    $ ps -o spid,psr -T -p 3140

    Hi vickyy,

    I ran the command you suggested, but it only shows the following.

    nvidia@tegra-ubuntu:~$ ps -o spid,psr 3140
     SPID PSR
    

    There’s nothing useful.

    You have to adjust the “-p 3140” to be the PID of the process…which is a moving target.

    Hi, linuxdev

    Thanks for your hint.

    I ran the command ‘ps -o spid,psr -T -p 3221’, and I got the output below.

    SPID PSR
     3221   0
     3222   0
     3223   0
     3224   3
     3225   3
     3226   0
     3227   0
     3228   0
     3229   3
     3230   0
     3231   0
     3232   3
     3233   3
     3234   3
     3235   0
     3236   0
     3237   0
     3238   0
     3239   0
     3240   0
     3241   0
     3242   0
     3243   0
     3244   0
     3245   0
     3246   3
     3247   0
     3248   0
     3249   3
     3250   0
     3251   3
     3252   0
    

    I launched 32 threads in my test program.
    It seems that the program executes on only one CPU core. What’s wrong with it?

    My guess is that OpenMP is not working the way you assume it will.
    This can be because of compiler options (for example, GCC ignores the OpenMP pragmas entirely unless the program is compiled with -fopenmp), because of runtime options, or because of other problems.

    What happens if you use pthreads instead of the OpenMP pragmas?
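
    In case it helps, here is a minimal pthreads sketch of that idea. It is not the original poster’s code; the worker body (a busy loop that reports its core via sched_getcpu()) is just a placeholder to make thread placement visible.

    ```
    // Minimal pthreads sketch (assumed example, not from the original post):
    // each worker spins on a private counter, then reports the core it ended on.
    // Compile with: g++ -O2 -pthread pthread_test.cpp -o pthread_test
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    static void *worker(void *arg){
        long id = (long)arg;
        volatile long local = 0;                 // busy work so the thread stays runnable
        for (long i = 0; i < 100000000; ++i){
            ++local;
        }
        // sched_getcpu() reports the core this thread happens to be on right now
        printf("thread %ld finished on core %d\n", id, sched_getcpu());
        return nullptr;
    }

    int main(){
        const int kThreads = 32;                 // same count as the OpenMP test
        pthread_t threads[kThreads];

        for (long t = 0; t < kThreads; ++t){
            pthread_create(&threads[t], nullptr, worker, (void *)t);
        }
        for (int t = 0; t < kThreads; ++t){
            pthread_join(threads[t], nullptr);
        }
        return 0;
    }
    ```

    If this version spreads across cores while the OpenMP version does not, that would point at the OpenMP build or runtime settings rather than at the scheduler.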

    I’m not familiar with OpenMP, but on my side it looks different from yours. FYI.

    nvidia@tegra-ubuntu:~$ ps -o spid,psr -T -p 30550
     SPID PSR
    30550   4
    30551   5
    30552   3
    30553   5
    30554   4
    30555   4
    30556   3
    30557   3
    30558   5
    30559   3
    30560   5
    30561   5
    30562   4
    30563   5
    30564   3
    30565   5
    30566   4
    30567   3
    30568   4
    30569   0
    30570   5
    30571   5
    30572   3
    30573   5
    30574   3
    30575   3
    30576   4
    30577   0
    30578   4
    30579   3
    30580   5
    30581   4
    

    Your earlier ‘ps -o spid,psr -T -p 3221’ output shows two CPU cores, not one (core 0 and core 3). Performance mode may modify which cores are available (except core 0, which is always available).

    Prior to testing, try maximizing performance mode:

    # To see what is enabled:
    sudo cat /sys/devices/system/cpu/online
    # To see available modes:
    sudo nvpmodel -p --verbose
    # Set performance:
    sudo nvpmodel -m0
    # Also:
    sudo /home/ubuntu/jetson_clocks.sh
    

    Try your test when all cores are guaranteed online.
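
    As a quick sanity check from inside a program, something like the following sketch (my own example, not from this thread) reports how many cores the kernel currently has online, which should match what /sys/devices/system/cpu/online shows.

    ```
    // Sketch: compare how many cores are currently online with how many exist.
    #include <unistd.h>
    #include <cstdio>

    int main(){
        long online = sysconf(_SC_NPROCESSORS_ONLN);  // cores online right now
        long total  = sysconf(_SC_NPROCESSORS_CONF);  // cores the system knows about
        printf("online cores: %ld of %ld\n", online, total);
        return 0;
    }
    ```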

    Consider also that unless you specifically force a given core, the scheduler may be picking the best core. The obvious answer of always using all cores may not actually be the correct answer. The problem is that cache has a lot to do with performance, and a cache miss is very expensive relative to a cache hit. If the threads share data, it may be a case of the scheduler trying to take advantage of cache. Whether you would want to override this might depend on whether you believe the speed bottleneck is compute bound or data bound. Once all cores are enabled, you probably need to profile before you try to outguess the scheduler.

    I tried these commands, but they don’t seem to have any effect. Out of 6 cores, only 4 are activated on my TX2. Any suggestions, please!

    If you run “sudo nvpmodel -m 0”, then all cores should be active. This doesn’t mean all cores will have software running on them, but it does mean all cores are available.

    For a more intuitive test, you can install htop (“sudo apt-get install htop”), then run “htop”. You will see cores listed as bar charts at the top. See if each core varies a bit as you do different things on the system…web browsing is probably a good way to make things jump around.

    I do not know about OpenMP, but there are things to consider even with ordinary thread models. The system has a scheduler, and unless something has intervened, the core a process or thread runs on is entirely up to the scheduler. Core affinity can be used to put a process on a single core, but spreading threads out from a single process is a lot more complicated.

    The scheduler is aware of cache. Any time you migrate from one core to another you will tend to have a cache miss instead of a cache hit. In cases where the scheduler is keeping things on one core it may in fact be because this is faster due to fewer cache misses.

    If you are interested in forcing certain threads or processes to a given core, then you will want to understand process/thread affinity. Here is a good reference to consider:
    https://www.linuxjournal.com/article/6799

    There are variations on this, e.g., affinity for kernel threads (kthreads), which applies to driver modules even though they are not user-space applications.
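
    As a rough illustration of the thread-level side of this (my own sketch, not code from the article or this thread), a single thread can be pinned to one core with the GNU affinity API:

    ```
    // Sketch: pin the calling thread to core 3 (an arbitrary choice; the core must be online).
    // Compile with: g++ -pthread affinity_test.cpp -o affinity_test
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    int main(){
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);                      // allow only core 3

        // Restrict the calling thread to the cores in 'set'
        int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (rc != 0){
            fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
            return 1;
        }
        printf("now running on core %d\n", sched_getcpu());
        return 0;
    }
    ```

    For a whole process rather than a single thread, the sched_setaffinity() call covered in the article above does the corresponding job.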

    Not having your threads distributed to all cores is not the same as having inactive cores, and it can in no way be considered a lack of a working core (cache hits and misses, the scheduler, threading models, and so on make this a very complicated topic).