Code execution slower after flashing Jetson TX2 with JetPack L4T 3.1

I encountered a strange issue. I unboxed my Jetson TX2 and ran the .sh file that ships with the TX2 to install Ubuntu. I then ran the script below, and it executed in 20 microseconds. The compilation command did not include any architecture-specific flags such as -march or -mtune. I then flashed the TX2 with JetPack (JetPack-L4T-3.1-linux-x64) and ran the script again; this time it takes 200 microseconds. No other programs are running, only whatever daemons run in the background after flashing.

Why is this 10x slower? How do I get back to the original installation? I did not copy it, by mistake. Please help.

Compilation command:

g++ main.cc -o main -std=c++11 -O3

Script

#include <iostream>
#include <vector>
#include <chrono>
#include <cstdint>  // uint32_t

int main()
{
    // Three vectors of 100000 floats each
    std::vector<float> v1(100000, 2.0f);
    std::vector<float> v2(100000, 1.5f);
    std::vector<float> v3(100000, 1.5f);

    auto tick = std::chrono::high_resolution_clock::now();

    // Timed region: element-wise multiply
    for (uint32_t i = 0; i < 100000; ++i)
        v3[i] = v1[i] * v2[i];

    auto tock = std::chrono::high_resolution_clock::now();

    // Elapsed time in microseconds
    std::cerr << std::chrono::duration_cast<std::chrono::microseconds>(tock - tick).count() << "\n\n";

    return 0;
}

I searched the website, and it seems this is the factory image that ships with the TX2:
http://developer.nvidia.com/embedded/dlc/tx2-production-image

Have you boosted your TX2?

sudo nvpmodel -m0
sudo /home/ubuntu/jetson_clocks.sh

Hi,

Please try Honey_Patouceul’s commands to maximize TX2 performance first.

I can get your program to finish within 0.004 s.

nvidia@tegra-ubuntu:~$ time ./test 
153


real	0m0.005s
user	0m0.004s
sys	0m0.000s

Thanks.

Thank you guys for your kind reply.

@Honey_Patouceul:
Yes, I had run the jetson_clocks.sh script, but not nvpmodel -m0. After issuing both commands, the program now takes ~150 microseconds.

@AastaLLL:
Yes, I am getting the same results as well.

I am starting to wonder whether my initial results were somehow wrong (maybe I ran the for loop over only 10K indices; I probably had a typo in the code):

0.1 million ops in 20 usec means, in 1 sec: 0.1 x 10^6 x 10^6 / 20 = 5 x 10^9 ops per sec
0.1 million ops in 150 usec means, in 1 sec: 0.1 x 10^6 x 10^6 / 150 = 0.67 x 10^9 ops per sec

Do you guys think the former is even possible? I have not calculated the flops (x cycles/sec x flops/cycle)

Hi,

Setting nvpmodel to Max-N (mode=0) will enable the two Denver CPU cores. (Default=off)
jetson_clocks.sh will lock the CPU/GPU frequency to the maximum.

Please remember to run these two commands to get the best performance.
Thanks.

Hi,

I ran with the commands

sudo nvpmodel -m 2 and jetson_clocks.sh

I am getting roughly 80 to 110 microseconds. Thank you for that. This makes me wonder whether it is possible to extract even more performance from the ARM cores.

Is there something else I can do to boost the performance of the ARM cores?

Can you please confirm that this is indeed the image that ships with the Jetson TX2?
https://developer.nvidia.com/embedded/dlc/tx2-production-image

Thank you so much for all the help.

Hi,

Sorry for my typo. Max-N is mode 0, not mode 2. (I have already corrected the information in comment #6.)
Here is the nvpmodel information:

The maximum frequency of the TX2 CPU is 2.0 GHz (both A57 and Denver).
jetson_clocks.sh will lock the CPU frequency to 2.0 GHz (max) and give users the best performance.

Thanks.