CPU performance problem on Jetson TX1

Hi there,

I started working with the Jetson TX1 and found some problems with CPU performance, so I investigated it in more detail. I wrote a small program for testing CPU performance (it is just a matrix multiplication). I will attach this code.
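For reference, the kernel is essentially a naive triple loop with a single OpenMP pragma on the outer loop. A minimal sketch of that kind of code (the attached program may differ in details such as initialization and timing):

#define N 1000

int main() {
    // static arrays keep the sketch self-contained and zero-initialized;
    // any input values work for a timing test
    static int a[N][N], b[N][N], mult[N][N];

    // naive i-j-k multiplication; OpenMP splits the outer i loop across cores,
    // so each thread writes only its own rows of mult
    #pragma omp parallel for
    for(int i=0; i<N; ++i)
    for(int j=0; j<N; ++j)
    for(int k=0; k<N; ++k)
    {
        mult[i][j] += a[i][k]*b[k][j];
    }

    return 0;
}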

Tests:

  1. I added an OpenMP pragma to this code to use all the cores and compared the performance of the TX1 with the TK1. I got the following results.

matrix dimension: 1000
TX1
number of cores used, execution time (sec)
1 - 8
2 - 6.2
4 - 4.6

TK1
1 - 16
2 - 8.2
4 - 4.8

As we can see, 4 cores give a larger speedup on the TK1 than on the TX1: 3.33x versus 1.74x. That is very strange, because matrix multiplication is a good task for parallelization. So I tried increasing the size of the task.

matrix dimension: 1500
TX1
number of cores used, execution time (sec)
1 - 100
2 - 51
4 - 28

Here we get good acceleration, but I don't understand why. Maybe a dimension of 1000 is just too small a task for the TX1? Do you have any ideas about it?

  2. After this I did another test. The taskset utility sets the CPU affinity of a program, so I launched one matrix multiplication per core: each instance uses only one core, and each CPU core runs only one instance (launched as shown below).
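Concretely, the launch is one instance per console, with taskset pinning each to a different core (the binary name here is illustrative):

$ taskset -c 0 ./matmul   # console 1
$ taskset -c 1 ./matmul   # console 2
$ taskset -c 2 ./matmul   # console 3
$ taskset -c 3 ./matmul   # console 4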

number of launched instances, execution time per instance (sec)
TX1
1 - 8
2 - 9.8
3 - 13.3
4 - 26

TK1
1 - 16
2 - 16.7
3 - 17.5
4 - 18.9

These are very strange results: the performance of a single CPU core drops by more than 3x! Can anyone try to reproduce these results on their own TX1, or give some recommendations for avoiding the problem? This test reproduces the real case of a big system working on the Jetson TX1, and right now there are problems with that.

I also used the performance-maximizing scripts from here: Jetson/TX1 Controlling Performance - eLinux.org

On Jetson TX1 I use JetPack 2.3.

I got very strange results! Does anyone have any idea about these problems?

My code is here:
https://drive.google.com/file/d/0B5kxxqltJ-u9ZGxLYldCSmNFNFU/view?usp=sharing

Hi Alexander07K,
Thanks for your data …

For #1, that’s an interesting observation with matrix dimensions of 1000 and 1500. A few notes:

  • How many iterations did you run for the dimension of 1000?
  • You could use tegrastats (in the home directory after flashing the image) to double-check the number of CPUs running, their frequency, and the memory frequency.
  • For the dimension of 1000, where you did not see nearly 2x acceleration: if the code is not purely CPU-bound, some CPUs might at times sit idle waiting for memory transfers, or the CPU frequency might not always be at full speed. Tegrastats will tell you. In fact, the 2-CPU test case already shows the symptom: not nearly 2x performance.
  • One other item that might differ is the power governor policy for CPU frequency scaling (see the commands after this list for checking it).
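For example, the current governor and frequency can be read through the standard Linux cpufreq sysfs entries (the paths are standard; the values depend on the board):

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq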

We will look into it more in the meantime…

Hi Alexander07K,

I ran your code and can observe what you described above.

However, one thing is interesting: both the 1-core and 4-core runs show the same memory bandwidth usage, as below:

  • $ ./run_with_taskset.sh 0-3

RAM 1129/3994MB (lfb 75x4MB) cpu [100%,100%,100%,100%]@1734 EMC 1%@1600 AVP 0%@80 GR3D 0%@76 EDP limit 1734

  • $ ./run_with_taskset.sh 0

RAM 1128/3994MB (lfb 75x4MB) cpu [100%,2%,0%,0%]@1734 EMC 2%@1600 AVP 0%@80 GR3D 0%@76 EDP limit 1734

So I suspect this may not be a good example for checking the performance gain from multicore programming. Maybe there are some false-sharing issues in the code, which may introduce dependencies between cores.
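For reference, false sharing happens when threads write to distinct variables that happen to share a cache line; every write then invalidates the line in the other cores. A minimal illustration (not taken from the attached code):

#include <omp.h>

#define NTHREADS 4

int main() {
    // the four counters are logically independent, but they sit on one
    // cache line, so the line ping-pongs between cores on every write
    volatile int counters[NTHREADS] = {0};

    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for(long n=0; n<100000000; ++n)
            counters[id]++;
    }

    return 0;
}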

Hi chijen, thank you for the answer.

  1. I used 100 iterations for the 1000 dimension.
  2. I used the max performance script.
    RAM 1663/3997MB (lfb 297x4MB) cpu [100%,100%,100%,100%]@1734 GR3D 0%@998 EDP limit 0
  3. The 1000-dim test on the Jetson TK1 with 4 threads shows 3.33x acceleration, but on the Jetson TX1 we get only 1.74x. That means it is harder to fully load the Jetson TX1 than the Jetson TK1.
  4. I also found information that the max CPU frequency on the Jetson TX1 is 1.9 GHz, but on my Jetson TX1 it is only 1.7 GHz. Can you tell me what this depends on?

Hi vickyy,

Thanks for your answer.

Do you mean the memory bandwidth between RAM and CPU? Also, I don’t have the data "2%@1600 AVP 0%@80" in my tegrastats output.
I also think that 1% or 2% is very low memory bandwidth, which means my code is purely CPU-bound. What is the maximum memory bandwidth?
And if you open my code, you will see that it has no dependencies between cores. It is just a naive matrix multiplication.

Also, you only ran one program, on 1 core and on 4 cores with OpenMP. To check the performance gain from multicore programming, you should run 4 programs, each on its own core. That is where I got the more than 3x drop in per-core performance, which I think is the hardest problem for my purposes on the Jetson TX1. Have you repeated this test?
I also want to note that these problems do not occur on the Jetson TK1.

Hi Alexander07K,

  1. Please use “sudo ./tegrastats” to get EMC data

  2. Actually, this code may have a dependency between cores.
    That is, OpenMP breaks the task down along the k coordinate, and the += operator leads to a between-core dependency.

Try this version to prevent OpenMP from dividing the task along the k coordinate:

for(int k=0; k<N; ++k)
{
    // the parallel region is entered once per k; threads split the i loop,
    // so each thread updates only its own rows of mult
    #pragma omp parallel for
    for(int i=0; i<N; ++i)
    for(int j=0; j<N; ++j)
    {
        mult[i][j] += a[i][k]*b[k][j];
    }
}
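Note that this version enters the parallel region N times, once per k iteration, so it pays the fork/join overhead repeatedly; the upside is that within each region a thread only ever writes its own rows of mult.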

I tested it on the TX1:
1- 3.816s
2- 1.648s
3- 1.136s
4- 0.914s

RAM 744/3994MB (lfb 677x4MB) cpu [96%,96%,96%,97%]@1734 EMC 28%@1600 AVP 0%@80 GR3D 0%@76 EDP limit 1734

I think there is still a better way to write the test program, but the results make more sense now.

Hello AastaLLL,

Thanks for your support.

In my version, OpenMP breaks the task down along the i coordinate, because there is only one OpenMP pragma, placed before the outer ‘for’. So my code has no between-core dependency.
But I tried your code anyway. As I understand it, you tested dim 1000.

My results:
1-4.134s
4-3.2s

1 thread
RAM 1221/3997MB (lfb 444x4MB) cpu [3%,100%,0%,0%]@1734 EMC 6%@1600 AVP 2%@12 GR3D 0%@998 EDP limit 1734
4 threads
RAM 1221/3997MB (lfb 444x4MB) cpu [97%,98%,98%,98%]@1734 EMC 8%@1600 AVP 2%@12 GR3D 0%@998 EDP limit 1734

So I cannot reproduce your results. I used the performance-maximizing script from
http://elinux.org/Jetson/TX1_Controlling_Performance

Do you have any idea where the problem might be?

Still, your version performs better: with 1 thread it is about 2x faster than my version with 1 thread. As I understand it, your version works better with memory and cache than mine.

Hi Alexander07K,

You are right: OpenMP breaks the task down along the i coordinate, and there is no dependency between cores.
Sorry for the misunderstanding.

Running ~/jetson_clocks.sh maximizes TX1 performance.
With the best performance setting, I got:
1-1.156s
2-0.730s
4-0.499s

with the code here, since it performs better:

#pragma omp parallel for
for(int i=0; i<N; ++i)
for(int k=0; k<N; ++k)
for(int j=0; j<N; ++j)
{
    // i-k-j order walks b row by row, which is much friendlier to the cache
    mult[i][j] += a[i][k]*b[k][j];
}

I ran jetson_clocks.sh and tested your new version:
1-1.060s
4-0.650s

I also tested dim 1500:
1-3.5s
4-2.45s

That is very small acceleration: only 1.6x and 1.4x. The same tests on the Jetson TK1 give me 3.3x acceleration. Do you think this is a normal situation for the TX1?
I also repeated the second test with the last version of the program, dim 1500. I ran 4 programs, 1 thread each, in 4 different consoles:

./run_with_taskset.sh 0   # console 1
./run_with_taskset.sh 1   # console 2
./run_with_taskset.sh 2   # console 3
./run_with_taskset.sh 3   # console 4

number of launched instances, execution time per instance (sec)
1-3.4
2-11.1
3-16.5
4-19.9

Can you repeat this test?

Hi Alexander07K,

I think it is a good idea to select a program with fewer cache misses to evaluate CPU performance.

For example, in matrix multiplication:

#include <stdlib.h>

int main() {
    const int N = 1000;

    int *a    = (int*)malloc(N*N*sizeof(int));
    int *bt   = (int*)malloc(N*N*sizeof(int));   // bt is the transpose of b, used to lower cache misses
    int *mult = (int*)calloc(N*N, sizeof(int));  // zero-initialized, so the += below is well defined

    // fill the inputs; any values will do for a timing test
    for(int i=0; i<N*N; ++i)
    {
        a[i]  = i % 7;
        bt[i] = i % 5;
    }

    #pragma omp parallel for
    for(int i=0; i<N; ++i)
    for(int j=0; j<N; ++j)
    {
        int num = 0;
        for(int k=0; k<N; ++k)
        {
            // both a and bt are read sequentially here, so cache misses stay low
            num += a[i*N+k]*bt[j*N+k];
        }
        mult[i*N+j] += num;
    }

    free(a);
    free(bt);
    free(mult);

    return 0;
}
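For reference, a typical way to build and time a test like this, with the thread count varied per run (file and binary names are illustrative):

$ gcc -O2 -fopenmp matmul.c -o matmul
$ export OMP_NUM_THREADS=4   # set to 1, 2, or 4
$ time ./matmul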

I ran it with the best performance settings on the TX1 and got:
1-0.473s
2-0.239s
4-0.123s → 3.85x speedup

Also test for N=1500:
1-1.561s
2-0.796s
4-0.403s → 3.87x speedup

I also ran 4 programs, 1 thread each, in 4 different consoles, and got:
1- 0.473s
2- 0.469s
3- 0.472s
4- 0.472s

These results are pretty reasonable.

Hello, AastaLLL

Thank you for the quick response.

Great! But this means the Jetson TK1 works with memory (cache or something else) better than the TX1. Very interesting; why?

I got for N=1000:
1-0.472
4-0.218 → 2.16x speedup

Also test for N=1500:
1-1.59
4-0.75 → 2.12x speedup

I use jetson_clocks.sh.

Test with 4 programs for N=1000:
1-0.472
2-0.472
3-0.837
4-0.837

I checked the results many times, but I get results different from yours, and I don’t know why. Do you have any idea?

Hi Alexander07K,

On the TX1, the memory issue may become more serious because the faster CPU encounters cache misses more often.

Could you paste the tegrastats results for the no-execution, 1-core, and 4-core cases for analysis?

Hello, AastaLLL

Yeah, surely.

I tested one program for N=1500.

no-execution:
RAM 1237/3997MB (lfb 439x4MB) cpu [1%,0%,0%,0%]@1734 EMC 0%@1600 AVP 56%@12 GR3D 0%@998 EDP limit 1734

1 core:
RAM 1245/3997MB (lfb 439x4MB) cpu [0%,100%,0%,2%]@1734 EMC 0%@1600 AVP 14%@12 GR3D 0%@998 EDP limit 1734

4-core:
RAM 1245/3997MB (lfb 439x4MB) cpu [78%,78%,77%,77%]@1734 EMC 0%@1600 AVP 11%@12 GR3D 0%@998 EDP limit 1734

I also tried N=2500, but I got the same 2.12x speedup, and I saw a full system load.

Hi Alexander07K,

Sorry for the late reply.
Our QA has tested on different TX1 platforms but still can’t reproduce this issue.

  • Different number of CPU cores
    0 → 0.472s
    0-1 → 0.237s
    0-3 → 0.121s

  • Single core, different CPU IDs
    0 → 0.470s
    1 → 0.466s
    2 → 0.467s
    3 → 0.467s

Could you try Tegra System Profiler to get more system information? It can be downloaded directly via JetPack 2.3.

Please select "Collect PMU Counters" and check all three "L1 cache misses" boxes (read, write, instruction) to get cache-miss data.

For example:
https://drive.google.com/open?id=0B-fFMM_3Dj9JNFlfY3VVWkxZcnM

Thanks

Hi, AastaLLL

Thank you for your support.

Yeah, surely. I tested for N=1000.

https://drive.google.com/file/d/0B5kxxqltJ-u9UExpNENxc3c1ZjA/view?usp=sharing

Hi,

Could you paste the .qdrep file, which is usually located under ‘$HOME/.tegraprofiler’?

For example,
$HOME/.tegraprofiler/Projects/Project\ 1/Report\ 8.qdrep

Hi,

Yeah, no problem.

https://drive.google.com/file/d/0B5kxxqltJ-u9UlltZWp5QVhHck0/view?usp=sharing

Hi,

I have some new information about the problem. I ran the test on the official dev board, and the problem does not reproduce there. Our Jetson TX1 is a custom board with a TX1 module, so maybe there is some problem with it. What do you think?

Thank you

That’s important information.
We will reply when we have a further update.

Thanks!

Hi Alexander,

May I ask whether your custom TX1 means a TX1 module on a custom carrier board?

Thanks.