I started working with the Jetson TX1 and ran into some CPU performance problems, so I investigated in more detail. I wrote a small program for testing CPU performance (just a matrix multiplication); I will attach the code.
Tests:
I added an OpenMP pragma to this code to use all cores, and compared TX1 performance with TK1. We got the following results.
Matrix dimension: 1000
TX1
number of cores used, execution time (sec)
1 - 8 sec
2 - 6.2 sec
4 - 4.6 sec
TK1
1 - 16
2 - 8.2
4 - 4.8
As we can see, 4 cores give more acceleration on the TK1 than on the TX1: 3.33x vs. 1.74x respectively. That is very strange, because matrix multiplication is a good task for parallelization. So I tried increasing the task size.
Matrix dimension: 1500
TX1
number of cores used, execution time (sec)
1 - 100
2 - 51
4 - 28
Here we got good acceleration, but I don't understand why. Maybe a 1000-dim task is just too small for the TX1? Do you have any ideas about it?
After this I ran another test. The taskset utility sets CPU affinity for a process, so I launched independent matrix multiplications, one per core: each instance uses only one core, and each CPU core runs only one instance.
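The launch commands looked roughly like this (the binary name ./matmul is illustrative):

```shell
# Pin each benchmark instance to its own core with taskset.
# One instance on core 0:
taskset -c 0 ./matmul

# Four instances, one per core, started together:
for core in 0 1 2 3; do
  taskset -c "$core" ./matmul &
done
wait   # each instance reports its own execution time
```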
number of launched instances, execution time (sec) per instance
TX1
1 - 8
2 - 9.8
3 - 13.3
4 - 26
TK1
1 - 16
2 - 16.7
3 - 17.5
4 - 18.9
These results are very strange: performance on a single CPU core drops by more than 3x! Can anyone try to reproduce these results on their own TX1, or give recommendations for avoiding the problem? This test reproduces the real workload of a big system on the Jetson TX1, and right now there are problems with it.
For #1, that's an interesting observation with matrix dimensions of 1000 and 1500. A few notes:
How many iterations did you run for dim 1000?
You could use tegrastats (in the home directory after flashing the image) to double-check the number of CPUs running, their frequency, and the memory frequency.
For the dimension-1000 case where you did not see close to 2x acceleration: if the code is not purely CPU bound, a CPU might be idle at some point waiting for memory transfers. Or the CPU frequency might not always be at full speed; tegrastats will be able to tell you. In fact, the 2-CPU test case already shows the symptom: not nearly 2x performance.
Another item that might differ is the power-governor policy for CPU frequency scaling.
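The governor can be checked and forced from userspace through the standard Linux cpufreq sysfs nodes (paths below are the usual ones; the write needs root):

```shell
# current governor and current/available frequencies for core 0
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies

# pin the governor to maximum performance (as root)
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```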
So I suspect this isn't a good example for measuring the performance gain from multicore programming. Maybe there are false-sharing issues in the code, which may introduce dependencies between cores.
I used the max performance script.
RAM 1663/3997MB (lfb 297x4MB) cpu [100%,100%,100%,100%]@1734 GR3D 0%@998 EDP limit 0
The 1000-dim test on the Jetson TK1 with 4 threads shows 3.33x acceleration, but on the Jetson TX1 we get only 1.74x. That means it is harder to fully load the Jetson TX1 than the Jetson TK1.
Also, I found information that the max CPU frequency on the Jetson TX1 is 1.9 GHz, but on my Jetson TX1 it is only 1.7 GHz. Can you tell me what this depends on?
Do you mean memory bandwidth between RAM and CPU? Also, I don't have the "2%@1600 AVP 0%@80" fields in my tegrastats output.
Also, I think 1% or 2% is very low memory utilization, which suggests my code is purely CPU bound. What is the maximum memory bandwidth?
And if you open my code you will see that it has no dependency between cores; it is just a naive matrix multiplication.
Also, you only ran one program, on 1 core and on 4 cores with OpenMP. To check the performance gain from multicore programming you should also run 4 programs, each pinned to its own core, which is where I saw per-core performance drop by more than 3x! I think this is the hardest problem for my purposes on the Jetson TX1. Have you repeated this test?
I also want to note that these problems do not occur on the Jetson TK1.
Actually, this code may have a dependency between cores.
That is, OpenMP breaks the task down along the k coordinate, and the += operator then creates a between-core dependency.
Try this version, which prevents OpenMP from dividing the task along the k coordinate.
In my version OpenMP breaks the task down along the i coordinate, because there is only one OpenMP pragma, before the outer 'for' loop, so my code has no between-core dependency.
But I tried your code anyway. As I understand it, you tested 1000 dim.
Your version does perform better: with 1 thread it is about 2x faster than my version with 1 thread. As I understand it, your version works better with memory and cache than mine.
I ran jetson_clocks.sh and tested your new version:
1-1.060s
4-0.650s
Also I tested 1500 dim:
1-3.5s
4-2.45s
That is very small acceleration, only 1.6x and 1.4x. The same tests on the Jetson TK1 give me 3.3x acceleration. Do you think this is a normal situation for the TX1?
I also repeated the second test with the latest version of the program, 1500 dim: I ran 4 programs, 1 thread each, in 4 different consoles.
I think it is a good idea to pick a program with fewer cache misses to evaluate CPU performance.
For example, in matrix multiplication:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
    const int N = 1000;
    int *a    = (int*)malloc(N*N*sizeof(int));
    int *bt   = (int*)malloc(N*N*sizeof(int)); // bt is the transpose of b, used to lower cache misses
    int *mult = (int*)malloc(N*N*sizeof(int));
    for (int i = 0; i < N*N; ++i) { a[i] = 1; bt[i] = 1; } // initialize the inputs

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            int num = 0;
            for (int k = 0; k < N; ++k)
                num += a[i*N+k] * bt[j*N+k]; // both operands are read sequentially in memory
            mult[i*N+j] = num;               // '=' instead of '+=': mult was never initialized
        }
    printf("%.3f s\n", omp_get_wtime() - t0);

    free(a);
    free(bt);
    free(mult);
    return 0;
}
I ran it with maximum performance settings on the TX1 and got:
1-0.473s
2-0.239s
4-0.123s → 3.85x speedup
Also test for N=1500:
1-1.561s
2-0.796s
4-0.403s → 3.87x speedup
I also ran 4 programs, 1 thread each, in 4 different consoles, and got:
1- 0.473s
2- 0.469s
3- 0.472s
4- 0.472s
I have some new information about the problem. I ran the test on the official dev board, and the problem does not occur there. Ours is a custom board carrying the Jetson TX1 module; maybe there is some problem with it. What do you think?