Matrix Multiplication is slow on Denver

Hi, I am comparing the performance of the TX2's Denver and A57 cores using a matrix multiplication code.
I see that when the matrix size is greater than 64, Denver is slower than the A57. Performance also differs depending on whether I use the heap or the stack (i.e. Denver is slow).
Does anyone have a lead on what causes this?
Code:

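#include <cstdlib>   // atoi
#include <iostream>
#include <chrono>
#include <omp.h>     // omp_get_wtime
#define HEAP 0       // 1: one heap allocation per row, 0: arrays on the stack
using namespace std;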
void warmup2(int SIZE){
#if HEAP
  double** a = new double*[SIZE];
  double** b = new double*[SIZE];
  double** c = new double*[SIZE];
  for(int i = 0; i < SIZE; ++i) { 
    a[i] = new double[SIZE];
    b[i] = new double[SIZE];
    c[i] = new double[SIZE];
  }
#else
  // Variable-length arrays on the stack: a compiler extension in C++ (accepted by GCC and Clang)
  double a[SIZE][SIZE];
  double b[SIZE][SIZE];
  double c[SIZE][SIZE];
#endif
  int i, j, k;  // loop indices
  /*** Initialize matrices ***/
  for (i=0; i<SIZE; i++)
    for (j=0; j<SIZE; j++)
      a[i][j]= i+j;
  for (i=0; i<SIZE; i++)
    for (j=0; j<SIZE; j++)
      b[i][j]= i*j;
  for (i=0; i<SIZE; i++)
    for (j=0; j<SIZE; j++)
      c[i][j]= 0.0;
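  /*** Time 1000 repetitions of the naive i-j-k triple loop ***/
  double t1 = omp_get_wtime();
  for (int s=0; s<1000; s++) {
    for (i=0; i<SIZE; i++)
      for (j=0; j<SIZE; j++)
        for (k=0; k<SIZE; k++)
          c[i][j] += a[i][k] * b[k][j];
  }
  double t2 = omp_get_wtime() - t1;
  cout<<t2<<endl;  // elapsed wall-clock time in seconds
#if HEAP
  /*** Release the per-row heap allocations ***/
  for (i = 0; i < SIZE; ++i) {
    delete[] a[i];
    delete[] b[i];
    delete[] c[i];
  }
  delete[] a;
  delete[] b;
  delete[] c;
#endif
}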
int main(int argc, char** argv) {
  int size = 64; 
  if(argc > 1) size = atoi(argv[1]);
  warmup2(size);
  return 0;
}
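
A note on the heap vs stack comparison: the HEAP branch above allocates every row separately and reaches each element through a row pointer, while the stack arrays are contiguous. A minimal sketch of a contiguous heap layout, which would make the two cases closer to like-for-like, might look like this. FlatMatrix and multiply are hypothetical names, not part of the original code, and whether this changes the Denver timings is untested here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): a square matrix stored
// row-major in one contiguous heap buffer instead of one allocation per row.
struct FlatMatrix {
  std::size_t n;
  std::vector<double> data;

  explicit FlatMatrix(std::size_t size) : n(size), data(size * size, 0.0) {}

  // Element (i, j) lives at offset i * n + j.
  double& at(std::size_t i, std::size_t j) { return data[i * n + j]; }
  double at(std::size_t i, std::size_t j) const { return data[i * n + j]; }
};

// Same naive i-j-k product as in the post: c += a * b.
void multiply(const FlatMatrix& a, const FlatMatrix& b, FlatMatrix& c) {
  assert(a.n == b.n && b.n == c.n);
  const std::size_t n = a.n;
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t k = 0; k < n; ++k)
        c.at(i, j) += a.at(i, k) * b.at(k, j);
}
```

To use it in the benchmark, construct three FlatMatrix objects of the requested size, fill a and b with the same i+j and i*j values, and call multiply inside the timed loop.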

Run results (elapsed time in seconds, as reported by omp_get_wtime):

Matrix size 128

taskset -c 0 ./dgemm 128 [A57]
9.22673
taskset -c 1 ./dgemm 128 [Denver]
20.4897

Matrix size 64

taskset -c 0 ./dgemm 64 [A57]
0.792225
taskset -c 1 ./dgemm 64 [Denver]
0.618031
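
For reference, the arithmetic behind these timings can be sketched as follows. Going from size 64 to 128 multiplies the work of the triple loop by 8 (it scales with the cube of the size) and grows the three matrices from 96 KiB to 384 KiB in total, assuming 8-byte doubles. The helper names below are illustrative only and do not appear in the original code.

```cpp
#include <cassert>
#include <cstddef>

// Total bytes occupied by the three n x n double matrices a, b and c.
constexpr std::size_t working_set_bytes(std::size_t n) {
  return 3 * n * n * sizeof(double);
}

// Factor by which the arithmetic grows between two sizes (n^3 scaling).
constexpr double work_ratio(std::size_t n_small, std::size_t n_large) {
  return static_cast<double>(n_large * n_large * n_large) /
         static_cast<double>(n_small * n_small * n_small);
}

// Factor by which the measured time grew between the two runs.
inline double slowdown(double t_small, double t_large) {
  assert(t_small > 0.0);
  return t_large / t_small;
}
```

Plugging in the numbers above, slowdown(0.792225, 9.22673) is roughly 11.6 for the A57 and slowdown(0.618031, 20.4897) is roughly 33.2 for Denver, against a work_ratio(64, 128) of 8. Both cores scale worse than the arithmetic alone would predict, and Denver much more so. These figures alone do not identify the cause.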

System config:

SOC family:tegra186 Machine:quill

Online CPUs: 0-5

CPU Cluster Switching: Disabled

cpu0: Gonvernor=performance MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200

cpu1: Gonvernor=performance MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200

cpu2: Gonvernor=performance MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200

cpu3: Gonvernor=performance MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200

cpu4: Gonvernor=performance MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200

cpu5: Gonvernor=performance MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200

GPU MinFreq=114750000 MaxFreq=1300500000 CurrentFreq=114750000

EMC MinFreq=40800000 MaxFreq=1866000000 CurrentFreq=1866000000 FreqOverride=1

Fan: speed=0

Hi,
We have observed the same on TX2. Please refer to the following entry at
https://elinux.org/Jetson/L4T/r32.4.x_patches
[TX2] Denver cores not working on TX2

Hi, thanks for your response.
I see that the Denver cores are enabled and set to the highest frequency. Mode 0 is also already set (nvpmodel -m 0). When I use taskset to select core 1, the application runs on Denver. Is there any difference between the memory services of the two clusters that favours the A57, but not Denver, at size 128 and above?
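
One way to probe the memory question, offered only as a sketch: in the original i-j-k order the innermost loop walks down a column of b, so consecutive accesses are a full row apart. Interchanging the two inner loops makes the innermost loop walk along rows of b and c instead. Timing both orders on each core would show how sensitive each one is to that access pattern. Matrix, multiply_ijk and multiply_ikj are hypothetical names, and no result for this variant has been measured in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Loop order from the post: the innermost loop strides down a column of b.
void multiply_ijk(const Matrix& a, const Matrix& b, Matrix& c) {
  const std::size_t n = a.size();
  assert(b.size() == n && c.size() == n);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t k = 0; k < n; ++k)
        c[i][j] += a[i][k] * b[k][j];
}

// Interchanged order: the innermost loop moves along a row of b and of c,
// so consecutive iterations touch adjacent doubles.
void multiply_ikj(const Matrix& a, const Matrix& b, Matrix& c) {
  const std::size_t n = a.size();
  assert(b.size() == n && c.size() == n);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t k = 0; k < n; ++k) {
      const double aik = a[i][k];  // invariant across the j loop
      for (std::size_t j = 0; j < n; ++j)
        c[i][j] += aik * b[k][j];
    }
}
```

Both functions accumulate the same products for each element of c in the same k order, so they produce the same matrix. Only the memory access pattern differs.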

Hi,
We have made a similar observation when comparing the A57 and Denver cores. This looks to be a limitation of the hardware architecture and may affect benchmarks. For this reason, from JP4.4 (r32.4.3) onward, we mainly schedule tasks on the A57 cores.