Memory access tuning between CPU clusters

Hi there,

I’m doing some multi-threaded service performance tuning work on AGX Orin, and I found that the Orin TRM describes a 4 MB SLC.

Can this cache be shared by different CPU clusters? In other words, if I bind service threads to different CPU clusters, is there a level-4 cache that the program can take advantage of to optimize multi-threaded memory sharing?

Thanks

Sorry for the late response. I will forward this question to our team and see if they can share some information. Thanks.

Hello kayccc,

Thank you for the help.

I wrote a simple program to test the memory-sharing performance between CPU clusters.

The idea is to create a producer/consumer pair of threads that access the same memory data under spinlocks, bind each thread to a chosen CPU (same cluster or different clusters), and then collect performance data with perf.

The test program is this:

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sched.h>

#define LOOPS 1000000

/* Exactly one 64-byte cache line of shared data. */
struct cache_mem {
        int data;
        int reserved[15];
} __attribute__((aligned(64)));

static struct cache_mem cache_mem;

/* Ping-pong handshake: each spinlock is released by the opposite thread. */
static pthread_spinlock_t spin_rlock, spin_wlock;

/* Producer: pinned to CPU 0, writes a new value into the shared cache line,
 * then hands the line over to the reader. */
void *cache_writer(void *ptr)
{
        volatile struct cache_mem *p_mem = (struct cache_mem *)ptr;
        cpu_set_t cpuset;

        CPU_ZERO(&cpuset);
        CPU_SET(0, &cpuset);

        sched_setaffinity(0, sizeof(cpuset), &cpuset);

        for (int i = 0; i <= LOOPS; i++) {
                pthread_spin_lock(&spin_wlock);   /* wait until the reader consumed the previous value */
                p_mem->data = i;
                pthread_spin_unlock(&spin_rlock); /* let the reader consume the new value */
        }

        return NULL;
}

/* Consumer: pinned to CPU 1, reads the value back out of the shared cache line,
 * then hands the line back to the writer. */
void *cache_reader(void *ptr)
{
        volatile struct cache_mem *p_mem = (struct cache_mem *)ptr;
        int data;
        cpu_set_t cpuset;

        CPU_ZERO(&cpuset);
        CPU_SET(1, &cpuset);

        sched_setaffinity(0, sizeof(cpuset), &cpuset);

        for (int i = 0; i <= LOOPS; i++) {
                pthread_spin_lock(&spin_rlock);   /* wait until the writer produced a new value */
                data = p_mem->data + i;
                pthread_spin_unlock(&spin_wlock); /* let the writer produce the next value */
        }
        (void)data;

        return NULL;
}

int main(void)
{
        pthread_t thread1, thread2;
        int iret1, iret2;

        printf("cache_mem size: %zu\n", sizeof(cache_mem));

        pthread_spin_init(&spin_wlock, PTHREAD_PROCESS_PRIVATE);
        pthread_spin_init(&spin_rlock, PTHREAD_PROCESS_PRIVATE);
        /* The reader must wait for the first write. */
        pthread_spin_lock(&spin_rlock);

        iret1 = pthread_create(&thread1, NULL, cache_writer, (void *)&cache_mem);
        iret2 = pthread_create(&thread2, NULL, cache_reader, (void *)&cache_mem);

        pthread_join(thread1, NULL);
        pthread_join(thread2, NULL);

        printf("Thread 1 returns: %d\n", iret1);
        printf("Thread 2 returns: %d\n", iret2);

        return 0;
}

The CPU affinity of each thread can be changed through the first parameter of the CPU_SET() macro.
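For example, to reproduce the cross-cluster case, only the affinity lines in cache_reader() need to change (CPU 6 is assumed here to sit in a different cluster from the writer's CPU, which matches the core numbering I used for Test 2):

        CPU_ZERO(&cpuset);
        CPU_SET(6, &cpuset);  /* pin the reader to a core in another cluster */
        sched_setaffinity(0, sizeof(cpuset), &cpuset);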

With this test program, I ran the following two tests; the perf results are shown below.
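For reference, the build and measurement commands looked roughly like this (the source file name is just a placeholder, and the event list simply mirrors the counters shown in the output, so the exact perf invocation may differ on your setup):

gcc -pthread cache_share.c -o cache_share
perf stat -e cycles,stall,stall_backend_mem,LLC-loads,LLC-load-misses ./cache_share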

Test-1: Bind the two threads to CPU1 and CPU2 (same CPU cluster)
--------------------------------------------------

     1,280,958,048      cycles                                                                
     1,038,099,050      stall                                                                 
       740,295,337      stall_backend_mem                                                     
         9,252,513      LLC-loads                                                             
            21,747      LLC-load-misses                  #    0.24% of all L1-icache accesses 

       0.302212145 seconds time elapsed

       0.587178000 seconds user
       0.007941000 seconds sys
Test-2: Bind the two threads to CPU1 and CPU6 (different CPU clusters)
--------------------------------------------------

     4,871,339,259      cycles                                                                
     3,925,456,873      stall                                                                 
     3,459,301,975      stall_backend_mem                                                     
         9,270,005      LLC-loads                                                             
         9,218,631      LLC-load-misses                  #   99.45% of all L1-icache accesses 

       1.120535571 seconds time elapsed

       2.230693000 seconds user
       0.003987000 seconds sys

Test 1 shows that almost all LLC loads hit, while Test 2 shows that almost all of them miss. The misses in the cluster-local L3 are expected, but I would expect them to hit in the SLC.

If the SLC were shared by the two CPU clusters, I would not expect such a large performance gap between the two tests.

Test 2 shows much lower performance when the two threads run on different CPU clusters.

I’m not sure whether this is normal, or whether my understanding is mistaken.

If it is OK, please also forward these test data to the tech team; they may provide more detailed information for my question.

Thank you.

Cyrus Huang

Some updates from the internal team:
I think SLC is the external marketing term for the SCF cache. The SCF is the PoC for the CPU clusters and SoC devices. Some of the cache ways are reserved for the GPU to improve GPU performance; the rest are assigned to the CPU. I don’t think LLC-load-misses includes the SLC cache.

The SLC in the SCF is shared between CPU clusters, but sharing within a cluster’s DSU cache will be significantly faster than sharing between clusters via the SLC.

Thank you for the help.
After some local tests, we reached the same conclusion you described: “sharing within a cluster DSU cache will be significantly faster than sharing between clusters via SLC”.
So for our case, we will keep closely related threads within the same cluster to avoid cache-line sharing between different clusters and improve performance.
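For example, here is a minimal sketch of how we pin a group of worker threads to one cluster (this assumes CPUs 0-3 form a single DSU cluster, which matches the core numbering used in the tests above; the function name is just illustrative):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to CPUs 0-3, assumed to be one DSU cluster.
 * The scheduler may still move the thread between cores inside the mask,
 * but the shared cache lines stay within that cluster's DSU cache. */
static int pin_to_first_cluster(void)
{
        cpu_set_t cpuset;

        CPU_ZERO(&cpuset);
        for (int cpu = 0; cpu <= 3; cpu++)
                CPU_SET(cpu, &cpuset);

        return sched_setaffinity(0, sizeof(cpuset), &cpuset);  /* 0 == calling thread */
}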

BR,
Cyrus Huang
