Need hints to get good memory bandwidth with pgcc-10

First some background. I’m running on a CentOS 5.4 x86-64 system with PGI 10. The hardware is a dual-socket E5530 with 24GB of RAM. Each of the 2 sockets has 8MB of shared L3. Each of the 4 cores per CPU (8 total) has 256KB of dedicated L2, and 32KB/32KB of I/D in L1.

My code runs 1-8 threads over a range of array sizes that exercise the various caches. The inner loops are dreadfully simple, add and triad respectively:

for (i = 0; i < size; i++)
{
    c[i] = a[i] + b[i];
}
...
for (i = 0; i < size; i++)
{
    a[i] = b[i] + scalar * c[i];
}

Now for a baseline: gcc-4.4.2, compiled with only the -funroll-all-loops and -O4 optimization flags.

[Graph: gcc-4.4.2 baseline, bandwidth vs. array size for 1, 2, 4, and 8 threads]

Plainly visible are the main memory, L3, L2, and L1 bandwidth plateaus for 1, 2, 4, and 8 threads.

Main memory scales from 7GB/sec up to 23GB/sec or so.

L3 is visible in the 1-thread line at 8MB, where the bandwidth goes from 7 to 15.5GB/sec, and in the 2-, 4-, and 8-thread lines at 16MB. This makes sense since there are only two L3s. Interestingly, the L3 shows that additional threads help hide its latency: with 2 threads 32GB/sec is managed in L3, and 4 and 8 threads show 64GB/sec.

L2 is clearly visible at 256k, 512k, 1MB, and 2MB with 1, 2, 4, and 8 threads respectively. In the single-thread run, with 3 idle cores on the socket, the L3 can almost keep up with L2, so the improvement is rather small with 1 and 2 threads (3 and 6 idle cores respectively). But with 8 threads fighting for L3, it’s clear that the L2 is a big help, taking the bandwidth from 77GB/sec to 180GB/sec.

L1 is barely visible in the 1-thread line at the very left of the graph, but is easily seen at the 64k, 128k, and 256k thresholds for 2, 4, and 8 threads respectively. All classic textbook graphs for these things.

Now for the problems. I played with tons of flags with pgcc and ended up with one of two graphs:

[Graph: pgcc with -O3 -Msafeptr -fastsse, bandwidth vs. array size]

Very strange indeed: good main memory and L1 bandwidths (noticeably better than gcc), but the apparent size of L2 is halved for all add and triad lines. Even stranger, the L1 seems halved only for the triad lines (but not the add lines).

It was compiled with:
-O3 -Msafeptr -fastsse

Next up is:

[Graph: pgcc with -Mscalarsse -Mcache_align -O3 -Msafeptr, bandwidth vs. array size]

Compiled with -Mscalarsse -Mcache_align -O3 -Msafeptr.

Main memory looks good, L3 looks good, L2 looks good… but no L1!

Any hints on how to handle this? A, B, and C are double * and are allocated dynamically (to allow the constantly changing array size). What would the optimal alignment be? I’m making sure they are 128-byte aligned. How about the alignment between A, B, and C? It seems like you’d want them offset by somewhere around 1/3rd of L1 to ensure that reading/writing A doesn’t land on the same cache lines as parts of B.
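For concreteness, here’s a minimal sketch of the allocation scheme I’m describing (the OFFSET value is just my rough 1/3-of-L1 guess, kept a multiple of 128 bytes so the alignment is preserved):

#define _XOPEN_SOURCE 600   /* for posix_memalign */
#include <stdlib.h>

#define ALIGN  128
#define OFFSET (10 * 1024)  /* ~1/3 of the 32KB L1; a multiple of 128 preserves alignment */

/* Allocate n doubles, 128-byte aligned, then shift by stagger bytes so the
   arrays start in different cache sets.  *base receives the pointer that
   must eventually be passed to free(). */
static double *aligned_array(size_t n, size_t stagger, void **base)
{
    if (posix_memalign(base, ALIGN, n * sizeof(double) + stagger) != 0)
        return NULL;
    return (double *)((char *)*base + stagger);
}

/* A, B, and C staggered by 0, 1, and 2 offsets respectively. */
void alloc_arrays(size_t size, double **a, double **b, double **c, void *bases[3])
{
    *a = aligned_array(size, 0 * OFFSET, &bases[0]);
    *b = aligned_array(size, 1 * OFFSET, &bases[1]);
    *c = aligned_array(size, 2 * OFFSET, &bases[2]);
}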

Anything else? How about compiler flags? Any suggested way to get a nice healthy plot like gcc’s? Is accessing half of L1/L2 somehow faster? Does pgcc somehow statically partition the L1/L2 caches for hyperthreading even if hyperthreading isn’t used?

I tried removing my pointer alignment tricks and it made the graphs noisier, but no better.

Hints, ideas, and suggestions welcome. Oh, BTW, -Msafeptr is awesome; I frequently see substantial performance hits when accessing dynamic arrays with the pathscale, gcc, and open64 compilers. -Msafeptr makes malloc’d arrays just as fast as statically allocated arrays in my tests.

Hi Bill,

> Hints, ideas, and suggestions welcome. Oh, BTW, -Msafeptr is awesome; I frequently see substantial performance hits when accessing dynamic arrays with the pathscale, gcc, and open64 compilers. -Msafeptr makes malloc’d arrays just as fast as statically allocated arrays in my tests.

Cool stuff. I’ll start at the end. “-Msafeptr” has the effect of adding the C99 “restrict” keyword to all your pointers. In other words, it tells the compiler that no pointers overlap and gives it the go-ahead to perform vectorization (-Mvect or -fastsse), auto-parallelization (-Mconcur), and GPU acceleration (!$acc). Without restrict or -Msafeptr, the compiler must assume that pointers can overlap. While -Msafeptr is useful, it’s better programming practice to use the restrict keyword.
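For example, your triad loop with restrict-qualified parameters might look something like this:

#include <stddef.h>

/* restrict promises the compiler that a, b, and c never alias, so the
   loop can be vectorized without any pointer-safety flags. */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double scalar, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}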

“-fastsse” is an aggregate flag, i.e. a set of other flags. While it can change a bit from release to release and platform to platform, in our most current compiler (2010) on 64-bit Linux with a Nehalem processor, “-fastsse” is composed of “-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mpre -Mvect=sse,altcode,prefetch -Mscalarsse -Mcache_align -Msmart”. For your tests, the major difference between the graphs is most likely due to vectorization (-Mvect).

With altcode generation, the compiler generates several alternate versions of each loop; which version actually executes is determined at runtime by the loop size and the cache size. My guess is this is why you’re seeing the big sudden drops.
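Conceptually, it’s as if the compiler wrote something like the following by hand for your add loop (a rough illustration only; the threshold and the two variants here are assumptions, not what pgcc actually emits):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

#define L3_BYTES (8u << 20)  /* assumed 8MB L3 */

/* Cached variant: normal stores; best when the working set fits in cache.
   Both variants assume 16-byte-aligned arrays and an even trip count. */
static void add_cached(double *c, const double *a, const double *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i += 2)
        _mm_store_pd(&c[i], _mm_add_pd(_mm_load_pd(&a[i]), _mm_load_pd(&b[i])));
}

/* Streaming variant: nontemporal stores bypass the cache; best when the
   working set is far larger than cache. */
static void add_streaming(double *c, const double *a, const double *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i += 2)
        _mm_stream_pd(&c[i], _mm_add_pd(_mm_load_pd(&a[i]), _mm_load_pd(&b[i])));
    _mm_sfence();  /* make the streamed stores globally visible */
}

void add(double *c, const double *a, const double *b, size_t n)
{
    if (3 * n * sizeof(double) > L3_BYTES)  /* three streams vs. cache size */
        add_streaming(c, a, b, n);
    else
        add_cached(c, a, b, n);
}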

I would baseline with “-fastsse” then try the following sets. Note that if you are using OpenMP, add “-mp=align”.

  • -fastsse -Mnontemporal
  • -fastsse -Mvect=noaltcode
  • -fastsse -Mprefetch=nta
  • -fastsse -Mprefetch=t0
  • -fastsse -Mprefetch=distance:8

Other flag sets you might try, though they may not help:

  • -fastsse -Mipa=fast,inline
  • -fastsse -Mvect=cachesize:8388608
  • -fastsse -Mvect=short

Let us know how things go!

- Mat

> Cool stuff. I’ll start at the end. “-Msafeptr” has the effect of adding the C99 “restrict” keyword to all your pointers. In other words, it tells the compiler that no pointers overlap and gives it the go-ahead to perform vectorization (-Mvect or -fastsse), auto-parallelization (-Mconcur), and GPU acceleration (!$acc). Without restrict or -Msafeptr, the compiler must assume that pointers can overlap. While -Msafeptr is useful, it’s better programming practice to use the restrict keyword.

Interesting, thanks for the additional information. I tried the restrict keyword before without much effect, but it was a while ago and I don’t remember with which compilers. It was very frustrating to have non-overlapping pointers but pay the performance penalty anyway.

> “-fastsse” is an aggregate flag, i.e. a set of other flags. While it can change a bit from release to release and platform to platform, in our most current compiler (2010) on 64-bit Linux with a Nehalem processor, “-fastsse” is composed of “-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mpre -Mvect=sse,altcode,prefetch -Mscalarsse -Mcache_align -Msmart”. For your tests, the major difference between the graphs is most likely due to vectorization (-Mvect).
>
> With altcode generation, the compiler generates several alternate versions of each loop; which version actually executes is determined at runtime by the loop size and the cache size. My guess is this is why you’re seeing the big sudden drops.

I studied the various mentions of altcode and -Mvect on the manpage. It seems like in all my cases it gets things fairly wrong: out of my 9 runs, the only 2 that got the cache sizes right were the one where the cache size is explicitly set and the one where altcode is turned off.

So with a simple loop:

for (i = 0; i < size; i++)
{
    c[i] = a[i] + b[i];
}

As I dynamically (at runtime) change size, different code will be run? Interesting. I had wondered if I’d have to do that manually, and then started wondering if I could trigger different optimizations for arrays that fit in each level of cache. Sounds like the answer to that is altcode.

From the graphs it seems as if the compiler is a bit off on some of the thresholds. I ran all of your suggested cases, er, at least what I think you wanted. I compiled with -Msafeptr in all cases:

  • OO OPT=-fastsse
  • OA OPT=-fastsse -Mnontemporal
  • OB OPT=-fastsse -Mvect=noaltcode
  • OC OPT=-fastsse -Mprefetch=nta
  • OD OPT=-fastsse -Mprefetch=t0
  • OE OPT=-fastsse -Mprefetch=distance:8
  • OF OPT=-fastsse -Mipa=fast,inline
  • OG OPT=-fastsse -Mvect=cachesize:8388608
  • OH OPT=-fastsse -Mvect=short

They are all at:
http://cse.ucdavis.edu/bill/s1/

The most interesting (IMO) is the run with the explicit cache size:

[Graph: -fastsse -Mvect=cachesize:8388608 run]

The explicitly set cachesize did exactly the opposite of what I expected: in the 1- and 2-thread runs the L3 cache size is off by a factor of two. Even stranger, it seems to have fixed the L1 and L2 sizes.

The no-altcode run:

[Graph: -fastsse -Mvect=noaltcode run]

It looks really good; the main problem is main memory, which runs at 22.75GB/sec instead of the 29.5GB/sec seen in most of the other graphs.

So basically altcode looks really interesting, a very promising technique, but at least for how I use it, it seems to usually halve the apparent size of each level of cache.

> I would baseline with “-fastsse” then try the following sets. Note that if you are using OpenMP, add “-mp=align”.

I’m using pthreads to vary the number of threads, not OpenMP.
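
In case it matters, the harness looks roughly like the simplified sketch below (names are made up, and the real code times only the kernel loop, not thread creation):

#include <pthread.h>
#include <stddef.h>

/* One worker's view of the shared arrays: a half-open [begin, end) slice. */
typedef struct {
    double *a, *b, *c;
    size_t begin, end;
} slice_t;

static void *add_worker(void *arg)
{
    slice_t *s = arg;
    size_t i;
    for (i = s->begin; i < s->end; i++)
        s->c[i] = s->a[i] + s->b[i];
    return NULL;
}

/* Run the add kernel with nthreads (1-8) workers over n elements. */
void run_add(double *a, double *b, double *c, size_t n, int nthreads)
{
    pthread_t tid[8];
    slice_t   sl[8];
    size_t chunk = n / (size_t)nthreads;
    int t;

    for (t = 0; t < nthreads; t++) {
        sl[t].a = a; sl[t].b = b; sl[t].c = c;
        sl[t].begin = (size_t)t * chunk;
        sl[t].end   = (t == nthreads - 1) ? n : (size_t)(t + 1) * chunk;
        pthread_create(&tid[t], NULL, add_worker, &sl[t]);
    }
    for (t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}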