First some background. I’m running on a CentOS 5.4 x86-64 system with PGI-10. The hardware is a dual-socket E5530 with 24GB of RAM. Each of the 2 sockets has 8MB of shared L3. Each of the 4 cores per CPU (8 total) has 256KB of dedicated L2, and 32KB/32KB of I/D in L1.
My code runs 1-8 threads over a range of array sizes that exercise the various caches. The inner loops are dreadfully simple, add and triad respectively:
for (i = 0; i < size; i++)
{
    c[i] = a[i] + b[i];
}
...
for (i = 0; i < size; i++)
{
    a[i] = b[i] + scalar * c[i];
}
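For reference, each bandwidth point is essentially bytes moved divided by wall time. A stripped-down sketch of such a measurement for the add loop (OpenMP and the 3-streams-of-doubles traffic count are only illustrative here, not my exact harness):

#include <omp.h>

/* Illustrative sketch only: time one pass of the add kernel and report GB/sec.
   Counts 3 streams of doubles per element (read a, read b, write c). */
double add_bandwidth(const double *a, const double *b, double *c, long size)
{
    long i;
    double t0, t1;

    t0 = omp_get_wtime();
    #pragma omp parallel for
    for (i = 0; i < size; i++)
        c[i] = a[i] + b[i];
    t1 = omp_get_wtime();

    return (3.0 * size * sizeof(double)) / (t1 - t0) / 1e9;
}

Build with -mp (pgcc) or -fopenmp (gcc) if you use OpenMP for the threading.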
Now for a baseline, gcc-4.4.2, compiled with only the -funroll-all-loops and -O4 optimization flags.
Plainly visible are main memory bandwidth, L3, L2, and L1 for 1, 2, 4, and 8 threads.
Main memory scales from 7GB/sec up to 23GB/sec or so.
L3 is visible in the 1-thread line at 8MB as the bandwidth goes from 7 to 15.5GB/sec, and in the 2-, 4-, and 8-thread lines at 16MB. This makes sense since there are only two L3s. Interestingly, the L3 region shows that additional threads help hide L3 latency: with 2 threads 32GB/sec is managed in L3, and 4 and 8 threads show 64GB/sec.
L2 is clearly visible at 256KB, 512KB, 1MB, and 2MB with 1, 2, 4, and 8 threads respectively. In the single-thread run, with 3 idle cores on the same socket, the L3 can almost keep up with L2. So the improvement is rather low with 1 and 2 threads (3 and 6 idle cores respectively). But with 8 threads fighting over L3 it’s clear that the L2 is a big help, taking the bandwidth from 77GB/sec to 180GB/sec.
L1 is barely visible in the 1-thread line at the very left of the graph, but is easily seen at the 64KB, 128KB, and 256KB thresholds for 2, 4, and 8 threads respectively. All classic textbook graphs for these things.
Now for the problems. I played with tons of flags with pgcc and ended up with one of two graphs:
Very strange indeed: good main memory and L1 bandwidths (noticeably better than gcc), but the apparent size of L2 is halved for all the add and triad lines. Even stranger, L1 seems halved only for the triad lines (but not the add lines).
It was compiled with:
-O3 -Msafeptr -fastsse
Next up is:
Compiled with -Mscalarsse -Mcache_align -O3 -Msafeptr.
Main memory looks good, L3 looks good, L2 looks good… but no L1!
Any hints on how to handle this? A, B, and C are double * that are allocated dynamically (to allow the constantly changing array size). What would the optimal alignment be? I’m making sure they are 128-byte aligned. How about the alignment between A, B, and C? Seems like you’d want them offset by somewhere around a third of L1 to ensure that reading/writing A doesn’t conflict in the same cache sets as part of B.
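To make that concrete, here’s a rough sketch of the kind of staggered, aligned allocation I have in mind (the one-third-of-L1 offsets are untested guesses):

#include <stdlib.h>

/* Illustrative sketch (untested): allocate a 128-byte aligned buffer and
   return a pointer offset into it, so a, b, and c can start roughly a third
   of L1 apart from each other.  Offsets that are multiples of 128 keep the
   returned pointer 128-byte aligned.  Keep the base pointer for free(). */
static double *alloc_offset(size_t n, size_t offset, void **base)
{
    void *p;
    /* over-allocate by the offset so the usable region is still n doubles */
    if (posix_memalign(&p, 128, n * sizeof(double) + offset))
        return NULL;
    *base = p;
    return (double *)((char *)p + offset);
}

/* usage sketch: offsets of 0, ~L1/3, ~2*L1/3, rounded down to 128 bytes */
/*   size_t third = (32768 / 3 / 128) * 128;      -> 10880 bytes         */
/*   double *a = alloc_offset(size, 0 * third, &base_a);                 */
/*   double *b = alloc_offset(size, 1 * third, &base_b);                 */
/*   double *c = alloc_offset(size, 2 * third, &base_c);                 */
/*   ...                                                                 */
/*   free(base_a); free(base_b); free(base_c);                           */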
Anything else? How about compiler flags? Any suggested way to get a nice healthy plot like gcc’s? Is accessing half of L1/L2 somehow faster? Does pgcc somehow statically partition L1/L2 for hyperthreading even when hyperthreading isn’t used?
I tried removing my pointer alignment tricks and it made the graphs noisier, but no better.
Hints, ideas, and suggestions welcome. Oh, BTW, -Msafeptr is awesome; I frequently see substantial performance hits when accessing dynamic arrays with the PathScale, gcc, and Open64 compilers. -Msafeptr makes malloc’d arrays just as fast as statically allocated arrays in my tests.
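For what it’s worth, the closest portable equivalent to -Msafeptr I know of is C99 restrict, which makes the same no-aliasing promise pointer by pointer. A sketch of the triad written that way (compile with -std=c99, or use __restrict__ as an extension):

/* Promise the compiler the arrays never overlap, so it can vectorize the
   malloc'd case just like the statically allocated one. */
void triad(double * restrict a, const double * restrict b,
           const double * restrict c, double scalar, long size)
{
    long i;
    for (i = 0; i < size; i++)
        a[i] = b[i] + scalar * c[i];
}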