Make a “speed of light” memory variant of your code: a version that does the same work but is clearly NOT memory bound.
You might replace every add, accumulate, multiply, etc., effectively changing from a[i] = b[i] * c[i] to a = b * c. This ends up using the same flops and control loops, but keeps all data in registers. The speed difference between the two versions will show whether memory access is what's causing your speed issues.
This strategy is endlessly useful in performance coding for CPU, SSE, and CUDA… it's a first-order test to find your bottleneck… memory? computation? latency? etc.
The one important game you have to play, though, is creating enough chained dependencies in your math to keep the compiler from optimizing the work away.
So with a = b * c, for example, the compiler could hoist the loop-invariant multiply and keep only the final evaluation. You have to think of ways to accumulate or mix data so the loop is equivalent in terms of math and control effort but can't be collapsed by a smart compiler.
Just because AMD and Intel chips both implement SSE2 does not mean that they implement it the same way in the hardware logic. If the matrices fit into the processor cache, then that's got to be the answer (unless there's something else you haven't mentioned)… or there is something funny about the way the AMD part accesses its cache that's causing your code to slow down.
If this is an absolutely critical piece of code, I’d contact AMD and send them an example and some test results to see if they can tell you what the problem is.
You might want to consider taking a look at some of the performance counters available on AMD processors. Cache misses, specifically, might be insightful. Just because your data set can fit in the L2 cache does not mean you are not thrashing the L1 or the instruction cache. There are a lot of low-level factors that could be hurting you here. I have seen cases, for example, where a critical instruction was misaligned in the instruction cache (split across two cache lines) such that only one instruction could be fetched per cycle by a decode unit normally able to handle two. Adding a nop before the instruction, so that the inner loop packed into a single cache line, literally doubled the performance of the code.
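On Linux, perf exposes these counters; a command-line sketch, assuming your hot loop is built into a hypothetical ./matmul binary (exact event names vary by CPU and kernel, so check perf list for what your machine supports):

```shell
# Cycle/instruction balance plus data- and instruction-cache misses.
# A low instructions-per-cycle ratio with high miss counts points at
# the cache rather than the ALUs.
perf stat -e cycles,instructions,L1-dcache-load-misses,L1-icache-load-misses,cache-misses ./matmul
```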
Unless you are doing this as an exercise for your own education, I would strongly suggest that you just use an implementation from AMD's math library (ACML), or a similar high-performance library like GotoBLAS or ATLAS. These can usually hit 70-90% of the theoretical GFLOPS of a particular CPU.