SSE2 on AMD and Intel - differing performance (slightly off-topic)


I implemented small matrix multiplication using SSE2. It runs twice as fast on my Intel CPUs, but on AMD the same implementation runs slower…
Yes, it is double-precision math…

The matrices are 20x20 in dimension… Both matrices fit entirely in the caches.

The Intel L1 cache is 8-way associative whereas AMD's is 2-way associative (but with more sets on AMD).

I use non-temporal writes to write out the result matrix so that the caches are not polluted.

But still, I wonder why the AMD Athlon X2 dual core is not benefiting from SSE2.

Any ideas?

Make a “speed of light” memory variant of your code. In this case, that means making a version of your code that is otherwise the same but clearly is NOT memory bound.
You might replace every add, accumulate, mult, etc., effectively changing from a[i]=b[i]*c[i] to a=b*c. This ends up using the same flops and control loops, but keeps all data in registers. The speed difference between the two versions will show whether memory access is causing your speed issues or not.

This strategy is endlessly useful in performance coding for CPU, SSE, and CUDA… it’s a first order test to find your bottleneck… memory? computation? latency? etc.

The one important game you have to play, though, is to create enough chained dependencies in your math to prevent the compiler from optimizing the work away.
For example, the a=b*c version could be optimized by the compiler unrolling the loop and keeping only the last evaluation. You have to think of ways to accumulate or mix data so it's equivalent in terms of math and control effort but can't be eliminated by a smart compiler.

Just because AMD and Intel chips both implement SSE2 does not mean they implement it the same way in the hardware logic. If the matrices fit into the processor cache, then that's got to be the answer (unless there's something else you haven't mentioned)… or there is something funny about the way AMD accesses its cache that's slowing your code down.

If this is an absolutely critical piece of code, I’d contact AMD and send them an example and some test results to see if they can tell you what the problem is.

You might want to consider taking a look at some of the performance counters available on AMD processors. Specifically cache misses might be insightful. Just because your data set can fit in the L2 cache does not mean that you are not thrashing the L1 or instruction cache. There are a lot of low level factors that could be hurting you here. I have seen cases, for example, where a critical instruction was misaligned in the instruction cache (spread across 2 lines) such that only one instruction could be fetched per cycle by a decode unit that is normally able to handle 2. Adding a nop before the instruction so that the inner loop would be packed into a single cache line literally doubled the performance of the code.

Unless you are doing this as an exercise for your own education, I would strongly suggest that you just use an implementation from AMD's Core Math Library (ACML), or a similar high-performance library like GotoBLAS or ATLAS. These can usually hit 70-90% of the theoretical GFLOP/s of a particular CPU.


I will try out what you said… That sounds like a plan.

Most likely true… I think the L1 is being thrashed even though I use non-temporal writes for the results…

This is the response I got from the AMD forums as well. Athlons take 2 cycles to process 128-bit data, whereas Phenoms take 1 clock cycle.

They also said the L1 cache bandwidth is lower on Athlons…

That sounds absolutely crazy… PowerPCs should fare better there, as they have fixed-length instructions.

Anyway, THANKS a lot for this tip!

Yup! It is a trampoline to speed up some other code. I have ACML downloaded and sleeping… Got to look into it. THANKS!

btw, Windows 7 will have the HCP API for reading hardware counters - so I can ignore the differences between Intel and AMD. I hope that will help.