Make a “speed of light” memory variant of your code: a version that does the same work but is clearly NOT memory bound.
You might replace every add, accumulate, multiply, etc., effectively changing from a[i] = b[i] * c[i] to a = b * c. This ends up using the same flops and control loops, but keeps all data in registers. The speed difference between the two versions will show whether memory access is what's causing your speed issues.
This strategy is endlessly useful in performance coding for CPU, SSE, and CUDA… it's a first-order test to find your bottleneck… memory? computation? latency? etc.
The one important game you have to play, though, is creating enough chained dependencies in your math to keep the compiler from optimizing the work away.
So with a = b * c, for example, the compiler could hoist the loop-invariant multiply and keep only the final evaluation. You have to think of ways to accumulate or mix data so the loop is equivalent in terms of math and control effort but can't be collapsed by a smart compiler.
Just because AMD and Intel chips both implement SSE2 does not mean that they implement it the same way in the hardware logic. If the matrices fit into the processor cache, then that's got to be the answer (unless there's something else you haven't mentioned)… or there is something funny about the way the AMD part accesses its cache that's causing your code to slow down.
If this is an absolutely critical piece of code, I’d contact AMD and send them an example and some test results to see if they can tell you what the problem is.
You might want to consider taking a look at some of the performance counters available on AMD processors. Cache misses, specifically, might be insightful. Just because your data set can fit in the L2 cache does not mean you are not thrashing the L1 or the instruction cache. There are a lot of low-level factors that could be hurting you here. I have seen cases, for example, where a critical instruction was misaligned in the instruction cache (split across two cache lines) such that only one instruction could be fetched per cycle by a decode unit normally able to handle two. Adding a nop before the instruction, so that the inner loop packed into a single cache line, literally doubled the performance of the code.
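On Linux, perf exposes these counters; a command-line sketch, assuming your hot loop is built into a hypothetical ./matmul binary (exact event names vary by CPU and kernel, so check perf list for what your machine supports):

```shell
# Cycle/instruction balance plus data- and instruction-cache misses.
# A low instructions-per-cycle ratio with high miss counts points at
# the cache rather than the ALUs.
perf stat -e cycles,instructions,L1-dcache-load-misses,L1-icache-load-misses,cache-misses ./matmul
```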
Unless you are doing this as an exercise for your own education, I would strongly suggest that you just use an implementation from AMD's math library (ACML), or a similar high-performance library like GotoBLAS or ATLAS. These can usually hit 70-90% of the theoretical GFLOPS of a particular CPU.