FLOPS, Hz, cycles: am I missing something in my calculations?

I want to do a matrix multiplication on a GPU. Before that, please clear up some supercomputing terms for me, or tell me if I'm missing anything.

I’ve got a Xeon E5470 machine: dual-socket processors, 4 cores on each one, @ 3 GHz.

As per an AnandTech article, the E5472 manages 0.3 flops/cycle with GCC at -O3 optimisation.

That makes 3 * 10^9 * 0.3 floating point operations per second => 9 * 10^8 FLOPS.

Keeping this in mind, I ran a simple matrix multiplication program on the CPU.

For a 1000x1000 matrix:
this would require 3 * 10^3 * 10^3 (for the 3 matrices which are involved in the multiplication)
and 2 floating point instructions per operation, as an addition and a multiplication are involved.
This makes it 2 * 3 * 10^6 = 6 * 10^6.

The total time it should have taken is (6 * 10^6) / (9 * 10^8) ≈ 0.0067 sec.

But it's actually taking 4.4 seconds with GCC at -O3 optimisation.

Am I missing something? Please help soon.

Thanks in advance.

I’ve attached my code.

Hi. Yes, there is something missing. I’m not sure about the Xeon processor specifically, but on many processors a floating point multiplication takes considerably longer than an addition. Also, unless you specifically write your code to utilize all four cores of your CPU, only one core is going to get used. The article may also have been talking about how many flops can be achieved by utilizing SSE extensions, which can be up to four times as fast as code written without them for single-precision floats, since one SSE instruction works on four of them at once. There are many, many other things which affect CPU performance, which I won’t go into :P
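
One more thing worth checking is the operation count itself. A naive N x N matrix multiply computes N * N output elements, and each one needs N multiplications and N additions, so the total is 2 * N^3 floating point operations rather than 2 * 3 * N^2. For N = 1000 that is 2 * 10^9, and at 9 * 10^8 FLOPS the estimate becomes roughly 2.2 seconds, which is a lot closer to the 4.4 seconds measured. Here is a minimal C sketch of that count, assuming a plain triple loop over row-major float arrays. The attached code isn't shown in the thread, so this is not necessarily how it is written, and the names `matmul_flops`, `matmul_naive` and `measure_flops_per_sec` are made up for illustration:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Floating point operations in a naive n x n matrix multiply:
   n * n output elements, each needing n multiplies and n adds. */
double matmul_flops(int n)
{
    return 2.0 * (double)n * (double)n * (double)n;
}

/* Naive c = a * b for row-major n x n matrices (assumed layout). */
void matmul_naive(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j]; /* 1 multiply + 1 add */
            }
            c[i * n + j] = sum;
        }
    }
}

/* Hypothetical helper: times one multiply on a single core and returns
   the achieved rate in FLOPS, or -1.0 if memory allocation fails. */
double measure_flops_per_sec(int n)
{
    size_t count = (size_t)n * (size_t)n;
    float *a = malloc(count * sizeof *a);
    float *b = malloc(count * sizeof *b);
    float *c = malloc(count * sizeof *c);
    if (!a || !b || !c) {
        free(a);
        free(b);
        free(c);
        return -1.0;
    }
    for (size_t i = 0; i < count; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    clock_t start = clock(); /* CPU time used by this process */
    matmul_naive(n, a, b, c);
    clock_t stop = clock();
    double seconds = (double)(stop - start) / CLOCKS_PER_SEC;

    /* Printing c[0] keeps the compiler from discarding the multiply. */
    printf("n=%d flops=%.3g time=%.3f s c[0]=%g\n",
           n, matmul_flops(n), seconds, c[0]);

    double rate = seconds > 0.0 ? matmul_flops(n) / seconds : 0.0;
    free(a);
    free(b);
    free(c);
    return rate;
}
```

Calling `measure_flops_per_sec(1000)` from `main` and building with `gcc -O3` should give you a measured rate to compare against the 9 * 10^8 estimate. The number you get will depend on the compiler, cache behaviour and whether SSE is used, so treat it as a rough sanity check only.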