Latency depends on the instruction in question. For example, division usually has much higher latency than multiplication.
The Intel Optimization Reference Manual lists latencies of the various instructions for each microarchitecture. Note that division latency has improved considerably in the latest i7 parts, but it is still several times the latency of a multiply.
The guide defines "latency" as the total number of clock cycles taken to execute the instruction, i.e. until its result is available.
It defines "throughput" as the number of clock cycles to wait before an instruction of the same type can be issued again.
Also note that a lot of CPU performance comes from keeping all the execution units busy in parallel. For example, an FP MUL and an FP ADD can execute at the same time because they run on independent execution units. Modern Intel cores have six execution ports, so if you rearrange your code intelligently to keep the CPU busy, you can even hide your L1 cache latency, and intelligent prefetching can hide much of the remaining memory latency as well.
This is the kind of thinking that goes into the design of libraries like MKL.