Kirk and Hwu

In the book “Programming Massively Parallel Processors” by Kirk and Hwu, the authors work through a practical optimization case study on MRI (magnetic resonance imaging) in Chapter 7. They present some C code, which I will not show here. They state:

“Despite the algorithm’s abundant inherent parallelism, potential performance bottlenecks are evident. First, in the loop that computes the elements of FHd, the ratio of floating-point operations to memory accesses is at best 3:1 and at worst 1:1.” (In FHd, the H is a superscript.)

I understood how the ratio of floating-point operations to memory accesses was defined in the earlier chapters, but I do not see where 3:1 at best and 1:1 at worst come from here. Exactly which section of the code are they talking about?

“The best case assumes that the sin and cos trigonometry operations are computed using five-element Taylor series that require 13 and 12 floating-point operations, respectively. The worst case assumes that each trigonometric operation is computed as a single operation in hardware.”

I know what a Taylor series is, but it is unclear to me how it consumes so much time. Of course, computing in hardware is more efficient than computing in software. So what are they saying is wrong here?
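To make the Taylor series part concrete, here is a minimal sketch in C of what I assume a five-term series for sin and cos looks like. The function names taylor_sin5 and taylor_cos5 are my own. This is not the book's code, and the exact operation count depends on how the polynomial is arranged, so it need not match the 13 and 12 quoted above.

```c
#include <math.h>

/* Hypothetical helper (not from the book): five-term Taylor series for
   sin(x) about 0, written in Horner form on x*x:
   x - x^3/3! + x^5/5! - x^7/7! + x^9/9!
   Only accurate for small |x|; a real implementation would first
   reduce the argument into a small range. */
static double taylor_sin5(double x) {
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0
                + x2 * (1.0 / 120.0
                + x2 * (-1.0 / 5040.0
                + x2 * (1.0 / 362880.0)))));
}

/* Hypothetical helper (not from the book): five-term Taylor series for
   cos(x) about 0:
   1 - x^2/2! + x^4/4! - x^6/6! + x^8/8! */
static double taylor_cos5(double x) {
    double x2 = x * x;
    return 1.0 + x2 * (-1.0 / 2.0
               + x2 * (1.0 / 24.0
               + x2 * (-1.0 / 720.0
               + x2 * (1.0 / 40320.0))));
}
```

Calling taylor_sin5(0.5) and taylor_cos5(0.5) gives values close to sin(0.5) and cos(0.5) from math.h. As far as I can tell, each call is a handful of multiplies and adds on a value already in a register, with no extra memory accesses, which is why I do not see how it hurts the ratio.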

Finally, they state that the ratio of FP arithmetic to FP trigonometry functions is only 13:2. FP stands for floating point, but how does calculating sin and cos by Taylor series contribute to long latency? Using a Taylor series will add time, but does it also contribute to long latency?

I am just not seeing their argument. Any help is appreciated.