Number of 64 bit floating point operations per clock cycle?

According to the table “Number of Operations per Clock Cycle per Multiprocessor” in the CUDA C Programming Guide, each multiprocessor of my GPU (GTS 450) can do 4 “64-bit floating-point add, multiply, multiply-add” operations per clock cycle. What does that really mean?
Does it mean that each multiprocessor has only 4 64-bit ALUs?
If I have a block with 32 threads doing double-precision floating-point operations, will only 4 of them execute in parallel while the others wait until the next clock cycle?

There are two versions of the GTS 450: the retail GTS 450 has 4 SMs, and the GTS 450 OEM has 3 SMs. (A cc 2.1 SM has 48 CUDA “cores”.)

Yes, a cc 2.1 multiprocessor (SM) has 48 SP “units” and 4 DP “units” (you can call them ALUs if you want).

Yes, that means each SM is able to “retire” up to 4 DP FMA instructions per cycle, so a full warp of DP FMA operations (32 threads / 4 units) would take at least 8 cycles to retire.

Looking at the entire chip, there are either 3 or 4 SMs, so the whole GPU could retire up to 12 or 16 DP FMA instructions per cycle.
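To put numbers on the above, here is a back-of-the-envelope sketch for the retail (4-SM) GTS 450. The 1566 MHz shader clock is an assumption taken from published GTS 450 specs, not from the table quoted above:

```python
# Theoretical peak throughput sketch for a retail GTS 450 (4 SMs, cc 2.1).
# The shader clock value is an assumption from published specs.
SMS = 4
DP_UNITS_PER_SM = 4       # DP FMA instructions retired per SM per cycle
SP_UNITS_PER_SM = 48      # SP "cores" per cc 2.1 SM
WARP_SIZE = 32
SHADER_CLOCK_GHZ = 1.566

# A full warp of DP FMAs needs WARP_SIZE / DP_UNITS_PER_SM cycles to retire.
dp_cycles_per_warp = WARP_SIZE // DP_UNITS_PER_SM

# Each FMA counts as 2 floating-point operations (multiply + add).
peak_dp_gflops = SMS * DP_UNITS_PER_SM * 2 * SHADER_CLOCK_GHZ
peak_sp_gflops = SMS * SP_UNITS_PER_SM * 2 * SHADER_CLOCK_GHZ

print(dp_cycles_per_warp)              # 8 cycles per DP warp
print(round(peak_dp_gflops, 1))        # ~50.1 GFLOPS double precision
print(round(peak_sp_gflops, 1))        # ~601.3 GFLOPS single precision
```

The roughly 12:1 gap between the SP and DP numbers is why DP-heavy code underperforms so badly on this class of GPU.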

In general, with the exception of the Titan line, GeForce products are not principally designed to deliver double-precision floating-point performance. Ordinary DX or OGL graphics has no use for double-precision floating point. Furthermore, the GTS 450 is a fairly low-end GPU in the previous “Fermi” generation. “Kepler” GPUs (cc 3.0/3.5) have been shipping for about 2 years now, and “Maxwell” GPUs (cc 5.0) just started shipping a few months ago.

DP = double precision = 64-bit = “double”
SP = single precision = 32-bit = “float”
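As a quick sanity check on those sizes, Python's `struct` module (whose `'d'` and `'f'` format codes map to C `double` and `float`) confirms the bit widths:

```python
import struct

# "double" occupies 64 bits (8 bytes); "float" occupies 32 bits (4 bytes).
print(struct.calcsize('d') * 8)  # 64
print(struct.calcsize('f') * 8)  # 32
```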

I have some code that runs slower on the GPU than on the CPU for many reasons: lots of conditional branches, dynamic memory allocation, and double-precision floating-point operations. I just want to explain why it is so slow and why running it on the GPU is not a good idea.