Why are threads executed in groups of 32 parallel threads?

The 9800 GT (for example) has 14 multiprocessors.

Each multiprocessor has 8 processors.

To execute a warp (32 parallel threads), will 4 multiprocessors be used (8*4=32)?

Where does this number “32” come from physically?

It comes from the 8 ALUs (“processors”) of a single multiprocessor, which execute the same instruction 4 times over a minimum of 4 clock cycles, so that 32 threads end up executing that instruction. A warp runs on one multiprocessor; it is not spread across four of them.
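To make the 8 x 4 = 32 arithmetic concrete, here is a minimal Python sketch. The function names are made up for illustration, and the ordering of threads across the four cycles (threads 0-7 first, then 8-15, and so on) is an assumption of the sketch, not something the hardware documentation is being quoted on.

```python
# Sketch: how one warp instruction is spread over the ALUs of ONE multiprocessor.
WARP_SIZE = 32     # threads per warp
ALUS_PER_SM = 8    # "processors" per multiprocessor on the 9800 GT


def cycles_per_warp_instruction(warp_size=WARP_SIZE, alus=ALUS_PER_SM):
    """Minimum ALU clock cycles needed to push a whole warp through one instruction."""
    return -(-warp_size // alus)  # ceiling division


def issue_schedule(warp_size=WARP_SIZE, alus=ALUS_PER_SM):
    """Assumed ordering: consecutive groups of `alus` threads, one group per cycle."""
    return [list(range(start, min(start + alus, warp_size)))
            for start in range(0, warp_size, alus)]


schedule = issue_schedule()
for cycle, lanes in enumerate(schedule):
    print(f"cycle {cycle}: threads {lanes[0]}-{lanes[-1]}")
```

With the defaults this prints four cycles, each covering 8 threads, which is where the minimum of 4 clock cycles per warp instruction comes from.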

It gets even more confusing when you consider that, for memory transactions, so-called “half-warps” of 16 threads become relevant. Bank conflicts and coalescing rules apply to the memory transactions issued by one half-warp.
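As a rough illustration of why the half-warp matters for bank conflicts, the sketch below models shared memory as 16 banks, each one 32-bit word wide, with consecutive words in consecutive banks. That layout is my assumption for this hardware generation, the function name is hypothetical, and the model is simplified (it treats all threads reading the same word as a single access).

```python
from collections import defaultdict

HALF_WARP = 16   # threads taking part in one shared-memory transaction
NUM_BANKS = 16   # assumed: 16 banks, each one 32-bit word wide


def conflict_degree(stride_words, threads=HALF_WARP, banks=NUM_BANKS):
    """Return the number of serialized passes a half-warp needs when
    thread t reads the 32-bit word at index t * stride_words."""
    words_per_bank = defaultdict(set)
    for t in range(threads):
        word = t * stride_words
        words_per_bank[word % banks].add(word)
    # Distinct words that fall into the same bank must be served one after another.
    return max(len(words) for words in words_per_bank.values())


for stride in (1, 2, 3, 16):
    print(f"stride {stride}: {conflict_degree(stride)} pass(es)")
```

Under this model a stride of 1 or 3 words needs a single pass, a stride of 2 needs two, and a stride of 16 puts every thread in the same bank and needs 16.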

I guess nVidia chose the figure of 32 because it might be considered the upper bound on the number of ALUs that nVidia might eventually cram into a single multiprocessor.

“Oversubscribing” the ALUs with independent threads also reduces the likelihood of pipeline hazards (where one instruction needs the result of the previous one) and allows the instruction decoder logic to run at a slower clock rate (less power usage) than the ALUs. I believe this is why the core clock on the multiprocessor is usually around half the clock rate of the ALUs.
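The hazard argument can be put into numbers with a small sketch. If each warp instruction occupies the ALUs for 4 cycles, a dependent instruction from the same warp cannot start until the previous result is ready, so other warps have to fill the gap. The function name is hypothetical and the 24-cycle pipeline latency is only an illustrative assumption, not a measured value for the 9800 GT.

```python
import math


def warps_to_hide_latency(pipeline_latency_cycles, cycles_per_warp_instruction=4):
    """Number of warps that must be ready to issue so the ALUs stay busy
    while one warp waits for the result of its previous instruction."""
    return math.ceil(pipeline_latency_cycles / cycles_per_warp_instruction)


latency = 24  # assumed pipeline latency in ALU cycles, for illustration only
warps = warps_to_hide_latency(latency)
print(f"{warps} warps ({warps * 32} threads) cover a {latency}-cycle latency")
```

The point is only the shape of the calculation: the longer the pipeline relative to the 4-cycle issue time, the more independent warps are needed per multiprocessor.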