No, the biggest GPUs have 16 multiprocessors, with 8 ALUs per multiprocessor (together able to execute a warp of 32 threads), and that's where the number 128 you read around comes from (16×8). But calling a GeForce a 128-core processor is akin to calling an Intel Core 2 Quad a "16-core CPU" because the SIMD unit found on each of its cores (SSE) can process up to 4 floats in parallel.
This has implications for diverging code paths: all units in the same multiprocessor execute the same instruction, so if code paths diverge, you can't run the 2 different paths in parallel inside the same SIMD processor.
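To make the divergence cost concrete, here is a toy lockstep model (plain Python, not real hardware; lane widths and the helper name `simd_branch` are made up for illustration): when the lanes of one SIMD unit disagree on a branch, the unit has to run both sides back to back with inactive lanes masked off.

```python
def simd_branch(values, then_fn, else_fn, cond):
    """Execute an if/else across all lanes in lockstep.
    Returns the per-lane results plus how many passes the unit needed:
    1 when every lane agrees on the branch, 2 when the branch diverges."""
    mask = [cond(v) for v in values]
    out = list(values)
    passes = 0
    if any(mask):                 # 'then' side runs if any lane takes it
        passes += 1
        for i, m in enumerate(mask):
            if m:
                out[i] = then_fn(values[i])
    if not all(mask):             # 'else' side runs if any lane skips it
        passes += 1
        for i, m in enumerate(mask):
            if not m:
                out[i] = else_fn(values[i])
    return out, passes

# Uniform warp: every lane takes the same path -> one pass
print(simd_branch([2, 4, 6, 8], lambda v: v * 2, lambda v: v + 1,
                  lambda v: v % 2 == 0))  # ([4, 8, 12, 16], 1)
# Diverged warp: both paths execute serially -> two passes
print(simd_branch([1, 2, 3, 4], lambda v: v * 2, lambda v: v + 1,
                  lambda v: v % 2 == 0))  # ([2, 4, 4, 8], 2)
```

The doubling of passes is exactly the penalty you pay on the real hardware when a warp diverges.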
And although execution may diverge between the 16 multiprocessors, the architecture is still SPMD (single program, multiple data), meaning that all processors run the same kernel anyway.
That's why, I think, nVidia engineers don't seem as fond of ray tracing as Intel's.
Processors like the Cell and PhysX follow a different architecture. Each separate core can run a different program. Each core has a small, very fast local memory (similar to the GPU's), and each core can quickly communicate and exchange data with the others using DMA. This allows each core to execute a different step of a pipeline, with data streamed quickly across the whole stack. (This is something the GeForce can't do at all. Its processors can only load/store data to device memory; they can't exchange data with each other. And they can't stream a continuous flow of data, only apply a kernel to a block of memory defined beforehand [although the same effect can be approximated by using lots of small buffers]. Also, because of the SPMD approach, they can't have each processor run a separate step of a pipeline.)
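The pipeline style described above can be sketched in ordinary Python: each "core" is a thread running one stage, and small bounded queues stand in for the DMA channels between local stores. The stage functions and the `run_pipeline` helper are invented for illustration, not any real Cell or PhysX API.

```python
import threading
import queue

def stage(fn, src, dst):
    """One pipeline stage: pull items from src, transform, push to dst."""
    while True:
        item = src.get()
        if item is None:          # sentinel: propagate shutdown downstream
            dst.put(None)
            return
        dst.put(fn(item))

def run_pipeline(stage_fns, inputs):
    """Chain the stage functions with bounded queues and stream the
    inputs through; items flow stage to stage while all stages run."""
    qs = [queue.Queue(maxsize=4) for _ in range(len(stage_fns) + 1)]
    threads = [threading.Thread(target=stage, args=(fn, qs[i], qs[i + 1]))
               for i, fn in enumerate(stage_fns)]
    for t in threads:
        t.start()
    for x in inputs:
        qs[0].put(x)
    qs[0].put(None)
    out = []
    while (item := qs[-1].get()) is not None:
        out.append(item)
    for t in threads:
        t.join()
    return out

# Three stages, each of which could live on its own core:
print(run_pipeline([lambda x: x + 1, lambda x: x * 2, lambda x: x - 3],
                   range(5)))  # [-1, 1, 3, 5, 7]
```

The point is the shape, not the speed: every stage is a different program running concurrently on streaming data, which is precisely what an SPMD GPU of that era could not express.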
Each architecture has its advantages and drawbacks, and conclusions about which architecture is best suited to which situation are better left to professional engineers to discuss.
But I suspect that most of the work done by a physics engine doesn't necessarily benefit from a Cell-/PhysX-like architecture, as finding collisions seems to be just a big number-crunching problem without that many discrete steps. It's mostly lots of geometric computation, for which GPUs are already nicely optimised, and you get the added benefit that all the data is already on the GPU for subsequent rendering.
The Cell-/PhysX-like architecture seems better suited to complex tasks involving, for example, decompressing video, applying filters, and recompressing in real time.
But on the other hand, I'm not an expert, and perhaps the CUDA implementation is faster only because of its higher clock frequency and the proximity between the physics computation and the renderer.