At first I thought CUDA was a generalization of the GPU programming model, but now I realize the GPU has features that CUDA is missing. The obvious one is SIMD support.
We have float2, float3, and float4, but no SIMD operations for these structures.
Swizzling is another missing feature. So I wonder: there should be cases where the shader version of a problem is faster than the CUDA version, by exploiting the SIMD instructions. In that case, could we use shaders and CUDA at the same time? For example, could I use CUDA to load data onto the device, use a GPU shader program to process the data, and return the result to the CPU via CUDA?
Super-scalar doesn't mean it scales superbly. G80 is just scalar, meaning it doesn't use vectors. What super-scalar actually means is that a single thread of execution gets re-chopped by the CPU as it executes, to figure out which parts of it can run in parallel. That wastes swaths of transistors, and the whole point of GPUs is that they're not super-scalar.
Also: if you remember, the 7900 had 16 pipelines with 2 x float4 ALUs. The 8800 has 16 multiprocessors with 8 float1 ALUs each. What NVIDIA (and also ATI) realized is that all those SIMD instructions would work just as well if you treated each component as its own thread. The thing is, all the threads have to execute the exact same instruction anyway (it's always been that way; that's why branching has to be emulated), so it's almost an illusion and it's the same thing either way. G80 could still be called SIMD, although NVIDIA uses "MIMD" (which is kind of dumb). I have to say this was rather clever, though I'm not sure why transistor counts had to double in G80 (longer pipelines for higher clocks?).
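The component-as-thread trick can be sketched in a few lines of Python (purely illustrative; the function names are made up): a lockstep group of scalar "threads", each executing the identical instruction on its own lane, produces exactly the same result as the old float4 SIMD ALU:

```python
# Illustrative sketch, not real hardware: a float4 SIMD multiply vs.
# four scalar "threads" executing the same instruction in lockstep.

def simd_mul4(a, b):
    # Old-style pipeline: one float4 ALU, one vector instruction.
    return [a[i] * b[i] for i in range(4)]

def simt_mul4(a, b):
    # G80-style: each component is its own thread; every thread runs
    # the identical scalar multiply, just on a different lane.
    def scalar_thread(lane):
        return a[lane] * b[lane]
    return [scalar_thread(lane) for lane in range(4)]

a, b = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]
assert simd_mul4(a, b) == simt_mul4(a, b) == [5.0, 12.0, 21.0, 32.0]
```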
The G80 can't do a dot product in one cycle, because it's fully scalar. It now takes 4 cycles. A Core 2 Duo, btw, can do a vect4 multiply-add in one cycle (single precision), in each of its cores. It's also got a GHz advantage. In terms of peak flops, G80 is about 10x a Core 2 Duo (and only about 5x a Quad). Memory bandwidth is also 10x, but cache size is more like 1/10th.
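The roughly-10x figure checks out on the back of an envelope. The clock speeds here are my assumptions (1.35 GHz shader clock for G80, 2.66 GHz for the Core 2), as is the Core 2 issue rate of one 4-wide multiply plus one 4-wide add per core per cycle:

```python
# Back-of-the-envelope peak single-precision flops (clocks are assumptions).
G80_SPS, G80_CLK = 128, 1.35e9           # 128 scalar ALUs at ~1.35 GHz
G80_FLOPS = G80_SPS * G80_CLK * 2        # multiply-add = 2 flops per cycle

C2_CLK = 2.66e9                          # assumed Core 2 clock
C2_FLOPS_PER_CORE = 8 * C2_CLK           # 4-wide mul + 4-wide add per cycle
DUO = 2 * C2_FLOPS_PER_CORE
QUAD = 4 * C2_FLOPS_PER_CORE

print(G80_FLOPS / DUO, G80_FLOPS / QUAD)
```

That comes out around 8x a Duo and 4x a Quad, the same ballpark as the 10x/5x above.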
The two things that characterize GPU architectures are:
Small caches: more transistors are devoted to processing.
Massive parallelism: it's much more efficient to execute four or eight thousand threads at once, with each thread running slowly, than to execute 2 threads with each thread running very quickly.
The ability to fit 128 scalar processing units is a side-effect of the free space and the efficiency that comes with massive hyperthreading.
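A toy latency-hiding model makes the parallelism point concrete (all numbers here are invented for illustration): when every memory access stalls a thread for many cycles, throughput depends on having enough threads in flight to cover the stall, not on any single thread being fast:

```python
# Toy model (numbers invented): each thread alternates 1 cycle of compute
# with a 100-cycle memory stall. With enough resident threads the core can
# issue an instruction every cycle; with only a few, it mostly sits idle.
def utilization(resident_threads, stall_cycles=100):
    # Each thread is runnable once every (1 + stall_cycles) cycles, so the
    # core is saturated once resident_threads covers that window.
    return min(1.0, resident_threads / (1 + stall_cycles))

assert utilization(2) < 0.02       # 2 fast threads: core almost entirely idle
assert utilization(4000) == 1.0    # thousands of slow threads: full throughput
```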
EDIT: never mind about small caches. The total transistor count devoted to SRAM is actually pretty large, but the bits get used in many ways that are inefficient for our purposes. There are half a dozen different cache types (texture, vertex, constants, output, z, stencil) and a huge register file. So GPUs don't really have extra space after all.
They do, however, benefit from sitting at a better place on the frequency-transistor curve. (Chip manufacturers can increase either frequency or transistor count before too many chips start coming off the assembly line broken.) Because GPUs don't care about raw frequency, they run on the order of 1 GHz but play with nearly a billion transistors. The curve itself is also better for GPUs because they're less sensitive to manufacturing defects: if some functional unit is broken, NVIDIA can usually just turn it off and sell the chip as a GTS.