CUDA vs CPU and how to connect them?

At first I thought CUDA was a generalization of the GPU programming model, but now I realize the GPU has some features that CUDA is missing. The obvious one is SIMD support: we have float2, float3, and float4, but no built-in SIMD functions for these structures.
Swizzling is another missing feature. So I wonder whether there are cases where the shader version of a problem would be faster than the CUDA version by exploiting those SIMD functions. If so, could we use GPU shaders and CUDA at the same time? For example, could I use CUDA to load data onto the device, process the data with a GPU shader program, and then return the result to the CPU via CUDA?
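
Something like the sketch below is what I have in mind, if the interop calls allow it. This is a rough sketch using what I understand to be the CUDA/OpenGL buffer-interop calls (cudaGLRegisterBufferObject and friends); the GL setup and the shader pass itself are omitted, and the variable names are made up:

#include <GL/gl.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// Assumes a GL context is current and `vbo` is an existing GL buffer object
// holding n floats; error checking omitted.
void roundTrip(GLuint vbo, const float* hostIn, float* hostOut, size_t n)
{
    cudaGLRegisterBufferObject(vbo);            // let CUDA see the GL buffer

    // 1. Use CUDA to load the input data onto the device.
    void* devPtr = 0;
    cudaGLMapBufferObject(&devPtr, vbo);
    cudaMemcpy(devPtr, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaGLUnmapBufferObject(vbo);

    // 2. ...run the GPU shader program over `vbo` here...

    // 3. Read the shader's result back to the CPU via CUDA.
    cudaGLMapBufferObject(&devPtr, vbo);
    cudaMemcpy(hostOut, devPtr, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaGLUnmapBufferObject(vbo);

    cudaGLUnregisterBufferObject(vbo);
}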

Any ideas are appreciated.

The G80 architecture is fully scalar, so there is no need to pack data into 4-float vectors anymore.

However, I have found swizzling useful sometimes… but you can do it manually, because you can manage pointers and components as desired. For example, to do an xxyz swizzle:

float4 aData, swizzledData;
swizzledData.x = aData.x;
swizzledData.y = aData.x;
swizzledData.z = aData.y;
swizzledData.w = aData.z;

and it will be as fast as the DX9 equivalent:

  float4 aData, swizzledData;
  swizzledData = aData.xxyz;
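
Inside a kernel, the same manual swizzle might look like this (just a sketch; the kernel and array names are made up):

__global__ void swizzleKernel(const float4* in, float4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 aData = in[i];
        float4 swizzledData;
        swizzledData.x = aData.x;   // xxyz swizzle, written out by hand
        swizzledData.y = aData.x;
        swizzledData.z = aData.y;
        swizzledData.w = aData.z;
        out[i] = swizzledData;
    }
}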

"Super-scalar" doesn't mean it scales superly. The G80 is just scalar, which means it doesn't use vectors. Super-scalar actually means that a single thread of execution gets re-chopped by the CPU as it executes, to figure out which parts of it can run in parallel. That wastes swaths of transistors, and the whole point of GPUs is that they're not super-scalar.

Also: if you remember, the 7900 had 16 pipelines with 2 x float4 ALUs each. The 8800 has 16 multiprocessors with 8 float1 ALUs each. What NVIDIA (and also ATI) realized is that all those SIMD instructions would work just as well if you considered each component as its own thread. The thing is, all the threads have to execute the exact same instruction anyway (it's always been that way; that's why branching has to be emulated), so it's almost an illusion and it's the same thing either way. The G80 could still be called SIMD, although NVIDIA uses "MIMD" (which is kind of dumb). I have to say this was rather clever, though I'm not sure why transistor counts had to double in the G80 (longer pipelines for higher clocks?).
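
To illustrate, here is a rough sketch of the same multiply-add written both ways: once with a float4 per thread (the old SIMD style) and once with one float per thread and 4x as many threads. The kernels are made up for illustration only, not benchmarked code:

// "SIMD style": each thread handles one float4 element.
__global__ void madVec4(const float4* a, const float4* b, float4* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 av = a[i], bv = b[i], cv = c[i];
        cv.x += av.x * bv.x;
        cv.y += av.y * bv.y;
        cv.z += av.z * bv.z;
        cv.w += av.w * bv.w;
        c[i] = cv;
    }
}

// Scalar style: each thread handles one component (launch 4x as many threads).
__global__ void madScalar(const float* a, const float* b, float* c, int nComponents)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nComponents)
        c[i] += a[i] * b[i];
}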

Thanks a lot.

Now I can understand why the performance can be 100x: since normally compiled C code is not SIMD (that has to be done by hand), a fully optimized CUDA program can be 16 x 8 = 128 times faster than the equivalent non-SIMD C code.

Oh true, not super-scalar, just scalar.

The power of the G80 basically comes from two things:

  1. The G80 executes instructions in fewer cycles. For example, a dot product or a MAD on a normal CPU takes something like 10 cycles; the G80 can do a dot/MAD in one cycle. SSE4/5 can help here.

  2. The G80 is like a multicore CPU but with many more cores (like 128 shading “units”). That's due to the grid/thread-block/warp design (see the sketch below).

So if you combine both, a 100x speedup is easy to get. However, CPUs are sometimes faster if your task has a lot of branching or performs non-linear operations.
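
A minimal sketch of that grid/thread-block layout, using a made-up vector-add kernel with 128 threads per block (the hardware further splits each block into 32-thread warps):

// One thread per element; the grid is sized to cover n elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host side: pick a block size and enough blocks to cover the data.
void launchVecAdd(const float* dA, const float* dB, float* dC, int n)
{
    int threadsPerBlock = 128;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
}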

The G80 can't do a dot product in one cycle, because it's fully scalar; it now takes 4 cycles. A Core 2 Duo, by the way, can do a vec4 multiply-add in one cycle (single precision) in each of its cores, and it also has a GHz advantage. In terms of peak flops, the G80 is about 10x a Core 2 Duo (and only about 5x a Quad). Memory bandwidth is also about 10x, but cache size is more like 1/10th.
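
Rough back-of-the-envelope for that 10x figure; the clock rates and per-cycle widths below are my own assumptions (roughly an 8800 GTX vs a 2.4 GHz Core 2), not official numbers:

#include <stdio.h>

int main(void)
{
    // G80: 128 scalar ALUs, ~1.35 GHz shader clock, MAD = 2 flops/cycle.
    double g80       = 128 * 1.35e9 * 2;     // ~346 GFLOPS
    // Core 2 Duo: 2 cores, ~2.4 GHz, 4-wide SSE mul + 4-wide add = 8 flops/cycle.
    double core2duo  = 2 * 2.4e9 * 8;        // ~38 GFLOPS
    double core2quad = 2 * core2duo;         // ~77 GFLOPS

    printf("G80 / Duo  ~ %.1fx\n", g80 / core2duo);    // prints ~9x
    printf("G80 / Quad ~ %.1fx\n", g80 / core2quad);   // prints ~4.5x
    return 0;
}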

The two things that characterize GPU architectures are:

  1. Small caches: more transistors are devoted to processing.

  2. Massive parallelism: it's much more efficient to execute four or eight thousand threads at once, each running slowly, than to execute 2 threads, each running very quickly.

The ability to fit 128 scalar processing units is a side-effect of the free space and the efficiency that comes with massive hyperthreading.

EDIT: never mind about small caches. The total number of transistors devoted to SRAM is actually pretty large, but the bits get used in many ways that are inefficient for our purposes. There are half a dozen different cache types (texture, vertex, constants, output, z, stencil) and a huge register file. So GPUs don't really have extra space after all.

They do, however, benefit from sitting in a better place on the frequency-transistor curve. (Chip manufacturers can increase either frequency or transistor counts only so far before too many chips start coming off the assembly line broken.) Because GPUs don't care about raw frequency, they run on the order of 1 GHz but play with nearly a billion transistors. The curve itself is better for GPUs because they're less sensitive to manufacturing defects: if some functional part is broken, NVIDIA can usually just turn it off and sell the chip as a GTS.