CUDA vs CPU and how to connect them?

At first I thought CUDA was a generalization of the GPU programming model, but now I realize the GPU has some features that CUDA is missing. The obvious one is SIMD support: we have float2, float3, and float4, but no built-in SIMD functions for these structures.
Swizzling is another missing feature. So I wonder whether there are cases where the shader version of a problem would be faster than the CUDA version by exploiting those SIMD functions. If so, could we use GPU shaders and CUDA at the same time? For example, could I use CUDA to load data onto the device, process the data with a GPU shader program, and then return the result to the CPU via CUDA?
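
Something like the sketch below is what I have in mind, if the interop calls allow it. This is a rough sketch using what I understand to be the CUDA/OpenGL buffer-interop calls (cudaGLRegisterBufferObject and friends); the GL setup and the shader pass itself are omitted, and the variable names are made up:

#include <GL/gl.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// Assumes a GL context is current and `vbo` is an existing GL buffer object
// holding n floats; error checking omitted.
void roundTrip(GLuint vbo, const float* hostIn, float* hostOut, size_t n)
{
    cudaGLRegisterBufferObject(vbo);            // let CUDA see the GL buffer

    // 1. Use CUDA to load the input data onto the device.
    void* devPtr = 0;
    cudaGLMapBufferObject(&devPtr, vbo);
    cudaMemcpy(devPtr, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaGLUnmapBufferObject(vbo);

    // 2. ...run the GPU shader program over `vbo` here...

    // 3. Read the shader's result back to the CPU via CUDA.
    cudaGLMapBufferObject(&devPtr, vbo);
    cudaMemcpy(hostOut, devPtr, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaGLUnmapBufferObject(vbo);

    cudaGLUnregisterBufferObject(vbo);
}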

Any ideas are appreciated.

The G80 architecture is fully scalar, so there is no need to pack data into 4-float vectors anymore.

However, I have found swizzling useful sometimes… but you can do it manually, because you can manage pointers and components as desired. For example, to do an xxyz swizzle:

float4 aData, swizzledData;
swizzledData.x = aData.x;
swizzledData.y = aData.x;
swizzledData.z = aData.y;
swizzledData.w = aData.z;

and it will be as fast as the DX9 equivalent:

  float4 aData, swizzledData;
  swizzledData = aData.xxyz;
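
Inside a kernel, the same manual swizzle might look like this (just a sketch; the kernel and array names are made up):

__global__ void swizzleKernel(const float4* in, float4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 aData = in[i];
        float4 swizzledData;
        swizzledData.x = aData.x;   // xxyz swizzle, written out by hand
        swizzledData.y = aData.x;
        swizzledData.z = aData.y;
        swizzledData.w = aData.z;
        out[i] = swizzledData;
    }
}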

"Super-scalar" doesn't mean it scales superly. The G80 is just scalar, which means it doesn't use vectors. Super-scalar actually means that a single thread of execution gets re-chopped by the CPU as it executes, to figure out which parts of it can run in parallel. That wastes swaths of transistors, and the whole point of GPUs is that they're not super-scalar.

Also: if you remember, the 7900 had 16 pipelines with 2 x float4 ALUs each. The 8800 has 16 multiprocessors with 8 float1 ALUs each. What NVIDIA (and also ATI) realized is that all those SIMD instructions would work just as well if you considered each component as its own thread. The thing is, all the threads have to execute the exact same instruction anyway (it's always been that way; that's why branching has to be emulated), so it's almost an illusion and it's the same thing either way. The G80 could still be called SIMD, although NVIDIA uses "MIMD" (which is kind of dumb). I have to say this was rather clever, though I'm not sure why transistor counts had to double in the G80 (longer pipelines for higher clocks?).
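
To illustrate, here is a rough sketch of the same multiply-add written both ways: once with a float4 per thread (the old SIMD style) and once with one float per thread and 4x as many threads. The kernels are made up for illustration only, not benchmarked code:

// "SIMD style": each thread handles one float4 element.
__global__ void madVec4(const float4* a, const float4* b, float4* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 av = a[i], bv = b[i], cv = c[i];
        cv.x += av.x * bv.x;
        cv.y += av.y * bv.y;
        cv.z += av.z * bv.z;
        cv.w += av.w * bv.w;
        c[i] = cv;
    }
}

// Scalar style: each thread handles one component (launch 4x as many threads).
__global__ void madScalar(const float* a, const float* b, float* c, int nComponents)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nComponents)
        c[i] += a[i] * b[i];
}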

Thanks a lot.

Now I can understand why the performance can be 100x: since normally compiled C code is not SIMD (that has to be done by hand), a fully optimized CUDA program can be 16 x 8 = 128 times faster than the equivalent non-SIMD C code.

Oh true, not super-scalar, just scalar.

The power of the G80 basically comes from two things:

  1. The G80 executes instructions in fewer cycles. For example, a dot product or a MAD on a normal CPU takes something like 10 cycles; the G80 can do a dot/MAD in one cycle. SSE4/5 can help here.

  2. The G80 is like a multicore CPU but with many more cores (like 128 shading “units”). That's due to the grid/thread-block/warp design (see the sketch below).

So if you combine both, a 100x speedup is easy to get. However, CPUs are sometimes faster if your task has a lot of branching or performs non-linear operations.
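
A minimal sketch of that grid/thread-block layout, using a made-up vector-add kernel with 128 threads per block (the hardware further splits each block into 32-thread warps):

// One thread per element; the grid is sized to cover n elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host side: pick a block size and enough blocks to cover the data.
void launchVecAdd(const float* dA, const float* dB, float* dC, int n)
{
    int threadsPerBlock = 128;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
}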

The G80 can't do a dot product in one cycle, because it's fully scalar; it now takes 4 cycles. A Core 2 Duo, by the way, can do a vec4 multiply-add in one cycle (single precision) in each of its cores, and it also has a GHz advantage. In terms of peak flops, the G80 is about 10x a Core 2 Duo (and only about 5x a Quad). Memory bandwidth is also about 10x, but cache size is more like 1/10th.
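
Rough back-of-the-envelope for that 10x figure; the clock rates and per-cycle widths below are my own assumptions (roughly an 8800 GTX vs a 2.4 GHz Core 2), not official numbers:

#include <stdio.h>

int main(void)
{
    // G80: 128 scalar ALUs, ~1.35 GHz shader clock, MAD = 2 flops/cycle.
    double g80       = 128 * 1.35e9 * 2;     // ~346 GFLOPS
    // Core 2 Duo: 2 cores, ~2.4 GHz, 4-wide SSE mul + 4-wide add = 8 flops/cycle.
    double core2duo  = 2 * 2.4e9 * 8;        // ~38 GFLOPS
    double core2quad = 2 * core2duo;         // ~77 GFLOPS

    printf("G80 / Duo  ~ %.1fx\n", g80 / core2duo);    // prints ~9x
    printf("G80 / Quad ~ %.1fx\n", g80 / core2quad);   // prints ~4.5x
    return 0;
}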

The two things that characterize GPU architectures are:

  1. Small caches: more transistors are devoted to processing.

  2. Massive parallelism: it's much more efficient to execute four or eight thousand threads at once, each running slowly, than to execute 2 threads, each running very quickly.

The ability to fit 128 scalar processing units is a side-effect of the free space and the efficiency that comes with massive hyperthreading.

EDIT: never mind about small caches. The total number of transistors devoted to SRAM is actually pretty large, but the bits get used in many ways that are inefficient for our purposes. There are half a dozen different cache types (texture, vertex, constants, output, z, stencil) and a huge register file. So GPUs don't really have extra space after all.

They do, however, benefit from sitting in a better place on the frequency-transistor curve. (Chip manufacturers can increase either frequency or transistor counts only so far before too many chips start coming off the assembly line broken.) Because GPUs don't care about raw frequency, they run on the order of 1 GHz but play with nearly a billion transistors. The curve itself is better for GPUs because they're less sensitive to manufacturing defects: if some functional part is broken, NVIDIA can usually just turn it off and sell the chip as a GTS.