Why scalar processors?

Kiran_CUDA · June 23, 2009, 11:01am

Hi,

I am wondering why NVIDIA replaced the vector processor with scalar processors (SPs) in each multiprocessor (MP). An SP can operate on only a single component at a time, while a vector processor could work on many components simultaneously, thereby increasing the speed manyfold.
Since I am not an expert in parallel programming, I am unable to digest this fact :rolleyes:
Any pointers in this direction?

Thanks

seibert · June 23, 2009, 11:47am

It is a question of chip area. If all you were doing was manipulating 3 or 4-component vectors, then a special vector arithmetic unit would speed things up. The problem is that the vector arithmetic unit takes up more space than a scalar unit, and the vector unit might not be fully used if your code is not operating on vectors. (In addition, vector instructions increase the complexity of the instruction set, requiring more chip area for the instruction decoder.)

When creating CUDA, NVIDIA decided that it would be a better tradeoff for general-purpose computing to eliminate the special vector hardware and put more scalar processors on the chip instead.

Kiran_CUDA · June 23, 2009, 12:08pm

Thanks Seibert!!

The points about increased chip area and a simpler instruction set make sense. But do you think these were the only driving forces behind the choice of SPs? I have seen NVIDIA's Quadro Plex solutions, which ship in a box as large as a PC. How, then, is size such an important parameter in this decision? I would also be more convinced if you could give me an idea of how much the area would grow if vector processors were used instead of SPs. (For example, if using vector processors doubles the area, I don’t think they are a bad idea, since the speed gain would be twofold as well.)

MisterAnderson42 · June 23, 2009, 2:08pm

I can’t say why NVIDIA made the decision. But for me as an application developer working on compute only applications, scalar processors are the only thing that make sense. Some (maybe even most) general purpose compute applications are nearly impossible to write so that all vector elements are being used all the time. And if it is even possible, the time spent packing/unpacking memory can overwhelm the savings.

CUDA’s breaking up the “vector” elements into independent threads in a warp is just such a natural environment in which to program. One has to assume that this ease of programming was at least one of the considerations that were made by the design team. It has certainly played a big role in the widespread adoption of CUDA with many scientists (who are not hardware experts) getting great speedups for their calculations using CUDA.

Kiran_CUDA · June 23, 2009, 2:27pm

Thanks MisterAnderson…

Well, could you please elaborate on “an application developer working on compute only applications, scalar processors are the only thing that make sense”?

avidday · June 23, 2009, 2:59pm

I recall seeing a lecture by John Nickolls where he discussed the evolution of NVIDIA GPUs away from a VLIW SIMD architecture towards the current scalar, RISC approach. He mentioned that scheduling granularity was one of the main advantages of the G80 over earlier vector designs, where imperfect instruction pipelining without cache can incur enormous occupancy and branch divergence penalties.

If you want a SIMD GPU, just wait for Larrabee. But I suspect you will be disappointed.

nitin.life · June 23, 2009, 3:44pm

ONE LINE ANSWER : ONLY NVIDIA KNOWS. :)

I am not a computer hardware person, but I guess NVIDIA considered the cost-to-performance benefit and the targeted user base when designing the GPU architecture.

The Tesla architecture was designed for graphics processing (CUDA came a little later), so cost/performance is a very important criterion. A vector processor would take up more space (more money and power), and would be more complex to design and to extract full performance from.

I have experience with the IBM Cell processor (there is some level of vectorization in the SPEs). It is more complex to program (there are multiple pipelines for different operations, etc.), so it is not that easy to extract performance, and it also COSTS much more. It serves a more general application base rather than just core computations.

If the Tesla architecture had been targeted only at compute applications, then maybe NVIDIA might have considered an extra vector unit, as the extra power and money would be justified. But graphics cards will also be used by everyday programmers, so NVIDIA will think about its core competencies first.

That’s my take (I MAY BE TOTALLY WRONG)… I would like to see what the NVIDIA guys have to say about this…

Gregory_Diamos · June 23, 2009, 4:10pm

I do not consider the current architecture to be a scalar design, but rather a compromise between a scalar and a vector design.

When people think of vector machines today, they think of Intel SSE and maybe IBM AltiVec, where there is a separate unit that is explicitly issued instructions to operate on many data elements. The problem with these programming models is that they cannot handle control flow: all functional units in the vector unit always have to perform the same computation. There have been several proposals to solve this problem with predication, where certain functional units in the vector unit are turned off depending on the value of a mask register.

For example, consider a 4-way vector machine. The code:

    a[4] = {1, 2, 3, 4};
    b[4] = {1, 2, 3, 4};
    c[4] = a + b;

can be handled by a standard vector unit. However, when you add control flow:

    condition[4] = {0, 1, 0, 1};
    a[4] = {1, 2, 3, 4};
    b[4] = {1, 2, 3, 4};
    c[4];

    if( condition[4] )
    {
        c = a + b;
    }
    else
    {
        c = a - b;
    }

A traditional vector unit cannot handle this. It can be handled by adding predication, which will be executed logically as

    if(1) : a[4] = {1, 2, 3, 4};
    if(1) : b[4] = {1, 2, 3, 4};
    if( condition[4] ) : c = a + b;
    if( !condition[4] ) : c = a - b;

In this case, you are performing different computation depending on the value of condition, but you are executing on a vector unit rather than running four scalar units in parallel. From what I have read, NVIDIA’s architecture works like this. There is a logical scalar programming model that is mapped onto a physical vector unit. They have a very novel method of handling back edges (loops) as well, which is probably too complicated for me to describe here.
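
The predicated execution described above can be sketched in a few lines of Python. This is only a rough model of a hypothetical 4-lane vector unit with a per-lane mask; the names `LANES`, `vector_op` and `predicated_if_else` are made up for illustration and do not correspond to any real instruction set or to NVIDIA's actual hardware.

```python
import operator

LANES = 4  # width of the hypothetical vector unit


def vector_op(op, a, b, dest, mask):
    """Apply op on every lane; lanes whose mask bit is False keep their old dest value."""
    return [op(x, y) if m else d for x, y, d, m in zip(a, b, dest, mask)]


def predicated_if_else(condition, a, b):
    """Execute both sides of the branch as full-width vector ops under opposite masks."""
    c = [0] * LANES
    mask = [bool(v) for v in condition]
    # if( condition ) : c = a + b
    c = vector_op(operator.add, a, b, c, mask)
    # if( !condition ) : c = a - b
    c = vector_op(operator.sub, a, b, c, [not m for m in mask])
    return c


if __name__ == "__main__":
    # Same data as the example in the post
    print(predicated_if_else([0, 1, 0, 1], [1, 2, 3, 4], [1, 2, 3, 4]))  # [0, 4, 0, 8]
```

Note that both the add and the subtract are issued across the whole vector, so the unit spends two instruction slots even though each lane only needs one of the two results.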

sergeyn · June 23, 2009, 6:21pm

As far as I’m aware, ATI uses 5-vector instructions in their recent hardware. They also have a CUDA-like system for general-purpose computation (though it is not that popular yet).

Gregory_Diamos · June 23, 2009, 6:42pm

They use a VLIW instruction set rather than a vector instruction set. In a VLIW machine, multiple operations are packed into a single instruction. For example, the example that I gave above:

    condition[4] = {0, 1, 0, 1};
    a[4] = {1, 2, 3, 4};
    b[4] = {1, 2, 3, 4};
    c[4];

    if( condition[4] )
    {
        c = a + b;
    }
    else
    {
        c = a - b;
    }

could be compiled into

    ( MOV, MOV, MOV, MOV ) a, {1, 2, 3, 4};
    ( MOV, MOV, MOV, MOV ) b, {1, 2, 3, 4};
    ( SUB, ADD, SUB, ADD ) c, a, b;
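
As a rough illustration of the VLIW idea, the sketch below models one instruction as a bundle of independent operations, one per slot. It assumes, as in the example above, that the condition is a constant known at compile time, which is what lets the compiler choose ADD or SUB per slot. The names `compile_bundle` and `execute_bundle` are hypothetical and not part of any real toolchain.

```python
import operator

# Operations that a slot of the hypothetical VLIW instruction can hold
OPS = {"ADD": operator.add, "SUB": operator.sub}


def compile_bundle(condition):
    """Choose the operation for each slot from a condition known at compile time."""
    return tuple("ADD" if flag else "SUB" for flag in condition)


def execute_bundle(bundle, a, b):
    """Issue one VLIW instruction: every slot applies its own operation to its own operands."""
    return [OPS[name](x, y) for name, x, y in zip(bundle, a, b)]


if __name__ == "__main__":
    bundle = compile_bundle([0, 1, 0, 1])
    print(bundle)                                            # ('SUB', 'ADD', 'SUB', 'ADD')
    print(execute_bundle(bundle, [1, 2, 3, 4], [1, 2, 3, 4]))  # [0, 4, 0, 8]
```

Unlike the predicated vector version, the different operations here sit in a single instruction, but the packing is fixed when the code is compiled rather than decided by a run-time mask.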

As for having a CUDA-like system, they have been talking about their CTM architecture for several years, though the tools seem far less mature and require you to program in terms of assembly code and low-level commands for setting up and starting kernels, rather than having a high-level language like CUDA.

They seem to have quietly posted documentation of their intermediate language for issuing GPU commands http://developer.amd.com/gpu_assets/Interm…m_Processor.pdf , as well as a user guide for their SDK http://developer.amd.com/gpu_assets/Stream…_User_Guide.pdf . I have not tried any of this out yet, but it looks like they have a compiler from Brook++ to their IR. For people who are new to Brook, it was a language developed at Stanford for the Merrimac supercomputer ( http://merrimac.stanford.edu/ ), and some of the early GPGPU work (predating CUDA) was done with a modified version of Brook http://www-graphics.stanford.edu/projects/…kgpu/index.html . I personally think that CUDA and OpenCL are more expressive than Brook, but it has some interesting language features that let you do more aggressive optimizations if you can express your program in Brook.

tmurray · June 23, 2009, 6:53pm

As an aside, Ian Buck, whose PhD thesis was on BrookGPU, is also the creator of CUDA.

sergeyn · June 23, 2009, 7:13pm

CTM is dead. They use a high-level CUDA-like language now.

Kiran_CUDA · June 24, 2009, 8:55am

Avidday, I am a little confused. You said Larrabee will be a SIMD GPU. My question is: aren’t NVIDIA’s CUDA-compatible GPUs SIMD GPUs?

Also, could you please explain what is meant by “scheduling granularity”?

Kiran_CUDA · June 24, 2009, 9:12am

Thanks Gregory!!!

Could you please provide some references in support of your statement: “From what I have read, NVIDIA’s architecture works like this. There is a logical scalar programming model that is mapped onto a physical vector unit”? I want to understand the second sentence fully, and I know it will answer many of my questions.

I also want to delve deeper into the CUDA architecture and would like to know how it handles back edges (loops). Could you please point me to some links or files describing these things in detail?

Thanks Gregory for your time

cho · June 24, 2009, 10:25am

They are not really scalar processors.

Simon_Green · June 24, 2009, 11:31am

As an aside, much of the justification for the transition from a vector to scalar GPU architecture was actually driven by analysis of pixel shaders of the time - when G80 was designed, there wasn’t a lot of compute code around. It was found that a large portion of pixel shader instructions were scalar and therefore not making optimal use of the 4-vector hardware.
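
To make the utilization argument concrete, here is a small sketch that computes what fraction of a 4-wide vector unit's lanes do useful work over a stream of instructions. The instruction mix in the example is made up purely for illustration and is not taken from any shader analysis; `lane_utilization` is a hypothetical helper.

```python
def lane_utilization(instruction_widths, vector_width=4):
    """Fraction of vector lanes doing useful work over an instruction stream.

    instruction_widths lists how many components each instruction actually
    needs (1 for a scalar operation, up to vector_width for a full vector one).
    """
    if not instruction_widths:
        raise ValueError("need at least one instruction")
    if any(w < 1 or w > vector_width for w in instruction_widths):
        raise ValueError("each width must be between 1 and vector_width")
    used = sum(instruction_widths)
    available = vector_width * len(instruction_widths)
    return used / available


if __name__ == "__main__":
    # Hypothetical mix: one full 4-vector op, three scalar ops, one 3-wide, one 2-wide
    print(lane_utilization([4, 1, 1, 3, 1, 2]))  # 0.5
```

With a mix like this, half of the vector lanes sit idle, whereas scalar units fed one component per thread only ever issue work that is actually needed.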

avidday · June 24, 2009, 11:54am

NVIDIA GPUs clearly aren’t SIMD in the usual sense. They have no vector instructions, they have no vector units.

As for scheduling granularity, any introductory textbook on operating system theory should contain everything you need to know.

Kiran_CUDA · June 24, 2009, 12:28pm

Hi Cho… can you explain in detail, as per your understanding, what type of processors they are? :wacko:

Gregory_Diamos · June 24, 2009, 12:31pm

Sure Kiran,

Basic information supporting this is available in the programming guide.

“A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. [1] If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.”

I say that I think a logical scalar model is mapped onto vector hardware because all threads in a warp execute a single instruction. In this case, you can think of a GPU as a 32-wide vector machine.

A warp in this statement is NVIDIA’s term for a group of logical threads that are executed together (in the example that I gave before the warp size would be 4). The second sentence, [1], suggests that the hardware uses predication to handle control flow as per my second example above.
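
The behaviour in the quoted passage can be pictured with a small simulation. This is a sketch under simplifying assumptions, not a description of the real hardware: a warp of 32 threads runs an if/else, each path is issued only if at least one thread takes it, and threads not on the current path are disabled. The function name `run_warp` and the add/subtract bodies are illustrative choices.

```python
WARP_SIZE = 32


def run_warp(conditions, a, b):
    """Run 'c = a + b if condition else a - b' for one warp.

    Returns (results, passes), where passes counts how many branch paths
    the warp had to issue one after the other.
    """
    if not (len(conditions) == len(a) == len(b) == WARP_SIZE):
        raise ValueError("a warp needs exactly WARP_SIZE threads")
    taken = [bool(c) for c in conditions]
    results = [None] * WARP_SIZE
    passes = 0
    if any(taken):
        # "then" path: threads with a false condition are disabled
        passes += 1
        for t in range(WARP_SIZE):
            if taken[t]:
                results[t] = a[t] + b[t]
    if not all(taken):
        # "else" path: threads with a true condition are disabled
        passes += 1
        for t in range(WARP_SIZE):
            if not taken[t]:
                results[t] = a[t] - b[t]
    # All threads have a result here: the warp has reconverged.
    return results, passes


if __name__ == "__main__":
    a, b = list(range(WARP_SIZE)), [1] * WARP_SIZE
    print(run_warp([1] * WARP_SIZE, a, b)[1])                     # 1: all threads agree
    print(run_warp([t % 2 for t in range(WARP_SIZE)], a, b)[1])  # 2: the warp diverges
```

When all 32 threads agree, only one path is issued, which matches the "full efficiency" case in the quote; a divergent warp pays for both paths.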

I have tried before to explain handling of loops and I still don’t have a really intuitive way of doing it. Here are some references that hint at how I think it is done:

http://www.freepatentsonline.com/6947047.html - NVIDIA patent describing serialization

http://graphics.cs.uni-sb.de/~woop/rpu/RPU_SIGGRAPH05.pdf - First published paper (that I am aware of) describing how to handle branches

http://www.google.com/url?sa=U&start=1…-J1I670rwV0IQSw - Describes how you can use a combination of compiler analysis and a hardware scheduler to dynamically map threads onto a vector processor

http://www.ece.ubc.ca/~aamodt/papers/gpgpusim.ispass09.pdf - Overview of NVIDIA-like GPU architecture

Here is another quote from Sylvain Collange’s tech report on Barra:

“As the underlying hardware is a vector processor, threads are grouped together in a so called warps which operate on vector registers. Therefore the warp size is 32. At each cycle an instruction is executed on a warp by a multiprocessor.”

Read section 4.4.3 in the same tech report ( http://hal.archives-ouvertes.fr/docs/00/37…instruction.pdf ) to see an example of how back edges can be handled.
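
For a first intuition about back edges, the sketch below uses a deliberately simplified model: the warp keeps re-executing the loop body as long as any thread still wants another iteration, and threads that have finished are masked off. This is an assumption made for illustration only; it is not the mechanism described in the patent or in section 4.4.3 of the Barra report, and `run_loop` is a hypothetical name.

```python
def run_loop(trip_counts):
    """Simulate a warp in which thread t wants to run a loop body trip_counts[t] times.

    Returns (iterations_per_thread, warp_iterations), where warp_iterations
    is how many times the warp as a whole had to execute the loop body.
    """
    remaining = list(trip_counts)
    done = [0] * len(remaining)
    active = [r > 0 for r in remaining]
    warp_iterations = 0
    while any(active):
        warp_iterations += 1
        for t, is_active in enumerate(active):
            if is_active:  # masked-off threads do nothing in this pass
                done[t] += 1
                remaining[t] -= 1
        # Back edge: threads whose loop condition is now false leave the mask
        active = [r > 0 for r in remaining]
    return done, warp_iterations


if __name__ == "__main__":
    print(run_loop([1, 3, 2, 0]))  # ([1, 3, 2, 0], 3)
```

In this model the warp takes as many passes as its slowest thread, so a single thread with a long loop holds the rest of the warp's lanes idle.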

Kiran_CUDA · June 26, 2009, 12:00pm

Thanks a ton, Gregory, for your detailed information! Let me dig deeper and I will get back to you again :)
