SIMD versus SIMT: what is the difference between SIMT and SIMD?

Hi,

Can anybody explain exactly how SIMT is different from SIMD?

Thanks

SIMT = SIMD with multithreading.

Tesla (and presumably Fermi, NVIDIA's next-generation GPU) is a SIMD-style architecture that runs a single kernel at a time.

Cho: GeForce is not SIMD?

CUDA-enabled GPUs are not strictly SIMD, but they are very similar. With the SIMT approach, though, you can largely ignore the SIMD behavior and write branches, which makes development much easier.

GeForce and Tesla use basically the same GPU, so whatever applies to one, applies to the other. (Except having a video connector. :) )

The difference is that in a SIMD architecture you work directly with SIMD registers - for example, x86 SSE gives you 8 XMM registers (16 in 64-bit mode).

SIMD-specific instructions include: memory load and store; data interleaving, exchanging, unpacking, and merging; swizzle, shuffle, and splat; data bit expansion and reduction; bit mangling and wrangling; vector integer and floating-point addition, subtraction, multiplication, and division, with and without saturation, packed and unpacked; and multiply-accumulate.

All of these instructions have versions dealing with bytes, words, and double words.

There are more than 200 combinations - not easy to comprehend - so I am writing a visual SIMD simulator application and code generator to make SIMD programming easier.

SIMT = SIMD hardware with MIMD programming model

In general, data parallelism is much more important than task parallelism, because only data sizes can scale up. Therefore, you want to minimize the amount of instruction-processing hardware by using a SIMD architecture. But old-fashioned SIMD is clumsy to program, so I think NVIDIA was wise to create SIMT: it retains the efficiency of SIMD but with much more flexibility (none of that SSE grief of massaging data into and out of vector registers), by presenting the illusion that all threads are independent.

From my observations, most performance-critical code does not have many divergent execution paths, so SIMT should be completely adequate and full MIMD is not needed.

Currently, CUDA only allows each group of 32 adjacent threads to benefit from SIMT, which can cause low throughput if there is a lot of control-flow divergence. But they could always relax that restriction and allow more threads that are at the same program location to benefit from SIMT.

It’s actually a bit more than that: there is hardware support for detecting control-flow divergence and enabling/disabling SIMD channels at runtime, which is noticeably lacking in most vector extensions like SSE/AVX. Useful forms of the technology (though not NVIDIA’s approach for handling arbitrary control flow) are covered by patents that expired in 2004, and could be grafted directly onto SSE/AVX (as they are in Intel/AMD GPUs) if x86 vendors had the sense to do it.

Greg,

Are you referring to Larrabee-style predication, or something else?

Some interesting slides from Andy Glew about the differences between SIMT (“coherent vector lane threading”) and SIMD:

http://parlab.eecs.berkeley.edu/sites/all/…glew-vector.pdf

(but some googling around makes me think Uncle Joe has already read it ;) )

I was actually referring to the predicate stack in GEN5/AMD Evergreen and the per-channel instruction pointers in GEN6 [1]. It seems sad to me that architects at Intel/AMD don’t seem to talk to each other enough to transfer some of this knowledge to SSE, but with GPUs moving on-die I hope that SSE will simply cease to exist instead. As far as I know, none of these techniques were used in Larrabee, but I am far from an expert on Larrabee.

Thanks for the presentation. As for Andy’s suggestion of fetching multiple instructions per cycle: my opinion is that one of the major advantages of SIMT techniques is that they reduce instruction fetch bandwidth and the power required to set control paths in the pipeline. The idea of fetching multiple instructions per cycle seems to be gaining steam, based on results from a study a few years ago [2], which I think is flawed because it only considers the impact on area, not dynamic power (mainly due to the required fetch bandwidth).

[1] - http://intellinuxgraphics.org/IHD_OS_Vol4_…_July_28_10.pdf

[2] - http://rigel.crhc.uiuc.edu/pub/micro2008-visarch.pdf

Thanks for the links. I did not realize that Gen6 was using per-channel IPs (and that it actually works).

My understanding of Larrabee is that it was expected to use a stack-based (or counter-based) technique in software. Its ISA makes it easy to allocate a 16-bit register for each predicate mask on the stack, and to use conditional jump instructions to bypass if-else or else-endif blocks when the mask is all-zero.
Since these operations are performed in the scalar x86 portion of the core, which should stand idle most of the time when running SIMT code, the performance impact may not be significant.
Also, if the compiler is good enough, it can detect uniform branches and highly divergent branches and generate more efficient code for those cases.

I think Andy’s main point is that we should explore tradeoffs between SIMD and MIMD: try to achieve close-to-MIMD performance on irregular codes, and close-to-SIMD power efficiency on regular codes (and all variations in between)…