SIMD versus SIMT: what is the difference between SIMT and SIMD?

Hi,

Can anybody explain exactly how SIMT is different from SIMD?

Thanks

SIMT = SIMD with multithreading.

Tesla (and presumably Fermi, NVIDIA's next-generation GPU) is a SIMD-style architecture that runs a single kernel at a time.

Cho: GeForce is not SIMD?

CUDA-enabled GPUs are not strictly SIMD, but they are very similar. With the SIMT approach, though, you can largely ignore the SIMD behavior and write branches, which makes development much easier.

GeForce and Tesla use basically the same GPU, so whatever applies to one, applies to the other. (Except having a video connector. :) )

The difference is that in a SIMD architecture you work directly with SIMD registers - for example, x86 SSE gives you 8 XMM registers (16 in 64-bit mode).

SIMD-specific instructions include: memory load and store; data interleaving, exchanging, unpacking, and merging; swizzle, shuffle, and splat; data bit expansion and reduction; bit mangling and wrangling; vector integer and floating-point addition, subtraction, multiplication, and division, with and without saturation, packed and unpacked; and multiply-accumulate.

All of these instructions have versions dealing with bytes, words, and double words.

There are more than 200 combinations - not easy to comprehend - so I am writing a visual SIMD simulator application and code generator to make SIMD programming easier.

SIMT = SIMD hardware with MIMD programming model

In general, data parallelism is much more important than task parallelism, because only data sizes can scale up. Therefore, you want to minimize the amount of instruction-processing hardware by using a SIMD architecture. But old-fashioned SIMD is clumsy to program, so I think NVIDIA was wise to create SIMT: it retains the efficiency of SIMD but with much more flexibility (none of that SSE grief of massaging data into and out of vector registers), by presenting the illusion that all threads are independent.

From my observations, most performance-critical code does not have many divergent execution paths, so SIMT should be completely adequate and full MIMD is not needed.

Currently, CUDA only allows each group of 32 adjacent threads to benefit from SIMT, which can cause low throughput if there is a lot of control-flow divergence. But they could always relax that restriction and allow more threads that are at the same program location to benefit from SIMT.

It’s actually a bit more than that: there is hardware support for detecting control-flow divergence and enabling/disabling SIMD channels at runtime, which is noticeably lacking in most vector extensions like SSE/AVX. Useful forms of the technology (though not NVIDIA’s approach for handling arbitrary control flow) are covered by patents that expired in 2004, and could be grafted directly onto SSE/AVX (as they are in Intel/AMD GPUs) if x86 vendors had the sense to do it.

Greg,

Are you referring to Larrabee-style predication, or something else?

Some interesting slides from Andy Glew about the differences between SIMT (“coherent vector lane threading”) and SIMD:

http://parlab.eecs.berkeley.edu/sites/all/…glew-vector.pdf

(but some googling around makes me think Uncle Joe has already read it ;) )

I was actually referring to the predicate stack in GEN5/AMD Evergreen and the per-channel instruction pointers in GEN6 [1]. It seems sad to me that architects at Intel/AMD don’t seem to talk to each other enough to transfer some of this knowledge to SSE, but with GPUs moving on-die I hope that SSE will simply cease to exist instead. As far as I know, none of these techniques were used in Larrabee, but I am far from an expert on Larrabee.

Thanks for the presentation. As for Andy’s suggestion of fetching multiple instructions per cycle: my opinion is that one of the major advantages of SIMT techniques is that they reduce instruction fetch bandwidth and the power required to set control paths in the pipeline. The idea of fetching multiple instructions per cycle seems to be gaining steam, based on results from a study a few years ago [2], which I think is flawed because it only considers the impact on area, not dynamic power (mainly due to the required fetch bandwidth).

[1] - http://intellinuxgraphics.org/IHD_OS_Vol4_…_July_28_10.pdf

[2] - http://rigel.crhc.uiuc.edu/pub/micro2008-visarch.pdf

Thanks for the links. I did not realize that Gen6 was using per-channel IPs (and that it actually works).

My understanding of Larrabee is that it was expected to use a stack-based (or counter-based) technique in software. Its ISA makes it easy to allocate a 16-bit register for each predicate mask on the stack, and to use conditional jump instructions to bypass if-else or else-endif blocks when the mask is all-zero.
Since these operations are performed in the scalar x86 portion of the core, which should stand idle most of the time when running SIMT code, the performance impact may not be significant.
Also, if the compiler is good enough, it can detect uniform branches and highly divergent branches and generate more efficient code for those cases.

I think Andy’s main point is that we should explore tradeoffs between SIMD and MIMD: try to achieve close-to-MIMD performance on irregular codes, and close-to-SIMD power efficiency on regular codes (and all variations in between)…