Visual SIMD Simulator Application

Hi thstart,

Can you tell me more about your visual SIMD simulator application and code generator?

It is already patent pending, so I can describe the most interesting features and benefits.

The full list is long and the tool is still in development, so here is a very short list of what it does:

  1. Parallel SIMD architecture tool for simulation and visualization

1.1) Graphical Mapping Tool - Simplifies mapping application requirements to processor resources

1.2) Static Performance Analysis - Speeds up performance estimation and assesses processor resource utilization prior to code generation

  2. High-performance parallel SIMD design-time and runtime binary code generator

2.1) Run time tools

  3. Design time binary code generator

During the design phase, the parallel SIMD architecture tool generates and tests different scenarios with optimized binary code.

  4. Run time binary code generator

At run time, the analysis tool analyzes the data and the creation tool generates the optimal data-dependent binary code.

  5. Data-dependent weighted cost optimization metrics for:

Instruction selection, address calculations, execution domain selection, execution port selection, ordering of instruction sequences, full register length utilization, address alignment, cache usage optimization, software prefetch scheduling distance optimization, and load and store execution bandwidth optimization.

  6. Permutation generation and cost estimation

Using the data-dependent weighted cost optimization metrics, the tool generates and ranks most of the possible permutations.

  7. Custom optimization for the runtime CPU

Generates only the code tailored to the runtime CPU. This can also address the copy protection problem: the code can run on only one machine, and the unused binary is deleted.

  8. Custom data-dependent runtime optimizations

Certain data-dependent optimizations are postponed to runtime, where they can be done more effectively because more information about the data is available.
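
To give a rough feel for the "permutation generation and cost estimation" item in the list above, here is a toy sketch in Python. It is not the tool's actual algorithm: the two metrics, their weights, and the instruction descriptors are all hypothetical, and data dependencies between instructions are ignored for brevity.

```python
from itertools import permutations

# Hypothetical weights; the real tool's metrics and weights are not public.
WEIGHTS = {"port_conflicts": 3.0, "late_loads": 1.0}

# Hypothetical instruction descriptors: (name, execution port, is_load).
INSTRUCTIONS = [
    ("load_a", 2, True),
    ("load_b", 2, True),
    ("shuffle", 5, False),
    ("add", 0, False),
]

def metrics(order):
    """Compute two toy metrics for one instruction ordering."""
    # Adjacent instructions that compete for the same execution port.
    port_conflicts = sum(1 for a, b in zip(order, order[1:]) if a[1] == b[1])
    # Sum of the positions of all loads: loads issued late score worse.
    late_loads = sum(pos for pos, ins in enumerate(order) if ins[2])
    return {"port_conflicts": port_conflicts, "late_loads": late_loads}

def weighted_cost(order):
    """Combine the metrics into one number using the weights."""
    m = metrics(order)
    return sum(WEIGHTS[name] * m[name] for name in WEIGHTS)

def rank_orderings(instructions):
    """Score every permutation and return (cost, names) pairs, cheapest first."""
    scored = [(weighted_cost(p), [ins[0] for ins in p])
              for p in permutations(instructions)]
    return sorted(scored)

ranked = rank_orderings(INSTRUCTIONS)
for cost, names in ranked[:3]:
    print(cost, names)
```

With these made-up weights the cheapest orderings interleave the two loads with another instruction, because a port conflict costs more than issuing one load slightly later.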

There are three costs associated with runtime code generation: creation cost, execution cost, and management cost. In order to win, the savings from using the runtime-created code must exceed the cost of creating and managing that code. This means that for many applications, a fast code generator that creates good code will be superior to a slow code generator that creates excellent code.
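
The break-even argument above can be put into numbers. The sketch below is only an illustration: the function names and all cost figures (think of them as microseconds) are made up, and "saving per call" stands for the execution cost difference between generic code and the runtime-created code.

```python
def net_benefit(creation_cost, management_cost, saving_per_call, calls):
    """Total saving minus the one-time overhead of creating and managing the code."""
    return saving_per_call * calls - (creation_cost + management_cost)

def break_even_calls(creation_cost, management_cost, saving_per_call):
    """Smallest number of calls at which the savings exceed the overhead."""
    overhead = creation_cost + management_cost
    return overhead // saving_per_call + 1

# Hypothetical generators: a fast one producing good code,
# and a slow one producing excellent code.
FAST = dict(creation_cost=50, management_cost=10, saving_per_call=2)
SLOW = dict(creation_cost=5000, management_cost=10, saving_per_call=3)

for label, gen in (("fast", FAST), ("slow", SLOW)):
    print(label, break_even_calls(**gen), net_benefit(calls=1000, **gen))
```

With these assumed figures the fast generator pays off after 31 calls, while the slow one needs 1671 calls and is still at a loss after 1000 calls.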

Byte, word, and doubleword shuffling using a control mask (constant) is another group of SIMD operations that produces very valuable transformations of the input data. Each byte in the shuffle control mask (constant) forms an index to permute the corresponding byte in the destination operand.
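
As an illustration of a byte shuffle driven by a control mask, here is a small Python model. It assumes the 128-bit SSSE3 PSHUFB behavior (the low 4 bits of each mask byte select a source byte, and a set high bit zeroes the destination byte); the function name and sample data are only for this sketch.

```python
def pshufb(src, mask):
    """Model of a 16-byte shuffle: dst[i] = 0 if mask[i] has bit 7 set,
    otherwise src[mask[i] & 0x0F]."""
    assert len(src) == 16 and len(mask) == 16
    return [0 if m & 0x80 else src[m & 0x0F] for m in mask]

src = list(range(0x10, 0x20))            # bytes 0x10 .. 0x1F, index 0 is the lowest byte

reverse_mask = list(range(15, -1, -1))   # byte reversal
broadcast_mask = [0] * 16                # copy byte 0 everywhere
zero_mask = [0x80] * 16                  # clear every byte

print([hex(b) for b in pshufb(src, reverse_mask)])
print([hex(b) for b in pshufb(src, broadcast_mask)])
print(pshufb(src, zero_mask))
```

The same source data gives a reversal, a broadcast, or a cleared register depending only on the constant, which is why choosing the mask matters so much.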

Generating appropriate constants for bitwise, bytewise, wordwise, doublewordwise, quadwordwise, shuffle, and other SIMD operations is the most important step in implementing parallel SIMD software.

I would say that we are extending our visual tool to the NVIDIA GPU architecture. Simulation and visualization can help in understanding how data moves through the processor, which is even more important for NVIDIA multiprocessors. Certain memory access patterns can be discovered that speed up GPU processing considerably. Some of them are counterintuitive and can be discovered only through automated benchmarks.

It is possible to emulate some SIMD operations on the NVIDIA platform. Some SIMD instructions are very useful.

Simulation and visualization are the first step toward understanding what happens in the CPU and GPU, so we extended the tool to cover both.

I also believe the CPU and GPU have to be used to their full potential and have to work together. Not everything is well suited to the CPU, and not everything is well suited to the GPU. The optimal solution is to mix the best features of each.

I am attaching several screenshots showing examples of the PSHUFD instruction with different immediate constants, each of which effectively creates a new operation:


Imm=00 00 00 00 is in practice a Broadcast of the 1st DWord to 4 DWords

Imm=03 02 01 00 is in practice a Copy of 4 DWords to 4 DWords

Imm=02 03 00 01 is in practice a Swap of HL DWords

Imm=02 00 03 01 is in practice a Grouping of DWords by HL

Imm=02 00 03 01 is in practice a Grouping of DWords by LH

Imm=03 03 02 01 is in practice a Shift Right of DWords
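
For readers without the screenshots, the effect of these immediates can be reproduced with a small Python model of PSHUFD. The sketch assumes the four fields are written from the highest destination DWord to the lowest, which matches 03 02 01 00 (0xE4) being the plain copy; the helper names are only for this example.

```python
def imm8_from_fields(text):
    """Pack four 2-bit fields written high to low, e.g. "03 02 01 00" -> 0xE4."""
    imm8 = 0
    for field in text.split():
        imm8 = (imm8 << 2) | int(field)
    return imm8

def pshufd(src, imm8):
    """Model of PSHUFD: destination DWord i takes source DWord
    number (imm8 >> 2*i) & 3. Index 0 is the lowest DWord."""
    return [src[(imm8 >> (2 * i)) & 0b11] for i in range(4)]

src = ["d0", "d1", "d2", "d3"]
for fields in ("00 00 00 00", "03 02 01 00", "02 03 00 01",
               "02 00 03 01", "03 03 02 01"):
    imm8 = imm8_from_fields(fields)
    print(fields, hex(imm8), pshufd(src, imm8))
```

The results are printed lowest DWord first, so the shift-right case shows up as d1, d2, d3, d3.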

There are many possible permutations, so you can enter the desired input and the desired output, then run the visual simulator to find the appropriate immediate constant. This input/output mapping and automatic permutation generation turns out to be particularly interesting for NVIDIA GPUs because there are many more possibilities there.
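
The input/output search can be sketched as a brute-force scan, since PSHUFD has only 256 possible immediates. This is a minimal Python illustration and not the simulator's own method; the function names are hypothetical.

```python
def pshufd(src, imm8):
    """Model of PSHUFD: destination DWord i takes source DWord
    number (imm8 >> 2*i) & 3. Index 0 is the lowest DWord."""
    return [src[(imm8 >> (2 * i)) & 0b11] for i in range(4)]

def fields_from_imm8(imm8):
    """Format an immediate as four 2-bit fields, highest first, e.g. "02 03 00 01"."""
    return " ".join(f"{(imm8 >> shift) & 0b11:02d}" for shift in (6, 4, 2, 0))

def find_immediates(src, wanted):
    """Return every immediate that maps src to wanted (empty list if none does)."""
    return [imm8 for imm8 in range(256) if pshufd(src, imm8) == list(wanted)]

src = [10, 20, 30, 40]
swap_pairs = find_immediates(src, [20, 10, 40, 30])
print([fields_from_imm8(i) for i in swap_pairs])

# An output that contains a value not present in the input cannot be produced.
print(find_immediates(src, [10, 20, 30, 99]))
```

When the input DWords are all distinct there is at most one matching immediate; with repeated values several immediates give the same output, and a cost metric could then pick among them.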

thstart: Sounds very interesting!! Is your patent application published? If so, can you send me the application number? I am sure it has more details.


Better to contact me directly; this is indeed a little off-topic.

Sorry guys, but this is completely off-topic…