An idea for speeding up computer programs and avoiding waiting ("event-driven memory system").

Here is an idea:

  1. The computer program issues load and store requests to the memory system.

Instead of only specifying the memory address, the computer program would also specify the instruction pointer.

  2. The memory system fetches the memory located at the memory address and returns it, but it also returns the instruction pointer.

  3. Because the instruction pointer is returned, the computer program can go back to the instruction that was “stalled” or “skipped”.

  4. This would allow the computer program to issue memory requests without having to wait for them. The computer program would jump back to the instruction location once its memory
    has been retrieved. In the meantime the computer program could simply move on with executing other instructions and potentially retrieving other memory locations.
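The request/response protocol in the steps above can be pictured with a small simulation. This is only a sketch under assumptions: the class name `EventDrivenMemory`, the per-address latency table and the cycle counter are all made up for illustration and are not part of the original proposal.

```python
import heapq


class EventDrivenMemory:
    """Toy memory that answers loads later and echoes back the requester's tag."""

    def __init__(self, contents, latency):
        self.contents = contents      # address -> value
        self.latency = latency        # address -> cycles until the data arrives (assumed)
        self.pending = []             # min-heap of (ready_cycle, seq, ip, reg, address)
        self.seq = 0                  # tie-breaker for requests ready in the same cycle

    def issue_load(self, now, ip, reg, address):
        """Queue a request; the caller does not wait for the result."""
        ready = now + self.latency.get(address, 1)
        heapq.heappush(self.pending, (ready, self.seq, ip, reg, address))
        self.seq += 1

    def completions(self, now):
        """Yield (ip, reg, value) for every request that has finished by `now`."""
        while self.pending and self.pending[0][0] <= now:
            _, _, ip, reg, address = heapq.heappop(self.pending)
            yield ip, reg, self.contents[address]


mem = EventDrivenMemory(contents={0x10: 7, 0x20: 9}, latency={0x10: 5, 0x20: 2})
mem.issue_load(now=0, ip=100, reg="r1", address=0x10)   # slow load
mem.issue_load(now=0, ip=101, reg="r2", address=0x20)   # fast load

events = []
for cycle in range(1, 7):
    for ip, reg, value in mem.completions(cycle):
        events.append((cycle, ip, reg, value))   # the program could now jump back to `ip`
print(events)   # [(2, 101, 'r2', 9), (5, 100, 'r1', 7)]
```

The load issued second completes first, and because its instruction pointer comes back with the data, the program knows which instruction to return to.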

For example:

Load
Load
Load
Load
Load
Load

^ Each of these instructions specifies a memory address, passes the instruction pointer along to main memory, and also names a register to load the memory into.

The question is now:

What to do with other instructions that depend on the registers being “live” / “filled” with data?

Well, perhaps these instructions could all check a “flag” inside the register.

If the register is “live” the instructions get executed. If the register is “dead” (/“stalled”) the instruction is skipped.
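A minimal sketch of the live/dead flag, assuming a made-up instruction format of `(dest, sources, op)` and a helper `run` that returns the indices it had to skip. None of these names come from the original post.

```python
class Register:
    def __init__(self):
        self.value = None
        self.live = False   # "dead" until a load (or an instruction) fills it


def run(program, regs, indices):
    """Execute the given instruction indices; skip those with a dead source.

    Returns the list of indices that were skipped.
    """
    skipped = []
    for index in indices:
        dest, sources, op = program[index]
        if all(regs[s].live for s in sources):
            regs[dest].value = op(*(regs[s].value for s in sources))
            regs[dest].live = True
        else:
            skipped.append(index)
    return skipped


regs = {name: Register() for name in ("r1", "r2", "r3", "r4")}
program = [
    ("r3", ("r1",), lambda a: a + 1),   # needs r1
    ("r4", ("r2",), lambda b: b * 2),   # needs r2
]

# Pretend only the load into r2 has completed so far.
regs["r2"].value, regs["r2"].live = 21, True
first = run(program, regs, range(len(program)))   # instruction 0 is skipped

# Later the load into r1 arrives; only the skipped work is revisited.
regs["r1"].value, regs["r1"].live = 4, True
second = run(program, regs, first)
print(first, second, regs["r3"].value, regs["r4"].value)   # [0] [] 5 42
```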

This would allow a programmer to write a lot of semi-parallel code which does or does not get executed depending on whether the memory has come in…

The program would then simply jump back at appropriate times… when the memory comes in… like an interrupt handler or like an exception handler.

So this is probably a totally new instruction flow idea.

Finally, if the memory loading takes so long that the program cannot do anything useful anymore, then the programmer could issue a wait:

Like so:

wait

The processor would then wait either for all loads to complete or for some other interrupt to occur.
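A rough model of what `wait` would do, covering only the "all loads complete" case (interrupts are left out). The `LoadQueue` class and its cycle-based clock are assumptions made for this sketch.

```python
class LoadQueue:
    """Tracks outstanding loads by the cycle at which their data arrives."""

    def __init__(self):
        self.ready_at = {}   # register name -> arrival cycle

    def issue(self, reg, now, latency):
        self.ready_at[reg] = now + latency

    def retire(self, now):
        """Remove and return the registers whose data has arrived by `now`."""
        done = sorted(r for r, t in self.ready_at.items() if t <= now)
        for r in done:
            del self.ready_at[r]
        return done

    def wait(self, now):
        """Model of `wait`: advance the clock until no load is outstanding."""
        if self.ready_at:
            now = max(now, max(self.ready_at.values()))
            self.retire(now)
        return now


queue = LoadQueue()
queue.issue("r1", now=0, latency=30)
queue.issue("r2", now=0, latency=80)

now = 10                            # ten cycles of independent work were available
arrived_early = queue.retire(now)   # nothing has come in yet
now = queue.wait(now)               # out of work: block until both loads are done
print(arrived_early, now)           # [] 80
```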

So another, really short and abstract way to describe this system would be:

“An event driven memory system”.

Bye,
Skybuck.

(I also posted this on Usenet!)

Great idea! You can even optimize away the program counter juggling. Just continue executing instructions from the current PC until one is encountered that needs the results from the previous memory access.
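The suggestion above (keep going until an instruction actually needs the loaded value) can be compared against blocking at the load itself with a tiny cycle count. This is a sketch with assumed numbers: one cycle per instruction, a 100-cycle load latency, and a made-up program format.

```python
def run(program, stall_on_use):
    """Return the cycle count for an in-order run of `program`.

    Each entry is ("load", dest, latency) or ("alu", dest, sources).
    """
    ready_at = {}   # register -> cycle at which its value becomes available
    cycle = 0
    for kind, dest, arg in program:
        if kind == "load":
            ready_at[dest] = cycle + arg
            if stall_on_use:
                cycle += 1                  # issue the load and move on
            else:
                cycle = ready_at[dest]      # block right at the load
        else:
            # Wait only if a source register is still in flight.
            start = max([cycle] + [ready_at.get(s, 0) for s in arg])
            cycle = start + 1
            ready_at[dest] = cycle
    return cycle


program = [
    ("load", "r1", 100),
    ("load", "r2", 100),
    ("alu", "r3", ("r0",)),        # independent of both loads
    ("alu", "r4", ("r1", "r2")),   # first real use of the loaded data
]

blocking = run(program, stall_on_use=False)
overlapped = run(program, stall_on_use=True)
print(blocking, overlapped)   # 202 102
```

In this toy model the two loads overlap and the independent instruction runs for free, which is where the saving comes from.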

Hmm,

One possible problem with this idea is the following:

load
load
load
load * hit
load
load
load

some other instruction
some other instruction
some other instruction
some other instruction * proceed
some other instruction
some other instruction

The * hit indicates that its memory has arrived. Now the problem is all the
other instructions. With this sequential programming pattern the processor would need
to skip over all the other instructions to finally arrive at * proceed.

That’s a lot of wasteful skipping/no-operations.

One possible solution could be to allow the programmer to specify which
instruction to execute next:

load register 4 proceed some other instruction 4.

But this is probably pushing it a bit.
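The "load ... proceed ..." form above amounts to attaching a continuation target to each load, so that a completion jumps straight to the code that uses the data instead of skipping over everything in between. A sketch, with the handler table and all names invented for illustration:

```python
def run_with_continuations(arrivals, handlers):
    """Handle load completions in arrival order, jumping straight to each handler.

    `arrivals` holds (cycle, register, value, target) tuples, where `target`
    names the instruction block to run once that register is filled.
    """
    trace = []
    for cycle, register, value, target in sorted(arrivals, key=lambda a: a[0]):
        trace.append((cycle, target, handlers[target](value)))
    return trace


handlers = {
    "scale": lambda v: v * 10,     # hypothetical continuation for register 4
    "negate": lambda v: -v,        # hypothetical continuation for register 2
}
arrivals = [
    (40, "r2", 5, "negate"),
    (15, "r4", 3, "scale"),        # the "* hit": arrives first, handled first
]
trace = run_with_continuations(arrivals, handlers)
print(trace)   # [(15, 'scale', 30), (40, 'negate', -5)]
```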

So maybe it is better to split up the program into separate pieces, that way
the programmer only has to program one piece/kernel like in CUDA.

Which automatically gets duplicated/repeated and so forth.

But then the problem is: even single threads stall.

The idea is to let execution continue past the load instruction even while no data is present,
to hopefully execute something else.

So all those loads and some other instructions could also be replaced by
some calculations or so…

So perhaps there is some merit in this idea.

Bye,
Skybuck.

Perhaps even a new branching instruction like so, pseudocode idea:

if load then
begin
  perform operation on loaded data
end else
begin
  do something else while loading.
end;

The if branch would execute if the load completed.

The else branch would execute if the load is still pending.
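The "if load then ... else ..." branch behaves like polling a pending result. A minimal software analogue, assuming a made-up `PendingLoad` class and a simple cycle counter in place of real hardware:

```python
class PendingLoad:
    """Stand-in for a load in flight; the data arrives at `ready_cycle`."""

    def __init__(self, value, ready_cycle):
        self._value = value
        self._ready_cycle = ready_cycle

    def poll(self, now):
        """Return (True, value) once the data has arrived, else (False, None)."""
        if now >= self._ready_cycle:
            return True, self._value
        return False, None


load = PendingLoad(value=42, ready_cycle=3)
other_work = 0
result = None
cycle = 0
while result is None:
    done, value = load.poll(cycle)
    if done:
        result = value * 2      # "perform operation on loaded data"
    else:
        other_work += 1         # "do something else while loading"
    cycle += 1
print(result, other_work)       # 84 3
```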

Bye,
Skybuck.

Yeah, I think I already tried this idea, but I was surprised CUDA didn’t seem to do this… maybe I didn’t try it correctly, or maybe there was a bottleneck in the system.

But it could also be that CUDA simply doesn’t support it… it seems to simply stall on the load even when the following instructions don’t depend on it :(

I think I now understand why CUDA seems to stall on each load (I am not 100% sure, but it looks like stalling).

It’s probably because of its desired “single instruction multiple data” behaviour.

CUDA wants to wait until all 32 threads have had their memory request fulfilled/their register loaded.

Only then can it continue with its single instruction on “multiple data”.

However… if all threads are stalled, there is no reason why all threads could not simply continue with the next instruction if it does not depend on the previous load, or if the “alive” idea is used.
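The argument above can be put in numbers with a toy model of 32 lanes running in lockstep. The arrival times are invented, and this says nothing about what real CUDA hardware actually does; it only illustrates the reasoning.

```python
WARP_SIZE = 32


def next_issue_cycle(arrival_cycles, now, depends_on_load):
    """Earliest cycle at which a lockstep group can issue its next instruction."""
    if depends_on_load:
        # Every lane needs its data, so the slowest lane sets the pace.
        return max(now, max(arrival_cycles))
    return now


# Assumed per-lane arrival times: lanes hit four different latencies.
arrivals = [100 + (lane % 4) * 10 for lane in range(WARP_SIZE)]

dependent = next_issue_cycle(arrivals, now=1, depends_on_load=True)
independent = next_issue_cycle(arrivals, now=1, depends_on_load=False)
print(dependent, independent)   # 130 1
```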

So again, CUDA could continue.

I also had a similar idea for SSE: an instruction which did multiple random access loads with a single instruction. (I never published it though, but there ya go.)