Computational Memory Concept

Hello, I propose a completely new technology that speeds up calculations by a factor of billions.

Parallel programming

A single address space is given: a large two-dimensional (in the future, three-dimensional) square or rectangular matrix of identical 64-bit (or 128-, 256-, 512-bit, etc.) numbers.

X X X X X X


Y|64 64 64 64 64 64
Y|64 64 64 64 64 64
Y|64 64 64 64 64 64
Y|64 64 64 64 64 64
Y|64 64 64 64 64 64

For each memory cell, logical and arithmetic operations are possible, as well as comparison operations with any adjacent matrix cell or with several adjacent cells.

Moreover, each cell can interact in this manner not only with neighboring cells but also with any other cell or group of cells in the matrix; it is enough to give their addresses.

You do not need hardware to be convinced that the technology has a bright future; for starters, you can use a simple two- or three-dimensional C++ array with parallel access using CUDA.
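As a rough illustration of the software prototype suggested above, here is a minimal C++ sketch of such a cell matrix. The names `CellMatrix` and `add_right_neighbour` are hypothetical and not part of the proposal, and plain sequential loops stand in for the parallel access that a CUDA kernel would provide. The step writes into a second matrix so that every cell reads its neighbour's value from before the step, which is what a simultaneous update of all cells would require.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical software model of the proposed matrix of 64-bit cells.
struct CellMatrix {
    std::size_t rows, cols;
    std::vector<double> cells;  // row-major storage, one double per cell

    CellMatrix(std::size_t r, std::size_t c, double init = 0.0)
        : rows(r), cols(c), cells(r * c, init) {
        assert(r > 0 && c > 0);
    }

    double& at(std::size_t y, std::size_t x) { return cells[y * cols + x]; }
    double at(std::size_t y, std::size_t x) const { return cells[y * cols + x]; }
};

// One synchronous step: every cell adds its right-hand neighbour, wrapping
// around at the row end. Reads come from `in` and writes go to a fresh
// matrix, so all cells observe the values from before the step.
CellMatrix add_right_neighbour(const CellMatrix& in) {
    CellMatrix out(in.rows, in.cols);
    for (std::size_t y = 0; y < in.rows; ++y)
        for (std::size_t x = 0; x < in.cols; ++x)
            out.at(y, x) = in.at(y, x) + in.at(y, (x + 1) % in.cols);
    return out;
}
```

Each loop iteration is independent of the others, so the iterations could in principle run concurrently. Whether that yields any speedup on real hardware depends on memory bandwidth, not on the programming model.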

From Wikipedia:

In computer science, in-memory processing (PIM) is a computer architecture in which data operations are available directly on the data memory, rather than having to be transferred to CPU registers first.[1] This may improve the power usage and performance of moving data between the processor and the main memory.

As far as I am aware, the Mellanox division of NVIDIA has been looking into in-memory processing for about a decade. I seem to recall that Micron and IBM were looking into it as well, but my memory is hazy.

In my concept, it is possible to manage billions of cells simultaneously in parallel. This has never happened anywhere before. This is extraterrestrial technology.

I am afraid CUDA cannot be based on extraterrestrial technology, and this is not the appropriate forum to discuss it.

And in my opinion, this is the best forum for discussing extraterrestrial technologies.

According to my calculations, such a supercomputer will fit in the palm of your hand, cost a penny, and be millions of times faster than all the computers currently operating on the planet!

Rough programming sketch

We have memory cells (for simplicity, we assume they are all float64, i.e. double).

new A1,B1,C1,D1,D3,D222

extern _D

Filling cells with specific values

INIT

A1C1=0,A1C2=1,A1C3=1,
A1C4=1, A1=678, A1C6=98765,
A1C7=0.5 ,A1C8=0.4,A1C9=67.678954

We continue to fill in the cells

~etc~
B1C1=888, B1C2=777.07, B1C3,
~etc~
B1C4, B1, B1C6,
B1C7, B1C8, B1C9

C1C1, C1C2, C1C3,
C1C4, C1, C1C6,
C1C7, C1C8, C1C9

~etc~
D1C1,D1C2,D1C3,
D1C4, D1, D1C6,
D1C7,D1C8,D1C9

~etc~
D222C1,D222C2,D222C3,
D222C4, D222, D222C6,
D222C7,D222C8,D222C9

Operations on memory cells

RUN

A1+B1=:A1C1
A1+B1=:B1C2
A1+B1=:A1C1,A1C2,A1C3,A1C4,A1,A1C6,A1C7,A1C8,A1C9
A1/B1=:A1C1+B1C2,A1C3-B1C4*B1C6+B1C1,A1C4/A1C8+(B1C1+B1C2)*(B1C8/B1C7),B1C5-(A1C2+D1)
A1+B1=:A1C3,D1C1,_D(10,11)C1,A1C7,D3,B1C8
A1+B1=:A1C3+A1C4, A1C1+B1C8,D1+D3,A1C8+D3

A1<=B1 then A1:=B1, B1:=B1C1
A1==B1 then A1:=A1C1
A1!=B1 then B1:=D1C9/D222C4

A1(AND)B1C6
A1(XOR)B1C1
A1(OR)B1
(NOT)A1

Removing variables from memory

DONE

del A1,B1,C1,D1,D3,D222
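One possible reading of the sketch above, expressed in plain C++. This is an interpretation, not the author's definition: it assumes that a name such as A1 denotes the centre of a 3x3 block, that the suffixes C1 to C9 number the block row by row with position 5 being the centre, and that `X+Y=:d1,d2,...` means "compute once, store into every listed destination". The names `Block`, `store_all`, and `demo` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Hypothetical model: a named cell plus its eight neighbours as a 3x3 block.
struct Block {
    std::array<double, 9> c{};  // zero-initialised

    double& pos(std::size_t n) {  // n in 1..9, mirroring the C1..C9 suffixes
        assert(n >= 1 && n <= 9);
        return c[n - 1];
    }
    double& centre() { return c[4]; }  // written as plain "A1" in the sketch
};

// Models "X + Y =: d1, d2, ...": one result stored into every destination.
inline void store_all(double value, std::initializer_list<double*> dests) {
    for (double* d : dests) *d = value;
}

// Walks through "A1+B1=:A1C1,A1C2" and "A1<=B1 then A1:=B1".
inline double demo() {
    Block a, b;
    a.centre() = 678;  // A1=678, as in the INIT section
    b.pos(1) = 888;    // B1C1=888, as in the INIT section
    b.centre() = 1;    // assumed value; the sketch leaves B1 unset
    store_all(a.centre() + b.centre(), {&a.pos(1), &a.pos(2)});
    if (a.centre() <= b.centre()) a.centre() = b.centre();
    return a.pos(1);
}
```

In this reading each statement is an ordinary load, compute, and store sequence. The proposal's claim is that the hardware would execute such statements for all cells at once, which this sequential model does not demonstrate.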

So each memory cell would contain an ALU. How complex would your ALUs be? (Floating point? Multiplication? Division and special functions?) Find out the number of transistors needed for such an ALU and multiply by the billions of memory cells. Wikipedia gives 21,000 for a 32-bit integer multiplier (Transistor count - Wikipedia). Blackwell has 208 billion transistors. Being very generous, you would get 100 million ALUs out of that many transistors. Blackwell has around 30,000 CUDA cores plus a similar number of ALUs for the tensor cores.
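The estimate above can be checked with a few lines of arithmetic. The sketch below uses only the two figures quoted in the post; the constant and function names are hypothetical. Dividing the chip budget by the multiplier cost gives roughly 9.9 million units, so the 100 million figure is generous by about an order of magnitude, and that is before spending anything on storage, routing, or control.

```cpp
#include <cstdint>

// Figures quoted in the post. The per-unit count covers only a 32-bit
// integer multiplier; a full 64-bit floating-point ALU would need more.
constexpr std::uint64_t kChipTransistors = 208'000'000'000ULL;  // Blackwell
constexpr std::uint64_t kMultiplierTransistors = 21'000ULL;

// Upper bound on the number of units if every transistor on the chip went
// into them (no memory cells, no routing, no control logic).
constexpr std::uint64_t max_units(std::uint64_t chip, std::uint64_t per_unit) {
    return chip / per_unit;
}

static_assert(max_units(kChipTransistors, kMultiplierTransistors) == 9'904'761ULL,
              "about 9.9 million multipliers");
```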

The question is how many transistors you would accept spending on data routing and caching. In normal CPUs, that is by far the bulk of the transistors. You wrote something about indexing: “it is enough to indicate their address”. Do you want to connect all memory cells to each other? That would be N² connections. Or would there be one connection that the cells use one after another? Then your parallel programming model would be ‘single threaded’ again.
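To put a number on the all-to-all case: with one dedicated link per pair of cells, N cells need N(N-1)/2 links, which grows as N². The helper below is a hypothetical illustration, not a statement about any real interconnect.

```cpp
#include <cstdint>

// Dedicated point-to-point links needed so that each of n cells can reach
// every other cell directly: n * (n - 1) / 2, which grows as n squared.
// Valid for n up to about 4 billion before the product overflows 64 bits.
constexpr std::uint64_t full_mesh_links(std::uint64_t n) {
    return n * (n - 1) / 2;
}

static_assert(full_mesh_links(4) == 6, "four cells need six links");
// One billion cells would need about 5 * 10^17 links.
static_assert(full_mesh_links(1'000'000'000ULL) == 499'999'999'500'000'000ULL,
              "one billion cells");
```

For comparison, that link count is more than two million times the 208 billion transistors quoted for Blackwell.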

For many tasks, Nvidia GPUs are superior to FPGAs due to their good layered memory architecture and otherwise huge computational performance. In your model, you just ‘solve’ the data transfer for near operations, but not for operations involving operands far away.

Some of the new AI chips lean a bit further into your model compared to Nvidia GPUs. Have a look at their architectures.

I didn’t say anything about ALU. It’s closer to DRAM. I posted in the CUDA thread because there is no Compute Memory thread on the Nvidia forums!

And also because what I have in common with CUDA is that I created it in 2005 without receiving a penny from Nvidia!

That sounds like an ALU to me, especially when you talk about 64-bit numbers. ALU = Arithmetic Logic Unit. The only difference would be that the data is not transferred to a separate ALU. Would each memory cell need its own ALU?

Why should you have received a penny from Nvidia, if they have not implemented your idea?

The closest seems to be the texture unit, which e.g. can do interpolations. Also one can do atomic operations, which are processed closer to the L2 cache.

Early graphics cards could do operations on memory; e.g., all VGA graphics cards had different write modes (supporting, among others, ROR and logical operations).
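The VGA write path mentioned above can be modelled in a few lines. This is a simplified sketch for a single byte of a single plane: CPU data is rotated right and then combined with the value latched by the previous read. The set/reset and bit-mask stages of the real hardware are left out, and the names `LogicOp`, `rotate_right`, and `vga_write` are hypothetical.

```cpp
#include <cstdint>

// Logical function applied between the rotated CPU data and the latch.
enum class LogicOp : std::uint8_t { Replace, And, Or, Xor };

// Rotate an 8-bit value right by `count` bit positions (count taken mod 8).
constexpr std::uint8_t rotate_right(std::uint8_t v, unsigned count) {
    count &= 7u;
    return static_cast<std::uint8_t>((v >> count) | (v << ((8u - count) & 7u)));
}

// Value that ends up in video memory for one write, given the latch
// contents from the preceding read of the same address.
constexpr std::uint8_t vga_write(std::uint8_t cpu_data, std::uint8_t latch,
                                 unsigned rotate, LogicOp op) {
    const std::uint8_t d = rotate_right(cpu_data, rotate);
    switch (op) {
        case LogicOp::And: return static_cast<std::uint8_t>(d & latch);
        case LogicOp::Or:  return static_cast<std::uint8_t>(d | latch);
        case LogicOp::Xor: return static_cast<std::uint8_t>(d ^ latch);
        case LogicOp::Replace: break;
    }
    return d;
}

static_assert(vga_write(0xF0, 0x3C, 0, LogicOp::Xor) == 0xCC, "XOR with latch");
```

The point of the comparison is that the combine step happens on the memory side of the bus: the CPU issues one write and never sees the latched value.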

FPGAs combine compute operations with local memory and parallel processing.

Also, SIMD and vector processing are very old concepts (IIRC they were discussed in Charles Babbage’s and Ada Lovelace’s time, i.e. when computers were still mechanical).

Initial research into processing-in-memory (PIM) architectures dates to the early 1990s. It was certainly old hat by 2005. Discussions about it resurface periodically amid (not completely unfounded) claims that “we have hit the memory wall”. So far it has been a technology of the future for three decades running.

Nvidia fully realized my CUDA idea, becoming the third company in the world with a market capitalization of 2 trillion dollars.
Here we need to come up with some other ALUs using DRAM capacitors. We need ALUs using comparison operations and bitwise operations. Old equipment is no good.

Forgive my terrible English. I’m Russian, and I’m writing through Google Translate.

So you are a part of Venray Technologies, which created TOMI, a technology focusing on compute on DRAM technology in 2005? http://venraytechnology.com

Just trust me: CUDA is my idea. I came up with it alone. Just for fun!

Funny (both haha and peculiar). As someone who worked on CUDA when it first came into existence, I do not recall anyone by the name of Victor Konovalov on the team. Or any Russians, for that matter, given that this is highly unlikely to be OP’s real name.

If you don’t want to believe it, don’t. I am a God-given genius.

The purpose of this sub-forum is to provide for discussion on CUDA Programming and Performance. After watching this thread unfold for a while, I’ve come to the conclusion that it is off-topic. Please restrict future posts in this sub-forum to questions and discussion that pertain directly to technologies and capability enabled by the CUDA toolkit.

I acknowledge that there is no “Compute Memory” sub-forum. NVIDIA doesn’t intend to provide a forum for every imaginable topic here, not even every imaginable computing topic. The main purpose of these forums is to discuss NVIDIA products and technology.

Thank you.