Innova-2 Flex for AI?

Can anyone let me know if it is possible to implement any AI project (like Vitis AI or any other DPU) on the Innova-2 Flex?

Xilinx’s DPUs require an external memory interface (DDR or HBM) and are only available as Encrypted RTL so they cannot be altered.

I have not had time to finish it, but I have started adding support for the Innova-2 to the vivado-risc-v project. It includes the ability to generate RISC-V cores with the Gemmini Accelerator. It should be possible to fit a very small model and bare-metal RISC-V code within the XCKU15P's URAM. The XCKU15P has roughly 70.6 Mbit (~8.8 MBytes) of combined Block RAM and UltraRAM, but you will at most be able to cascade one of the four URAM columns. With some effort you could create a proof-of-concept.

For the limited resources of the MNV303611A-EDLT I recommend you explore small embedded projects and older Neural Network to FPGA mapping research: 1, 2, 3, 4.

Can PyTorch and Transformers run on Gemmini?
Is it possible to run Gemmini on MNV303611A-EDLT with CPU and RAM on the localhost instead of the RISC-V?

Can PyTorch and Transformers run on Gemmini?

Yes, for inference. See also 1, 2.

run Gemmini … instead of the RISC-V?

Gemmini is an accelerator for RISC-V. It gets implemented as part of your RISC-V core and relies on it.

Is it possible to run Gemmini on MNV303611A-EDLT with CPU and RAM on the localhost

Good question. Gemmini relies on its processor’s internals through the RoCC interface, see Pg40. If throughput and latency are not an issue, then a RoCC-to-AXI bridge plus a customized XDMA driver and software should allow testing RoCC accelerators on the MNV303611A-EDLT. I do not believe such a bridge exists, since implementing the full system (core, accelerator, and memory) in the FPGA gives better performance estimates anyway.
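For context, a RoCC accelerator is driven by custom instructions that execute inside the core's pipeline, which is why Gemmini cannot simply be detached from its RISC-V core. As a rough illustration only, this is roughly what issuing a RoCC custom instruction from RISC-V software looks like; the funct values below are placeholders, not Gemmini's actual encodings:

#include <stdint.h>

// Illustrative sketch: issue a generic RoCC custom instruction using the
// assembler's ".insn r" directive. 0x7B is the custom-3 opcode; the
// funct3/funct7 values are placeholders, not real Gemmini encodings.
static inline uint64_t rocc_example(uint64_t rs1, uint64_t rs2)
{
    uint64_t rd;
    asm volatile(".insn r 0x7B, 0x7, 0x0, %0, %1, %2"
                 : "=r"(rd)
                 : "r"(rs1), "r"(rs2));
    return rd;
}

The accelerator sits beside the core's pipeline and shares its memory system, so there is no clean point at which to cut it out and drive it over PCIe without a bridge like the one described above.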

Machine learning relies heavily on memory performance so all the latest accelerators depend on High-Bandwidth Memory (HBM).

If you were thinking of a simpler system that connects host system memory to a RISC-V with Gemmini, note that PCIe is host-centric. Host software would need to manage the RISC-V’s accesses to memory. On the other hand, here is a post that suggests user logic can initiate transfers.

Your best bet for a proof-of-concept project on the MNV303611A-EDLT is a small model that would also work on a microcontroller. TensorFlow Lite for Microcontrollers supports RISC-V:

The XCKU15P has 128 blocks of UltraRAM (4K x 72-bit = 288 Kbit each) for 36 Mbit total:

[Image: XCKU15P resources]

Opening the Implementation view of a project, it turns out all of the UltraRAM is in a single column, X0:

That means it should be possible to cascade all of it: 128 x 4,096 x 72 bits = 37,748,736 bits ≈ 36 Mbit ≈ 4.5 MBytes.

[Image: UltraRAM cascading]

Some example projects I came across: 1, 2, 3

If you are using ucb-bar/chipyard, there is an error with ./build-setup.sh riscv-tools for 1.9.x. Try 1.8.1:

git clone https://github.com/ucb-bar/chipyard.git
cd chipyard
git checkout 1.8.1

./build-setup.sh riscv-tools

I have successfully built the Gemmini hardware. Now I want to put it in Vivado to see the block design. Can you guide me on how to do that?

Your question is too vague. Be specific about what you have done.

Gemmini is an accelerator add-on for processors. Did you build a RISC-V core with a Gemmini Accelerator? Which framework did you use?

If I were to attempt this, I would first edit an existing example design for the MNV303611A-EDLT. My first goal would be to communicate with a RISC-V core via JTAG or some other debugger interface. chipyard has a demo for the Arty. My next goal would be to run a bare-metal demo on the RISC-V core implemented in the FPGA. Only then would I attempt to rebuild the design to include a Gemmini Accelerator.

I have a partially working RISC-V design for the MNV303212A-ADLT that you could attempt to re-target. Source the design in Vivado and replace the DDR4 with a Block Memory Generator; 2MB will work. If you succeed, you will later be able to generate a RISC-V+Gemmini system using vivado-risc-v.

put it in Vivado to see the block design

Look into creating custom IP blocks and creating custom AXI peripherals.

I want to build an FPGA project that accelerates a large language model which requires more than 64GB of RAM. Since no FPGA board has that much RAM, I think I should use the RAM and CPU on the host and interact with Gemmini on FPGA through the RoCC interface. I have successfully built Gemmini hardware using this command:

cd chipyard/generators/gemmini
./scripts/build-verilator.sh

and failed to create a Vivado IP with the following steps:

Vivado > Tools > Create and Package New IP > Package a specified directory > chipyard/generators/gemmini/generated-src/verilator/chipyard.TestHarness.CustomGemminiSoCConfig/gen-collateral

[Screenshot of the failed IP packaging attempt]

requires more than 64GB of RAM … no FPGA board

The Alveo U200 has 64GB of DDR4 and DPUCADF8H supports it. Also consider AWS F1 Instances.

I think I should use the RAM and CPU on the host and
interact with Gemmini on FPGA through the RoCC interface

When you tested the demo project, you got complete transfer times on the order of 100,000 nanoseconds:

** Avg time device /dev/xdma0_c2h_0, total time 163964 nsec,
** Avg time device /dev/xdma0_c2h_0, total time 107604 nsec,
** Avg time device /dev/xdma0_h2c_0, total time 118067 nsec,

This is due to PCIe latency and software/driver overhead. PCIe bandwidth is high but its latency is not great. DDR4 accesses complete on the order of 100 ns, roughly a thousand times faster. Your system would be very slow.
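If you want to measure this yourself, here is a minimal sketch that times a single small read over XDMA. It assumes the demo design's /dev/xdma0_c2h_0 device and an AXI BRAM mapped at address 0; adjust both for your design:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    // Assumption: the demo design's C2H device and an AXI BRAM at address 0
    int fd = open("/dev/xdma0_c2h_0", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint8_t buf[4];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t rc = pread(fd, buf, sizeof(buf), 0);   // one small PCIe round trip
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (rc < 0) { perror("pread"); close(fd); return 1; }

    long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
    printf("4-byte read over XDMA took %ld ns\n", ns);

    close(fd);
    return 0;
}

Compare the number it prints against the ~100 ns you would expect from local DDR4.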

I have successfully built Gemmini hardware using this command:

cd chipyard/generators/gemmini
./scripts/build-verilator.sh

You built a Gemmini system for the Verilator simulator. This is a much better idea than trying to use the MNV303611A-EDLT. Simulate your software running on RISC-V+Gemmini. Simulation should be the first step in hardware design.

I got some information when simulating the software:

(/mnt/Archive/Downloads/chipyard/.conda-env) notooth@192:chipyard$ cd sims/verilator

(/mnt/Archive/Downloads/chipyard/.conda-env) notooth@192:verilator$ make

(/mnt/Archive/Downloads/chipyard/.conda-env) notooth@192:verilator$ ./simulator-chipyard-RocketConfig $RISCV/riscv64-unknown-elf/share/riscv-tests/isa/rv64ui-p-simple
...
This emulator compiled with JTAG Remote Bitbang client. To enable, use +jtag_rbb_enable=1.
Listening on port 44173
[UART] UART0 is here (stdin/stdout).

(/mnt/Archive/Downloads/chipyard/.conda-env) notooth@192:verilator$ make run-binary BINARY=$RISCV/riscv64-unknown-elf/share/riscv-tests/isa/rv64ui-p-simple
...
[UART] UART0 is here (stdin/stdout).

(/mnt/Archive/Downloads/chipyard/.conda-env) notooth@192:verilator$ make run-binary BINARY=test.riscv LOADMEM=testdata.hex LOADMEM_ADDR=81000000
...
[UART] UART0 is here (stdin/stdout).
make: *** [/mnt/Archive/Downloads/chipyard/common.mk:304: run-binary] Error 255

This is a ucb-bar/chipyard Issue. Try running their Docker Image.

I want to build an FPGA project that accelerates a large
language model which requires more than 64GB of RAM.

Attempting to use the MNV303611A-EDLT to run an LLM faster than a CPU can would be a massive project, and the likelihood of success is low.

Your board shows up as PCIe device 17:00 and you mention 64GB, so you are probably using a server with plenty of memory. Consider quantizing the model and figuring out how to speed it up using CPU Optimizations.

The MNV303611A-EDLT can help you design a proof-of-concept or prototype but I do not see how you can accelerate an LLM with it. If CPU performance is not good enough you should explore GPUs first before trying to use FPGAs.

Can you give me an example of C code that waits for a result of computing on the FPGA after sending it input data?

Can you give me an example of C code that waits for a
result of computing on the FPGA after sending it input data?

How this should be achieved will depend heavily on the design of your system. Eventually you should use XDMA Interrupts, but as a first step to test your ideas, try busy-waiting.
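For reference, the interrupt-driven version eventually reduces to a blocking read on one of the XDMA events devices. A sketch, assuming your design drives usr_irq_req[0] when it finishes and the driver exposes it as /dev/xdma0_events_0:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    // Assumption: the FPGA design raises usr_irq_req[0] when its result is ready
    int fd = open("/dev/xdma0_events_0", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint32_t events = 0;
    // This read blocks until the user interrupt fires
    ssize_t rc = read(fd, &events, sizeof(events));
    if (rc < 0) { perror("read"); close(fd); return 1; }

    printf("FPGA raised its interrupt (%u event(s))\n", events);
    close(fd);
    return 0;
}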

Please take a step back and try running a simple test. The innova2_mnv303611a_xcku15p_xdma demo project creates four sets of device files for accessing the design’s AXI bus:

/dev/xdma0_h2c_0
/dev/xdma0_c2h_0
/dev/xdma0_h2c_1
/dev/xdma0_c2h_1
...

Create two copies of the xdma_test.c demo and run each in a separate Terminal with a different set of /dev/xdma0_... files. Modify each to simulate a system where one thread writes to the AXI BRAM and the other waits for it to finish and then sends new data to the BRAM. Use a simple flag and look into locks, 1.

Once you have that working, source the design in Vivado and modify it by adding a MicroBlaze soft processor without external memory. Connect it to the AXI BRAM and then write a simple bare-metal C program for it to test the concept. References: 1, 2, 3, 4.

[Image: sourcing the project Tcl in Vivado]

This may seem like a step back from your eventual goal but it will be a lot easier to find help for Vivado+MicroBlaze and all the concepts will apply later to getting RISC-V+Gemmini working.
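To make the MicroBlaze side concrete, here is a rough bare-metal sketch of the same flag protocol. The BRAM controller macro name is an assumption; use whatever xparameters.h defines for your block design:

#include "xil_io.h"
#include "xparameters.h"

// Assumed instance name; check xparameters.h for your design
#define BRAM_BASE  XPAR_AXI_BRAM_CTRL_0_S_AXI_BASEADDR

int main(void)
{
    // Busy-wait until the host writes the flag value 42 to the first BRAM byte
    while (Xil_In8(BRAM_BASE + 0) != 42) { }

    // Signal completion by writing a result byte the host can poll for
    Xil_Out8(BRAM_BASE + 1, 0xAB);

    while (1) { }
    return 0;
}

The host-side copies of xdma_test.c would poll that result byte over /dev/xdma0_c2h_... the same way they poll the flag.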

I am not building RISC-V+Gemmini. I am just building a floating-point computing module on FPGA.

Modify each to simulate a system where one thread writes to the AXI BRAM and the other waits for it to finish and then sends new data to the BRAM. Use a simple flag and look into locks, 1.

How can I access a flag of a C program from a separate Terminal?

How can I access a flag of a C program from a separate Terminal?

A flag is some location in memory you use to store the state of something. You arbitrarily decide on its location and value.

Create two copies of xdma_test.c and come up with a simple communication protocol between them. Treat them as separate threads. For example, replace the AXI GPIO code sections with:

xdma_test1.c:

...
   printf("Waiting for BRAM Address 0 to equal 42:\n");

   while(1)
   {
      rc = pread(xdma_fd_read, (char *)read_data, 1, axi_bram_addr);
      if (rc < 0) {
         fprintf(stderr, "%s, read data @ 0x%lX failed, %ld.\n",
            xdma_c2h_name, axi_bram_addr, rc);
         perror("File Read");
         return -EIO;
      }
   
      // stderr is not line buffered so will print immediately
      fprintf(stderr, ".");
      usleep(10000);
      
      if (read_data[0] == 42) { break; }

   }

   printf("\nSuccess! - Read 42 from the first BRAM byte.\n");
...

xdma_test2.c:

...
   wrte_data[0] = 42;

   printf("Writing 42 to BRAM Address 0\n");

   // Write 1 byte using pwrite, which combines lseek and write
   rc = pwrite(xdma_fd_wrte, (char *)wrte_data, 1, axi_bram_addr);
   if (rc < 0) {
      fprintf(stderr, "%s, write byte @ 0x%lX failed, %ld.\n",
         xdma_h2c_name, axi_bram_addr, rc);
      perror("File Write");
      return -EIO;
   }


   printf("Wrote 0x%02X to %s at address 0x%lX",
      wrte_data[0], xdma_h2c_name, axi_bram_addr);
...

Then run each program in a separate Terminal. The program running in one terminal busy-waits for the program running in the other to write a specific value to a specific location in the FPGA’s AXI BRAM.

I am just building a floating-point computing module on FPGA

How you communicate with it will depend on how it is implemented.

Here is a tutorial on adding an FFT core to a Vivado block design. It is for an older version of Vivado so everything looks different, but the key concepts are there, and the code to communicate with the FFT block is in Python.