Project Overview: FP4 E2M1 Arithmetic Unit
Target Process: SkyWater 130nm (SKY130)
Scope: RTL-to-GDSII Open Source Silicon Flow
Introduction
This project implements an Arithmetic Unit (AU) specialized for the FP4 E2M1 data format (4-bit floating point: 1 sign bit, 2 exponent bits, 1 mantissa bit). As AI models scale toward trillion-parameter architectures, the industry is shifting toward ultra-low-precision formats to reduce memory bottlenecks and power consumption. This design explores the feasibility of FP4 arithmetic using the SKY130 open-source PDK, taking the design from Verilog RTL all the way to a tape-out-ready GDSII layout.
Technical Specifications
The unit targets the FP4 E2M1 element format defined in the OCP Microscaling (MX) specification, which trades precision for a 4-bit footprint while keeping enough dynamic range, once combined with a shared block scale, to be useful for LLM inference quantization.
- Format Details: 1-bit Sign, 2-bit Exponent (Bias of 1), 1-bit Mantissa.
- Architecture: Fully combinational and pipelined versions for Dot-Product Engines.
- Special Features: Support for subnormal numbers, with NaN/Inf handling according to the MX standard. Note that the E2M1 element itself has no Inf or NaN encodings; MX signals NaN through the shared block scale.
- Toolchain: Developed using the OpenLane flow (Yosys for synthesis, OpenROAD for PnR, and Magic/Netgen for DRC/LVS).
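To make the format details concrete, here is a minimal Python sketch that decodes all sixteen E2M1 codes. It assumes the usual bit order (sign in bit 3, exponent in bits 2:1, mantissa in bit 0); the name `decode_e2m1` is hypothetical and is not taken from this project's RTL or golden model.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code into a Python float.

    Assumed layout: bit 3 = sign, bits 2:1 = exponent, bit 0 = mantissa.
    """
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal: no implicit leading one, effective exponent is 1 - bias = 0.
        magnitude = man * 0.5
    else:
        # Normal: implicit leading one, exponent bias of 1.
        magnitude = (1.0 + man * 0.5) * 2.0 ** (exp - 1)
    return sign * magnitude


# Magnitudes for codes 0..7; codes 8..15 are the same values with the sign set.
POSITIVE_VALUES = [decode_e2m1(c) for c in range(8)]
print(POSITIVE_VALUES)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The only subnormal magnitude is 0.5, and the largest finite magnitude is 6.0, since all four exponent codes are used for ordinary values.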
Hardware Implementation Results (SKY130)
Using the SKY130 high-density (HD) standard-cell library, this FP4 unit is expected to be considerably smaller than a comparable INT8 or FP16 unit. The figures below are estimates, not post-layout measurements:
| Metric | FP4 E2M1 Unit (Estimated) |
|---|---|
| Cell Count | ~150–200 gates per lane |
| Max Frequency | ~250 MHz (on SKY130) |
| Power Density | Optimized for low-switching activity |
| Area | < 0.01 $mm^2$ per core |
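As a rough sanity check on the area estimate, the sketch below converts a cell count into core area using the SKY130 HD placement site (0.46 µm wide, 2.72 µm tall). The average cell width of 8 sites and the 60% utilization are illustrative assumptions only, not numbers from this project's OpenLane reports, and `estimate_core_area_mm2` is a hypothetical helper.

```python
SITE_WIDTH_UM = 0.46   # sky130_fd_sc_hd placement site width
SITE_HEIGHT_UM = 2.72  # sky130_fd_sc_hd standard-cell row height


def estimate_core_area_mm2(cell_count: int,
                           avg_sites_per_cell: float,
                           utilization: float) -> float:
    """Back-of-envelope core area: total cell area divided by utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    cell_area_um2 = cell_count * avg_sites_per_cell * SITE_WIDTH_UM * SITE_HEIGHT_UM
    core_area_um2 = cell_area_um2 / utilization
    return core_area_um2 / 1e6  # 1 mm^2 = 1e6 um^2


# Assumed inputs: 200 cells, 8 sites per cell on average, 60% utilization.
area = estimate_core_area_mm2(200, 8, 0.6)
print(f"{area:.4f} mm^2")
```

Under these assumptions a single lane lands well below 0.01 $mm^2$. The estimate ignores tap cells, decoupling cells, fill, and power-grid overhead, so the post-layout figure should replace it once available.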
Why FP4 on SKY130?
While NVIDIA’s Blackwell architecture introduces hardware support for FP4, this project aims to provide the open-source community with a reference implementation. By targeting the SKY130 process, we demonstrate that specialized AI hardware can be designed and verified using entirely free and open tools, lowering the barrier for custom AI accelerator research.
Current Status
- RTL: Verified against a Python-based golden model.
- Synthesis: Completed using Yosys.
- Physical Design: GDSII generation in progress via OpenLane.
- Next Steps: Integration into a Systolic Array architecture for matrix multiplication.
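Because the input space is only 16 × 16 codes, a golden model can enumerate every case. The following is a sketch of what such a model might look like for multiplication, not the project's actual golden model. It assumes the result is rounded back to E2M1 with round-to-nearest-even and saturation at ±6.0; the function names are hypothetical, and a dot-product engine may instead keep products in a wider format.

```python
import itertools

# Magnitudes representable in E2M1, indexed by the low 3 bits of the code.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_magnitude(mag: float) -> int:
    """Return the 3-bit code nearest to mag (ties go to the code with LSB 0).

    Magnitudes above 6.0 saturate to code 7 because it is the nearest entry.
    """
    return min(range(8), key=lambda i: (abs(E2M1_MAGNITUDES[i] - mag), i & 1))


def fp4_mul(a: int, b: int) -> int:
    """Multiply two 4-bit E2M1 codes and return a 4-bit E2M1 code."""
    sign = (a ^ b) & 0x8  # result sign is the XOR of the operand signs
    mag = E2M1_MAGNITUDES[a & 0x7] * E2M1_MAGNITUDES[b & 0x7]  # exact in binary FP
    return sign | round_magnitude(mag)


# Exhaustive reference table: 256 (a, b) pairs for a testbench to compare against.
GOLDEN = {(a, b): fp4_mul(a, b) for a, b in itertools.product(range(16), repeat=2)}
```

A testbench can then drive all 256 operand pairs into the RTL and compare each output against `GOLDEN`. The rounding and saturation policy here is an assumption and should be matched to whatever the RTL actually implements.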
As we move into the Blackwell era of AI computing, the bottleneck shifts from compute to memory bandwidth. This RTL-to-GDSII project demonstrates that by adopting the FP4 format supported by Blackwell-generation parts such as the B200 (Hopper parts such as the H200 do not offer FP4), we can significantly reduce the silicon footprint of local AI accelerators. I’m looking for feedback on how my implementation of the exponent bias compares to the shared scaling factors defined in the OCP MX specification.
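On the bias versus scaling-factor question, the two operate at different levels in MX. The element bias of 1 is fixed by the E2M1 format, while MXFP4 adds a shared 8-bit power-of-two scale (E8M0, bias 127, code 0xFF reserved for NaN) per block of 32 elements. The sketch below illustrates that relationship under those definitions; the function and constant names are hypothetical and nothing here is taken from this project's RTL.

```python
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E8M0_BIAS = 127    # bias of the shared power-of-two scale
E8M0_NAN = 0xFF    # scale code reserved for NaN
BLOCK_SIZE = 32    # elements sharing one scale in MXFP4


def decode_element(code: int) -> float:
    """Decode one E2M1 element (bit 3 = sign, low 3 bits index the magnitude)."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequantize_mxfp4_block(scale_code: int, element_codes: list) -> list:
    """Apply the shared block scale to every element of one MXFP4 block."""
    if len(element_codes) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} elements per block")
    if scale_code == E8M0_NAN:
        # A NaN scale makes the whole block NaN; elements cannot encode NaN.
        return [float("nan")] * BLOCK_SIZE
    scale = 2.0 ** (scale_code - E8M0_BIAS)
    return [scale * decode_element(c) for c in element_codes]


# A block of 6.0 elements with scale code 128 (2**1) dequantizes to 12.0 each.
block = dequantize_mxfp4_block(128, [0b0111] * BLOCK_SIZE)
```

In this view the arithmetic unit only ever sees the fixed-bias 4-bit elements. The block scale can be handled once per block outside the per-element datapath, for example by adding scale exponents when two blocks are multiplied.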
The unit implements the E2M1 format, the element format of MXFP4, one of the Microscaling (MX) formats supported by Blackwell. By targeting this specific bit-width, the design uses the same 4-bit element footprint that Blackwell’s second-generation Transformer Engine supports, providing a reference point for how these ultra-narrow formats translate to actual silicon area and power density in an open-source flow.
With the launch of the NVIDIA Blackwell architecture, the industry has seen the first major hardware validation of FP4 for ultra-low-precision AI inference. While Blackwell uses proprietary logic for its Tensor Cores, this project aims to create an open-source counterpart, an FP4 E2M1 Arithmetic Unit, specifically for the research community. My goal is to explore the physical implementation challenges of this OCP-standardized, Blackwell-supported format using the SKY130 process.
- The Power Grid: Mention your VDD/GND strap density.
- Density Percentage: If your OpenLane report says you have a “Core Utilization” of 60% or 70%, mention that.
- Dimensions: Give the specific size in microns (e.g., $100\mu m \times 100\mu m$).
- The Metal Layers: Mention that it’s a 5-metal layer stack (typical for SKY130).
This layout was generated using the OpenLane flow. The macro occupies roughly [X] $\mu m^2$ with a target density of [Y]%. Visible are the power distribution networks (PDN) and the signal routing for the 4-bit parallel arithmetic lanes.
