SOFTWARE STACK · ONNX-TO-BINARY COMPILER
7 stages. ONNX in. Binary out.
The NeuraEdge compiler transforms standard ONNX models into optimized binary instructions for the 256-MAC systolic array. Every stage is implemented in Python, fully readable, and extensible for custom operators.
COMPILER PIPELINE
From model to silicon in 7 stages.
ONNX Parser
Loads .onnx model, validates graph topology, extracts operator nodes and tensor shapes
Graph Optimizer
Fuses consecutive operators, eliminates dead nodes, constant-folds where possible
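As an illustration of the dead-node elimination mentioned above, here is a minimal sketch. The dictionary-based graph and the name `eliminate_dead_nodes` are assumptions made for this example; they are not the compiler's actual data structures or API.

```python
def eliminate_dead_nodes(inputs_of, graph_outputs):
    """Keep only nodes that contribute to at least one graph output.

    inputs_of: dict mapping node name -> list of producer node names
               (hypothetical representation, for illustration only).
    """
    live = set()
    stack = list(graph_outputs)
    # Walk backwards from the outputs and mark every producer as live.
    while stack:
        node = stack.pop()
        if node in live:
            continue
        live.add(node)
        stack.extend(inputs_of.get(node, []))
    # Preserve the original node order for the surviving nodes.
    return {n: srcs for n, srcs in inputs_of.items() if n in live}


graph = {
    "input": [],
    "conv1": ["input"],
    "relu1": ["conv1"],
    "debug_tap": ["conv1"],  # produces a value that no output consumes
    "fc": ["relu1"],
}
pruned = eliminate_dead_nodes(graph, ["fc"])
print(list(pruned))  # ['input', 'conv1', 'relu1', 'fc']
```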
Quantizer
INT8 quantization with per-tensor calibration. Preserves accuracy via min/max range mapping
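The min/max range mapping can be sketched roughly as follows. This assumes an asymmetric (scale plus zero point) scheme, which the description above does not confirm. The function names are hypothetical and the sample values are made up.

```python
def calibrate_min_max(values):
    """Derive one scale and zero point for a whole tensor from its min/max."""
    # Widen the range to include 0.0 so that zero maps to an exact INT8 code.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0
    if scale == 0.0:  # all-zero tensor: any positive scale works
        scale = 1.0
    zero_point = round(-128 - lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point):
    """Map floats to INT8 codes, clamping to the [-128, 127] range."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Map INT8 codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


weights = [-1.0, 0.0, 0.5, 2.0]  # made-up calibration data
scale, zero_point = calibrate_min_max(weights)
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
```

In this scheme the observed minimum lands on -128 and the maximum on 127, and each in-range value is restored to within about half a quantization step.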
Tensor Tiler
Splits large tensors into 2×2-tile-sized chunks that fit in PE-local SRAM (512 B per PE)
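A rough sketch of the tiling idea, using the 512-byte PE-local SRAM budget as the only constraint. The tile dimensions are free parameters here, and the 16×32 INT8 tile in the example is an assumption chosen only because it fills exactly 512 bytes. The compiler's real tile geometry, and how it shares SRAM between weights and activations, is not shown.

```python
SRAM_BYTES_PER_PE = 512  # PE-local SRAM size stated in the stage description


def tile_ranges(rows, cols, tile_rows, tile_cols, bytes_per_elem=1):
    """Yield (row_start, row_end, col_start, col_end) for each tile.

    Edge tiles are clipped to the tensor bounds rather than padded.
    """
    if tile_rows * tile_cols * bytes_per_elem > SRAM_BYTES_PER_PE:
        raise ValueError("tile does not fit in PE-local SRAM")
    for r in range(0, rows, tile_rows):
        for c in range(0, cols, tile_cols):
            yield (r, min(r + tile_rows, rows), c, min(c + tile_cols, cols))


# A 40x50 INT8 tensor split into 16x32 tiles (hypothetical shapes).
tiles = list(tile_ranges(40, 50, tile_rows=16, tile_cols=32))
print(len(tiles))  # 6
```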
Systolic Scheduler
Generates execution schedule for 256-MAC systolic array. Handles data dependencies and tile ordering
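Dependency-aware tile ordering is, at its core, a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+) on made-up tile names. It does not model the systolic array's timing or the compiler's actual scheduling heuristics.

```python
from graphlib import TopologicalSorter

# tile -> set of tiles whose results it consumes (hypothetical names)
deps = {
    "conv1_t0": set(),
    "conv1_t1": set(),
    "relu1_t0": {"conv1_t0"},
    "relu1_t1": {"conv1_t1"},
    "pool_t0": {"relu1_t0", "relu1_t1"},
}

# static_order() yields each tile only after all of its producers.
order = list(TopologicalSorter(deps).static_order())
position = {tile: i for i, tile in enumerate(order)}
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies contain a cycle, which is a convenient sanity check on the tiled graph.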
Descriptor Generator
Produces hardware instruction descriptors: DMA commands, PE configurations, NoC routing tables
Binary Generator
Emits final .npu binary with instruction stream, weight blobs, and activation buffers
SUPPORTED OPERATORS
14 operators. Categorized by function.
Additional operators are on the v2.0 roadmap. Adding custom operators is documented in the compiler architecture guide.
BENCHMARK RESULTS
Verified models. Measured performance.
All benchmarks are simulation-derived on the SKY130A process node using gate-level simulation with SPEF-extracted parasitics. No silicon measurement exists.
| Model | Latency | Energy | INT8/FP32 cosine similarity | Tiles | Binary size |
|---|---|---|---|---|---|
| ResNet-18 | 190.9 ms | 3.59 mJ | 0.911 | 25,864 | 404 KB |
| DS-CNN | 1.20 ms | 22.5 µJ | 0.934 | 1,247 | 28 KB |
| MobileNetV2 | 391 ms | — | 0.887 | 41,203 | 612 KB |
All benchmarks simulation-only (SKY130A). No silicon measurement exists.
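The similarity figures are described later on this page as INT8/FP32 cosine similarity. For readers unfamiliar with the metric, a minimal sketch of how it could be computed over two flattened output tensors follows. The sample vectors are made up and are not benchmark data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


fp32_out = [0.12, -0.80, 1.50, 0.03]  # made-up FP32 reference output
int8_out = [0.10, -0.75, 1.52, 0.00]  # made-up dequantized INT8 output
similarity = cosine_similarity(fp32_out, int8_out)
```

A value of 1.0 means the two outputs point in exactly the same direction; lower values indicate larger quantization error.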
OUTPUT FORMATS
Four output formats. One pipeline.
Raw binary instruction stream + weight blobs
Intel HEX format for firmware flashing
C header + source for embedded integration
Platform-specific firmware wrapper with driver API
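To show what the Intel HEX output involves, here is a small sketch that wraps a byte blob in standard Intel HEX data records. The record layout and checksum follow the public Intel HEX format. The function names are hypothetical, and this is not the compiler's own emitter.

```python
def ihex_record(address, record_type, data=b""):
    """Build one Intel HEX record: ':' + count + address + type + data + checksum."""
    body = bytes([len(data), (address >> 8) & 0xFF, address & 0xFF, record_type]) + data
    checksum = (-sum(body)) & 0xFF  # two's complement of the low byte of the sum
    return ":" + (body + bytes([checksum])).hex().upper()


def to_intel_hex(blob, base_address=0, chunk=16):
    """Encode a blob as data records (type 00) followed by an EOF record (type 01)."""
    if base_address + len(blob) > 0x10000:
        # Larger images need extended address records, omitted in this sketch.
        raise ValueError("blob exceeds the 16-bit address range")
    lines = [
        ihex_record(base_address + off, 0x00, blob[off:off + chunk])
        for off in range(0, len(blob), chunk)
    ]
    lines.append(ihex_record(0, 0x01))
    return "\n".join(lines)


print(to_intel_hex(bytes([0x02, 0x33, 0x7A]), base_address=0x0030))
# :0300300002337A1E
# :00000001FF
```

Plain data records carry only a 16-bit address, so binaries larger than 64 KB (such as the sizes listed in the benchmark table) would also need extended linear address records (type 04).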
OPERATOR COVERAGE
Operator coverage across common ONNX model families.
This matrix shows which models work today and which need v1.1 or v2.0 compiler support, so there are no surprises after purchase.
| Operator | v1.0 | v1.1 | v2.0 | Models requiring it |
|---|---|---|---|---|
| Conv2D | ✓ | — | — | All CNNs |
| MatMul / GEMM | ✓ | — | — | Transformers, BERT |
| ReLU / ReLU6 | ✓ | — | — | MobileNet, ResNet |
| MaxPool | ✓ | — | — | ResNet, EfficientNet |
| GlobalAvgPool | ✓ | — | — | MobileNetV2 |
| Concat / Reshape | ✓ | — | — | YOLO variants |
| BatchNorm | ✓ | — | — | Most CNNs |
| DepthwiseConv2D | — | ✓ | ✓ | MobileNetV2, EfficientNet |
| TransposeConv | — | — | ✓ | YOLO decoder, GANs |
| LayerNorm | — | — | ✓ | BERT-tiny, transformers |
| GELU / SiLU | — | — | ✓ | EfficientNet, YOLOv8 |
| Attention (MHSA) | — | — | ✓ | BERT, ViT |
| Slice / Gather | — | — | ✓ | YOLO, detection heads |
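The matrix can be turned into a quick pre-check. The sketch below reads each row's first ✓ as the first compiler version with support, which is an interpretation of the table rather than something it states. The operator names are the matrix's labels, not ONNX `op_type` strings, and `required_version` is a hypothetical helper.

```python
# First compiler version with support, transcribed from the matrix above.
FIRST_SUPPORTED = {
    "Conv2D": "v1.0", "MatMul": "v1.0", "GEMM": "v1.0",
    "ReLU": "v1.0", "ReLU6": "v1.0", "MaxPool": "v1.0",
    "GlobalAvgPool": "v1.0", "Concat": "v1.0", "Reshape": "v1.0",
    "BatchNorm": "v1.0", "DepthwiseConv2D": "v1.1",
    "TransposeConv": "v2.0", "LayerNorm": "v2.0", "GELU": "v2.0",
    "SiLU": "v2.0", "Attention": "v2.0", "Slice": "v2.0", "Gather": "v2.0",
}
VERSION_ORDER = ["v1.0", "v1.1", "v2.0"]


def required_version(model_ops):
    """Return the earliest compiler version that covers every operator."""
    unknown = sorted(op for op in model_ops if op not in FIRST_SUPPORTED)
    if unknown:
        raise KeyError(f"operators not in the coverage matrix: {unknown}")
    return max(
        (FIRST_SUPPORTED[op] for op in model_ops),
        key=VERSION_ORDER.index,
        default="v1.0",
    )


print(required_version({"Conv2D", "BatchNorm", "ReLU", "MaxPool"}))  # v1.0
print(required_version({"Conv2D", "DepthwiseConv2D", "ReLU6"}))      # v1.1
print(required_version({"MatMul", "LayerNorm", "GELU"}))             # v2.0
```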
COMPILER FREQUENCY MIGRATION
v2.0 compiler update required for TSMC 40nm.
v1.0 compiler targets: 50 MHz, SKY130A SRAM latencies
The update consists of:
— hardware_config_tsmc40nm.json: updated clock period, SRAM read latency, write latency, pipeline cycle counts
— Regression suite: cosine similarity re-verification for all supported models at 400 MHz timing
— ResNet-18 baseline: 0.911 INT8/FP32 cosine similarity to be re-verified post-migration
No RTL changes required for frequency migration. Compiler configuration file update only.
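Why a frequency change touches the compiler configuration can be seen from a short calculation: a fixed latency in nanoseconds spans a different number of clock cycles at 50 MHz than at 400 MHz. The 5 ns latency below is a made-up figure for illustration, not a SKY130A or TSMC 40nm value, and the function names are hypothetical.

```python
def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def latency_cycles(latency_ps, freq_mhz):
    """Whole clock cycles needed to cover a latency given in picoseconds.

    Uses integer ceiling division to avoid floating-point rounding.
    """
    return -(-latency_ps * freq_mhz // 1_000_000)


HYPOTHETICAL_SRAM_READ_PS = 5000  # 5 ns, illustrative only

for freq_mhz in (50, 400):
    print(freq_mhz, clock_period_ns(freq_mhz),
          latency_cycles(HYPOTHETICAL_SRAM_READ_PS, freq_mhz))
# 50 20.0 1
# 400 2.5 2
```

The same physical delay that fits in one cycle at 50 MHz needs two cycles at 400 MHz in this example, which is the kind of cycle-count change the updated configuration file has to capture.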
FULL SOFTWARE DELIVERY PACKAGE
32 files. Compiler, drivers, SDK, examples.
Every item in this package is a competitive advantage over vendors who deliver encrypted RTL with no software.
Compiler (14 Python modules)
Firmware / drivers (14 files)
SDK + Example
Evaluate the compiler yourself.
Schedule a technical review and walk through the compiler pipeline live. We will compile your model on screen and show you the binary output.