Compiling Your TFLite Model to C++

This guide outlines the steps required to compile a new TFLite model into your project using the Vela compiler.


Prerequisites

Ensure the following are installed on your system:

  1. Python 3.10 or later

  2. Visual Studio C++ Build Tools 14 or later (Windows only)

  3. A quantized TFLite model (INT8)


Instructions

Step 1: Install Vela Compiler

a. Install via pip:

pip install ethos-u-vela

This installs the latest version of the Vela compiler from PyPI.

b. Verify Installation:

vela --version

Expected version: 4.2.0


Step 2: Create a Virtual Environment

Navigate to the desired location and run:

  • Windows:

    python -m venv <V_ENV_NAME>
    
  • Linux:

    python -m venv <V_ENV_NAME>
    

Step 3: Activate the Virtual Environment

  • Windows:

    <V_ENV_NAME>\Scripts\activate.bat
    
  • Linux:

    source <V_ENV_NAME>/bin/activate
    

Step 4: Navigate to the MCU SDK Inference Directory

Go to:

<MCU SDK>/tools/Inference

Step 5: Install Additional Dependencies

Run:

pip install -r requirements.txt

Step 6: Prepare Your TFLite Model

Copy the .tflite file into <MCU SDK>/tools/Inference, or use the full path to the model in the next steps.


Step 7: Rename Your TFLite Model

Ensure the filename contains no special characters or spaces.
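
For example, a file such as "my model (v1).tflite" could be renamed as follows (the file names here are placeholders):

  • Windows:

    ren "my model (v1).tflite" my_model_v1.tflite

  • Linux:

    mv "my model (v1).tflite" my_model_v1.tflite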


Step 8: Compile the TFLite File

Use the script:

python infer_code_gen.py -t <path_to_tflite_model> [-o <output_directory>] [-n <namespace>] [-s <scripts>] [-i <input_files>] [-c <compiler>] [-tl <tflite_location>] [-p <optimization_strategy>]
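
For example, a hypothetical invocation might look like this (the model name, output directory, and namespace are placeholders):

python infer_code_gen.py -t person_detect.tflite -o gen_out -n person_detect -p Size -tl 1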

Step 9: Verify Output Files

After successful compilation, these files will appear in the output directory:

  • model.cc – model weights (see the sketch after this list)

  • model_io.cc – randomized input & expected output
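
The exact contents are produced by infer_code_gen.py, but as a rough illustration (the symbol names and layout below are assumptions, not the script's actual output), model.cc typically wraps the Vela-optimized TFLite flatbuffer in a byte array that can be placed in flash, while model_io.cc holds the randomized input and expected output buffers in a similar style:

#include <cstddef>
#include <cstdint>

// Illustrative sketch only: the real file is generated and its symbol names may differ.
// The array holds the Vela-optimized TFLite flatbuffer (graph plus weights).
alignas(16) const uint8_t model_data[] = {
    0x00, 0x00, 0x00, 0x00,  // ... generated byte values elided in this sketch ...
};
const size_t model_data_len = sizeof(model_data);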


Step 10: Rename the Compiled Model

Rename the compiled output model (Model.tflite) to:

<MODULE_NAME>.bin
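
For example, assuming the compiled model is named Model.tflite as above:

  • Windows:

    ren Model.tflite <MODULE_NAME>.bin

  • Linux:

    mv Model.tflite <MODULE_NAME>.bin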

Step 11: Generate Model.bin

The final model.bin can be generated using either VS Code or SynaToolkit.

Step 11.1: Generate Model.bin Using VS Code

Use this .bin file with the VS Code Image Converter to produce the final model.bin.

Refer to: Astra MCU SDK VSCode Extension User Guide.

Step 11.2: Generate Model.bin Using SynaToolkit

Use this .bin file with the SynaToolkit Image Generator to produce the final model.bin.

Refer to: SynaToolkit.


Commands for Inference Code Generation

For SRAM Optimization

Windows:

python infer_code_gen.py -t .\<MODEL_NAME>.tflite -o <OUT_DIR> -p Size -tl 1

Linux:

python infer_code_gen.py -t ./<MODEL_NAME>.tflite -o <OUT_DIR> -p Size -tl 1

For Flash Optimization

Windows:

python infer_code_gen.py -t <MODEL_NAME>.tflite -o <OUT_DIR> -p Performance -tl 2

Linux:

python infer_code_gen.py -t <MODEL_NAME>.tflite -o <OUT_DIR> -p Performance -tl 2

Note:

The infer_code_gen.py script allows performance tuning via the --arena-cache-size parameter (e.g., 1 MB, 1.25 MB, 1.5 MB) in its vela_params list (see lines ~98-99 of the script). Experimenting with this value can help optimize the memory footprint (Total SRAM used, Total On-chip Flash used) and inference speed.

vela_params = ['vela', '--output-dir', os.path.dirname(args.tflite_path), '--accelerator-config=ethos-u55-128', '--optimise=' + args.optimize, '--config=Arm\\vela.ini', memory_mode, '--system-config=Ethos_U55_High_End_Embedded', args.tflite_path, '--arena-cache-size=1500000']
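
For example, a standalone Vela invocation trying a 1.25 MB arena cache might look roughly like this (the memory-mode flag is assumed to match the Sram_Only mode shown in the Vela output below; adjust the config path to your setup):

vela --output-dir <OUT_DIR> --accelerator-config=ethos-u55-128 --optimise=Size --config=Arm\vela.ini --memory-mode=Sram_Only --system-config=Ethos_U55_High_End_Embedded <MODEL_NAME>.tflite --arena-cache-size=1250000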

Vela Output

This section provides a summary of the model compilation results generated by Arm Vela, detailing the network’s characteristics and estimated performance on the Ethos-U55 NPU.

Network summary for hl
Accelerator configuration          Ethos_U55_128
System configuration               Ethos_U55_High_End_Embedded
Memory mode                        Sram_Only
Accelerator clock                  500 MHz
Design peak SRAM bandwidth         3.73 GB/s
Design peak On-chip Flash bandwidth  3.73 GB/s

Total SRAM used                    350.00 KiB
Total On-chip Flash used           1322.48 KiB

CPU operators = 4 (6.0%)
NPU operators = 63 (94.0%)

Average SRAM bandwidth             1.68 GB/s
Input   SRAM bandwidth             18.34 MB/batch
Weight  SRAM bandwidth             0.00 MB/batch
Output  SRAM bandwidth             6.93 MB/batch
Total   SRAM bandwidth             25.27 MB/batch
Total   SRAM bandwidth             per input     25.27 MB/inference (batch size 1)

Average On-chip Flash bandwidth    0.23 GB/s
Input   On-chip Flash bandwidth    0.00 MB/batch
Weight  On-chip Flash bandwidth    3.24 MB/batch
Output  On-chip Flash bandwidth    0.00 MB/batch
Total   On-chip Flash bandwidth    3.44 MB/batch
Total   On-chip Flash bandwidth    per input      3.44 MB/inference (batch size 1)

Original Weights Size              2688.16 KiB
NPU Encoded Weights Size           819.64 KiB

Neural network macs                331175776 MACs/batch

Info: The numbers below are internal compiler estimates.
For performance numbers the compiled network should be run on an FVP Model or FPGA.

Network Tops/s                     0.04 Tops/s

NPU cycles                         7447496 cycles/batch
SRAM Access cycles                 3378876 cycles/batch
DRAM Access cycles                       0 cycles/batch
On-chip Flash Access cycles        450977 cycles/batch
Off-chip Flash Access cycles             0 cycles/batch
Total cycles                       7542742 cycles/batch

Memory Allocation Notes

When configuring memory for your project, keep the following in mind:

  • Tensor Arena Size: Set this to at least the Total SRAM used value reported in the Vela output. We strongly recommend adding at least 10 KB on top of this as a buffer to ensure smooth operation and account for runtime overheads (see the sketch following these notes).

  • Model Weights: The model’s weights will reside in the space indicated by Total On-chip Flash used.
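
As a minimal sketch, assuming TensorFlow Lite for Microcontrollers and the figures from the example Vela output above (your SDK's actual buffer names and interpreter API may differ), the tensor arena could be declared like this:

#include <cstddef>
#include <cstdint>

// Total SRAM used from the Vela output (350 KiB) plus a ~10 KiB safety margin.
constexpr size_t kTensorArenaSize = (350 + 10) * 1024;

// The arena must live in SRAM; 16-byte alignment is a common TFLM requirement.
alignas(16) static uint8_t tensor_arena[kTensorArenaSize];

// The arena and its size are then passed to the TFLM interpreter, for example:
// tflite::MicroInterpreter interpreter(model, op_resolver, tensor_arena, kTensorArenaSize);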