Compiling Your TFLite Model to C++
This guide outlines the steps required to compile a new TFLite model into your project using the Vela compiler.
Prerequisites
Ensure the following are installed on your system:
Python 3.10 or later
Visual Studio C++ Build Tools 14 or later (Windows only)
A quantized TFLite model (INT8)
Instructions
Step 1: Install Vela Compiler
a. Install via pip:
pip install ethos-u-vela
This installs the latest version of the Vela compiler from PyPI.
b. Verify Installation:
vela --version
Expected version: 4.2.0
Step 2: Create a Virtual Environment
Navigate to the desired location and run:
Windows:
python -m venv <V_ENV_NAME>
Linux:
python -m venv <V_ENV_NAME>
Step 3: Activate the Virtual Environment
Windows:
<V_ENV_NAME>\Scripts\activate.bat
Linux:
source <V_ENV_NAME>/bin/activate
Note: If Vela was installed outside this environment in Step 1, you may need to run pip install ethos-u-vela again inside the activated environment so the vela command is available there.
Step 4: Install Additional Dependencies
Run:
pip install -r requirements.txt
Step 5: Prepare Your TFLite Model
Copy the .tflite file into <MCU SDK>/tools/Inference
OR
Use the full path to the model in the next steps.
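For example, to copy the model into place on Linux (the model filename here is illustrative):
cp person_detect_int8.tflite <MCU SDK>/tools/Inference/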
Step 6: Rename Your TFLite Model
Ensure the filename contains no special characters or spaces.
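For example, a filename containing spaces and parentheses can be renamed as follows (filenames are illustrative):
Windows:
ren "person detect (v2).tflite" person_detect_v2.tflite
Linux:
mv "person detect (v2).tflite" person_detect_v2.tflite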
Step 7: Compile the TFLite File
Use the script:
python infer_code_gen.py -t <path_to_tflite_model> [-o <output_directory>] [-n <namespace>] [-s <scripts>] [-i <input_files>] [-c <compiler>] [-tl <tflite_location>] [-p <optimization_strategy>]
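For example, a typical invocation might look like this (the model name, output directory, and namespace below are illustrative):
python infer_code_gen.py -t person_detect_int8.tflite -o gen_model -n person_detect -p Size -tl 1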
Step 8: Verify Output Files
After successful compilation, these files will appear in the output directory:
model.cc – model weights
model_io.cc – randomized input & expected output
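The exact contents depend on the generator, but model.cc typically embeds the compiled network as a C array. A minimal illustrative sketch follows (the symbol and namespace names here are assumptions, not the generator's guaranteed output):

#include <cstdint>

namespace model {  // hypothetical namespace, e.g. the value passed via -n

// Vela-optimized model flatbuffer, 16-byte aligned for the NPU runtime
alignas(16) const unsigned char model_data[] = {
    0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33,  // "TFL3" flatbuffer identifier
    // ... remaining model bytes elided ...
};
const unsigned int model_data_len = sizeof(model_data);

}  // namespace model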
Step 9: Rename the Compiled Model
Rename the output Model.tflite file to:
<MODULE_NAME>.bin
Step 10: Generate Model.bin
The final model.bin can be generated using either VS Code or SynaToolkit, as described below.
Step 10.1: Generate Model.bin Using VS Code
Use this .bin file with the VS Code Image Converter to produce the final model.bin.
Refer to: Astra MCU SDK VSCode Extension User Guide.
Step 10.2: Generate Model.bin Using SynaToolkit
Use this .bin file with the SynaToolkit Image Generator to produce the final model.bin.
Refer to: SynaToolkit.
Commands for Inference Code Generation
For SRAM Optimization
Windows:
python infer_code_gen.py -t .\<MODEL_NAME>.tflite -o <OUT_DIR> -p Size -tl 1
Linux:
python infer_code_gen.py -t ./<MODEL_NAME>.tflite -o <OUT_DIR> -p Size -tl 1
For Flash Optimization
Windows:
python infer_code_gen.py -t <MODEL_NAME>.tflite -o <OUT_DIR> -p Performance -tl 2
Linux:
python infer_code_gen.py -t <MODEL_NAME>.tflite -o <OUT_DIR> -p Performance -tl 2
Note:
The infer_code_gen.py script allows performance tuning via the --arena-cache-size parameter (e.g., 1 MB, 1.25 MB, 1.5 MB) in its vela_params list (see lines ~98-99 of the script). Experimenting with this value can help optimize the memory footprint (e.g., Total SRAM used, Total On-chip Flash used) and inference speed.
vela_params = [
    'vela',
    '--output-dir', os.path.dirname(args.tflite_path),
    '--accelerator-config=ethos-u55-128',
    '--optimise=' + args.optimize,
    '--config=Arm\\vela.ini',
    memory_mode,
    '--system-config=Ethos_U55_High_End_Embedded',
    args.tflite_path,
    '--arena-cache-size=1500000',
]
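To experiment with these options outside the script, an equivalent Vela command can be run directly; the 1 MiB arena cache size, output directory, and model name here are illustrative (adjust the config path for your platform):
vela --accelerator-config=ethos-u55-128 --optimise Size --config Arm/vela.ini --memory-mode Sram_Only --system-config Ethos_U55_High_End_Embedded --arena-cache-size=1048576 --output-dir gen_model person_detect_int8.tflite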
Vela Output
This section provides a summary of the model compilation results generated by Arm Vela, detailing the network’s characteristics and estimated performance on the Ethos-U55 NPU.
Network summary for hl
Accelerator configuration Ethos_U55_128
System configuration Ethos_U55_High_End_Embedded
Memory mode Sram_Only
Accelerator clock 500 MHz
Design peak SRAM bandwidth 3.73 GB/s
Design peak On-chip Flash bandwidth 3.73 GB/s
Total SRAM used 350.00 KiB
Total On-chip Flash used 1322.48 KiB
CPU operators = 4 (6.0%)
NPU operators = 63 (94.0%)
Average SRAM bandwidth 1.68 GB/s
Input SRAM bandwidth 18.34 MB/batch
Weight SRAM bandwidth 0.00 MB/batch
Output SRAM bandwidth 6.93 MB/batch
Total SRAM bandwidth 25.27 MB/batch
Total SRAM bandwidth per input 25.27 MB/inference (batch size 1)
Average On-chip Flash bandwidth 0.23 GB/s
Input On-chip Flash bandwidth 0.00 MB/batch
Weight On-chip Flash bandwidth 3.24 MB/batch
Output On-chip Flash bandwidth 0.00 MB/batch
Total On-chip Flash bandwidth 3.44 MB/batch
Total On-chip Flash bandwidth per input 3.44 MB/inference (batch size 1)
Original Weights Size 2688.16 KiB
NPU Encoded Weights Size 819.64 KiB
Neural network macs 331175776 MACs/batch
Info: The numbers below are internal compiler estimates.
For performance numbers the compiled network should be run on an FVP Model or FPGA.
Network Tops/s 0.04 Tops/s
NPU cycles 7447496 cycles/batch
SRAM Access cycles 3378876 cycles/batch
DRAM Access cycles 0 cycles/batch
On-chip Flash Access cycles 450977 cycles/batch
Off-chip Flash Access cycles 0 cycles/batch
Total cycles 7542742 cycles/batch
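For reference, these figures imply an estimated latency of roughly 7542742 cycles / 500 MHz ≈ 15.1 ms per inference (batch size 1). As noted above, this is a compiler estimate; real performance should be measured on an FVP model or FPGA.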
Memory Allocation Notes
When configuring memory for your project, keep the following in mind:
Tensor Arena Size: This should be set to at least the Total SRAM used value provided in the Vela output. We strongly recommend adding a minimum of 10 KB extra as a buffer to ensure smooth operation and to account for any runtime overheads.
Model Weights: The model's weights will reside in the space indicated by Total On-chip Flash used.
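For instance, using the 350.00 KiB Total SRAM used figure from the Vela output above, a TensorFlow Lite Micro style arena declaration might look like the following sketch (the runtime and symbol names are assumptions; substitute the values reported for your own model):

#include <cstdint>

// "Total SRAM used" reported by Vela (350 KiB) plus the recommended 10 KB buffer
constexpr int kTensorArenaSize = (350 * 1024) + (10 * 1024);

// 16-byte alignment is a common requirement for TFLite Micro / Ethos-U drivers
alignas(16) static uint8_t tensor_arena[kTensorArenaSize];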