Compiling a TFLite Model with Vela (SR110)
This guide explains how to compile a quantized TFLite model using the Vela compiler and generate the C++ sources used by SR110 inference examples. It uses the SDK inference tool under tools/Inference.
Throughout this guide, <sdk-root> refers to the directory where you extracted or cloned the SDK.
Prerequisites
- A quantized INT8 .tflite model.
- Python 3.7–3.10.
- Visual Studio C++ Build Tools (Windows only; required for some Python packages).
Recommendation: Use a separate venv for inference tools. The SDK build uses Python 3.13, which is not compatible with the inference tool dependencies.
Set up a Virtual Environment
Create and activate a venv. Make sure the python you use is 3.7–3.10:
python --version
Windows:
python -m venv my_venv
my_venv\Scripts\activate.bat
Linux/macOS:
python -m venv ~/my_venv
source ~/my_venv/bin/activate
Keep the venv active for the rest of this guide.
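If you want to confirm programmatically that the interpreter you activated falls in the supported range, a minimal sketch:

```python
import sys

def is_supported(version=None):
    """Return True when the interpreter version is in the 3.7-3.10 range
    required by the inference tool dependencies."""
    major, minor = version or sys.version_info[:2]
    return (3, 7) <= (major, minor) <= (3, 10)

print(is_supported())  # True only inside a 3.7-3.10 venv
```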
Install Vela
The inference tool expects the vela command on PATH.
Install Vela from the repository and checkout the validated version:
git clone https://review.mlplatform.org/ml/ethos-u/ethos-u-vela.git
cd ethos-u-vela
git checkout 4.2
pip install .
Verify:
vela --version
Install requirements from the SDK inference folder:
cd <sdk-root>/tools/Inference
pip install -r requirements.txt
Run the Inference Tool
From <sdk-root>/tools/Inference:
python infer_code_gen.py -t <path_to_tflite_model> \
[-o <output_directory>] \
[-n <namespace>] \
[-s <scripts>] \
[-i <input_files>] \
[-c <compiler>] \
[-tl <tflite_location>] \
[-p <optimization_strategy>]
Key options (from the script):
- -c/--compiler: vela (default) or none
- -p/--optimize: Performance (default) or Size
- -tl/--tflite_loc: 1 = SRAM, 2 = FLASH
- -s/--script: model and/or inout (default runs both)
- -i/--input: optional .npy/.bin inputs for expected output generation
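For -i, an input file can be generated with NumPy. A minimal sketch; the shape and dtype here are assumptions — match them to your model's actual input tensor:

```python
import numpy as np

# Hypothetical input for a 96x96 grayscale INT8 vision model; change the
# shape and dtype to whatever your .tflite model's input tensor expects.
sample = np.random.randint(-128, 128, size=(1, 96, 96, 1), dtype=np.int8)
np.save("sample_input.npy", sample)  # pass to the tool via: -i sample_input.npy
```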
About -tl: This switch affects both Vela's memory planning and the generated C++ attribute.
-tl 1 uses SRAM (--memory-mode=Sram_Only and MODEL_TFLITE_ATTRIBUTE).
-tl 2 targets flash/QSPI (--memory-mode=Shared_Sram and MODEL_TFLITE_ATTRIBUTE_FLASH).
You still need the VS Code Image Conversion step to produce a flashable model binary.
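The mapping described above can be summarized as a small helper. A sketch mirroring the documented behavior, not the script's actual code:

```python
def tl_settings(tflite_loc):
    """Map the -tl switch to Vela's memory mode and the generated C++
    attribute, as documented for infer_code_gen.py."""
    if tflite_loc == 1:   # model lives in SRAM
        return ("--memory-mode=Sram_Only", "MODEL_TFLITE_ATTRIBUTE")
    if tflite_loc == 2:   # model lives in flash/QSPI
        return ("--memory-mode=Shared_Sram", "MODEL_TFLITE_ATTRIBUTE_FLASH")
    raise ValueError("tflite_loc must be 1 (SRAM) or 2 (FLASH)")
```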
Tuning Vela memory planning: Vela supports --arena-cache-size <bytes> to cap the arena it assumes during compilation.
infer_code_gen.py does not expose this flag. To use it, either:
- Run Vela manually with --arena-cache-size, then generate code/IO without re-compiling:

  vela --arena-cache-size <bytes> --output-dir <OUT_DIR> <MODEL_NAME>.tflite
  python infer_code_gen.py -t <OUT_DIR>/<MODEL_NAME>_vela.tflite -c none -o <OUT_DIR>
- Or, add --arena-cache-size to the vela_params list inside tools/Inference/infer_code_gen.py:

  vela_params = ['vela', '--output-dir', args.output_dir,
                 '--accelerator-config=ethos-u55-128',
                 '--optimise=' + args.optimize, '--config=Arm\\vela.ini', memory_mode,
                 '--system-config=Ethos_U55_High_End_Embedded', args.tflite_path,
                 '--arena-cache-size=1500000']
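The manual-invocation route can also be scripted. A sketch that builds the command line with the same core flags the script uses (omitting a few, such as --config, for brevity) and adds the cap only when requested:

```python
def build_vela_cmd(model_path, out_dir,
                   memory_mode="--memory-mode=Sram_Only",
                   arena_cache_bytes=None):
    """Build a Vela command line; arena_cache_bytes, when given, caps the
    arena Vela assumes during compilation. A sketch — adjust flags as needed."""
    cmd = ["vela", "--output-dir", out_dir,
           "--accelerator-config=ethos-u55-128",
           "--system-config=Ethos_U55_High_End_Embedded",
           memory_mode]
    if arena_cache_bytes is not None:
        cmd.append(f"--arena-cache-size={arena_cache_bytes}")
    cmd.append(model_path)
    return cmd

# To execute: subprocess.run(build_vela_cmd("model.tflite", "out"), check=True)
```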
Outputs
In the output directory you will see:
- <namespace>.cc (model C++ source + resolver content)
- <namespace>_io.cc (input/expected output data)
- <model>_vela.tflite (when -c vela)
- output_*.bin and output_*.npy (expected outputs)
- <namespace>_micro_mutable_op_resolver.hpp (intermediate header, appended into <namespace>.cc)
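One way to sanity-check a deployment is to compare the device's inference result against the expected outputs the tool saved. A sketch, assuming the result is available as NumPy-compatible data:

```python
import numpy as np

def outputs_match(expected_npy, actual, tolerance=0):
    """Compare an inference result against an output_*.npy expected output.

    For INT8 models exact equality is typical; a small nonzero tolerance can
    absorb rounding differences (an assumption — adapt to your deployment)."""
    expected = np.load(expected_npy)
    return bool(np.allclose(expected, np.asarray(actual), atol=tolerance))
```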
Prepare a Flashable Model Binary (VS Code)
If you plan to place model weights in flash, you must convert the Vela output into a flashable model binary using the Astra MCU SDK VS Code Extension:
1. Rename the Vela output <model>_vela.tflite from .tflite to .bin (the contents are unchanged).
2. In VS Code, open Build and Deploy → Image Conversion.
3. Use the Advanced Configurations options to generate a Model Binary from the renamed .bin.
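The rename in step 1 can be scripted. A minimal sketch that keeps the original file and writes a .bin copy next to it:

```python
import shutil

def make_flashable_copy(vela_tflite_path):
    """Copy <model>_vela.tflite to <model>_vela.bin; the contents are
    unchanged, only the extension differs."""
    bin_path = vela_tflite_path.rsplit(".tflite", 1)[0] + ".bin"
    shutil.copy(vela_tflite_path, bin_path)
    return bin_path
```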
For details on the Image Conversion workflow, see Astra MCU SDK VS Code Extension User Guide.
Common Usage Examples
Size-optimized (SRAM):
python infer_code_gen.py -t <MODEL_NAME>.tflite -o <OUT_DIR> -p Size -tl 1
Performance-optimized (FLASH):
python infer_code_gen.py -t <MODEL_NAME>.tflite -o <OUT_DIR> -p Performance -tl 2
Notes
- Filenames should avoid spaces or special characters.
- The inference tool is maintained under tools/Inference. If behavior changes, check the inference tool README and the script help output: python infer_code_gen.py -h
Memory Allocation Notes
When configuring memory for your project, keep the following in mind:
- Tensor Arena Size: Set this to at least the "Total SRAM used" value printed by Vela. Add a minimum of 10 KB extra as a buffer for runtime overhead.
- The tensor arena size is not set by infer_code_gen.py. The -p Size option only changes Vela's optimization strategy, not your arena allocation.
- Set the arena size in your application code (for example, TENSOR_ARENA_SIZE in examples/SR110_RDK/vision_examples/<usecase>/infer.cc, or the arena buffer in examples/SR110_RDK/inference_examples/<app>/<app>.cc).
- Use get_used_tensor_arena_size() at runtime to size it properly, then keep a small safety margin.
- Model Weights: Weights reside in the space indicated by "Total On-chip Flash used" in the Vela output.
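The arena recommendation above can be computed from Vela's console summary. A sketch that assumes the summary reports a line of the form "Total SRAM used ... KiB":

```python
import re

def suggested_arena_size(vela_summary_text, margin=10 * 1024):
    """Extract 'Total SRAM used' from Vela's summary (assumed to be reported
    in KiB) and add a safety margin, 10 KB by default, for runtime overhead."""
    m = re.search(r"Total SRAM used\s+([\d.]+)\s*KiB", vela_summary_text)
    if not m:
        raise ValueError("'Total SRAM used' not found in Vela output")
    return int(float(m.group(1)) * 1024) + margin
```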