Voice agent
End-to-end setup for the operator-visible “say a wake phrase, get a reply” path. The firmware ships with no LLM, STT, or TTS — instead it runs an on-device wake-word detector and forwards captured audio to an HTTP sidecar the operator points it at. Everything is opt-in; without configuration the device stays local-first and the voice path is inert.
This page is the onboarding flow. For the wire protocol the firmware speaks to the sidecar, see Sidecar agent; for the underlying audio capture path, see UDP audio debug.
The loop, as seen on the device
- Wake: the wake word fires, or the operator hits POST /listen on the dashboard. The avatar plays an acknowledge chirp, the head eases into a cocked-head listen pose, and the ear decorator appears in the upper-left of the face.
- Capture: the firmware records ~4 s of 16 kHz mono PCM from the ES7210 microphone (the wake-word fire defaults to 4 000 ms; POST /listen defaults to 3 000 ms; the body’s duration_ms field overrides).
- Round-trip: the captured PCM uploads to the sidecar over plain HTTP/1.1. The ear fades and a thought-bubble appears at the upper-right while the request is in flight.
- Reply: the sidecar responds with {"text": "...", "emotion": "...", "audio_url": "..."}. The thought-bubble fades, the avatar’s emotion mirrors the tag, and the reply text scrolls in the toast band beneath the face.
- Speak (optional): if the sidecar synthesised TTS, the firmware fetches the audio URL and plays the reply through the AW88298 amp. Audio is fetched lazily after the text + emotion surface, so a slow cloud TTS provider never delays the visible reply.
If the sidecar fails or times out, the thought-bubble fades, the
face flips to Sad for ~2.5 s, and a warn-class toast (sidecar:
post failed, sidecar: timed out, sidecar: link down) explains
what happened. The avatar returns to autonomous behavior after the
hold expires. If text + emotion shipped but the audio fetch failed,
the avatar plays no audio and the failure stays log-only — the
visible reply is already on screen and “missing voice” doesn’t
warrant a face-level Sad beat.
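The reply handling described above can be sketched in a few lines. This is an illustrative Python model, not the firmware's code (which is not shown on this page); the names Reply and parse_reply are hypothetical, and the rule that a missing text field counts as a failed post is taken from the troubleshooting notes further down.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reply:
    text: str
    emotion: Optional[str]
    audio_url: Optional[str]  # None means "no audio to fetch"


def parse_reply(status: int, body: bytes) -> Optional[Reply]:
    """Return a Reply, or None when the exchange counts as 'post failed'."""
    if not 200 <= status < 300:
        return None
    try:
        doc = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return None
    # A reply without a top-level string "text" field is treated as a failure.
    if not isinstance(doc, dict) or not isinstance(doc.get("text"), str):
        return None
    return Reply(doc["text"], doc.get("emotion"), doc.get("audio_url"))
```

A caller would render text and emotion as soon as parse_reply returns, and only then look at audio_url, which mirrors the lazy audio fetch.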
Prerequisites
- Wi-Fi configured via /sd/STACKCHAN.RON so the firmware has a route to the sidecar. SD-less boots and units without a Wi-Fi station keep the voice task parked.
- SD card present and writable. The runtime config + wake-word model both live on it.
- Optional but recommended: a sidecar reachable from the device’s LAN, served on a raw IPv4 literal (DNS resolution is not wired up).
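Because the sidecar must be addressed by a raw IPv4 literal over plain HTTP, it can help to check a candidate URL before writing it to the SD card. The following is a minimal sketch using only the Python standard library; is_usable_sidecar_url is a hypothetical helper, not something the repo is known to ship.

```python
import ipaddress
from urllib.parse import urlsplit


def is_usable_sidecar_url(url: str) -> bool:
    """True when the URL is plain http:// and its host is an IPv4 literal."""
    parts = urlsplit(url)
    if parts.scheme != "http" or parts.hostname is None:
        return False
    try:
        ipaddress.IPv4Address(parts.hostname)
    except ipaddress.AddressValueError:
        # Hostnames (and IPv6 literals) land here: the device has no DNS.
        return False
    return True
```

For example, http://192.168.1.42:8080/v1/listen passes, while http://sidecar.local:8080/v1/listen does not.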
Enable the sidecar (cloud-or-not, you choose)
Add the agent block to /sd/STACKCHAN.RON:
behavior: (
    agent_sidecar_url: "http://192.168.1.42:8080/v1/listen",
    agent_sidecar_token: "sk-sidecar-shared-secret",
    // ... other behavior flags ...
)
Empty agent_sidecar_url parks the agent task entirely — the
firmware still runs the cosmetic listen window (ear decorator + ack
chirp) on every POST /listen, but never posts audio anywhere.
The dashboard’s Settings panel exposes both fields and round-trips
them through PUT /settings with ***-redaction on GET, the same
shape as the Wi-Fi PSK and the dashboard auth token.
Reference sidecar
The repo’s sidecar/ directory carries a
working implementation: Python 3.12 + FastAPI,
faster-whisper for STT, configurable LLM backend (Anthropic, OpenAI,
or Ollama), per-X-Session-Id conversation memory, structured JSON
logs, /healthz probe, Dockerfile, and an example systemd unit.
Operators who want a turnkey loop run uv sync && uv run
stackchan-sidecar inside sidecar/ and point the firmware at it.
The persona prompt is a single editable markdown file at
sidecar/personas/stack-chan.md. Copy it for a different voice.
Multiple personas on one sidecar
One sidecar can serve several personas — the firmware picks which
one to use per request via the X-Persona-Name header.
- Drop more persona files into sidecar/personas/, one .md per voice (desk-buddy.md, wake-only.md, etc.). Each follows the same frontmatter shape as the bundled stack-chan.md.
- Set persona_name: "desk-buddy" in the behavior: block of the device’s /sd/STACKCHAN.RON (RON uses : between key and value, not =). Equivalent JSON for PUT /settings: "persona_name": "desk-buddy". Empty / unset keeps the firmware’s wire surface header-free, and the sidecar falls back to its baked-in settings.persona.
- Confirm with curl http://sidecar:port/v1/personas. The response lists every deployed slug plus the configured default and a default_deployed flag that tells you upfront whether the install is healthy.
Validation runs at both boundaries: the firmware rejects slugs
longer than 64 bytes, with control characters, or containing path
separators / ..; the sidecar re-checks per request so a curl
caller bypassing the firmware can’t pivot the filesystem lookup.
Invalid slug → 400; well-formed but missing file → 404;
sidecar’s default missing → 500 (operator misconfig, not your
caller’s fault).
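The slug rules above translate into a short predicate. This is a sketch of the stated rules only (length, control characters, path separators, ..), written in Python for illustration; the function name is hypothetical and neither the firmware's nor the sidecar's actual check is reproduced here. It assumes an empty slug has already been handled as "no persona header".

```python
def is_valid_persona_slug(slug: str) -> bool:
    """Apply the documented slug rules; empty slugs are assumed handled upstream."""
    if not slug or len(slug.encode("utf-8")) > 64:
        return False  # limit is in bytes, not characters
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in slug):
        return False  # ASCII control characters
    if "/" in slug or "\\" in slug or ".." in slug:
        return False  # path separators or parent-directory hops
    return True
```

Running the same predicate on both sides of the wire is what keeps a direct curl caller from steering the persona file lookup.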
Conversation memory is partitioned per (session_id, persona).
A device that switches personas under the same session_id (a
firmware reflash, or a future mid-runtime swap) gets a clean
slate for the new voice — the old voice’s history stays
addressable under its own bucket until TTL expires.
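One way to picture the partitioning is a dictionary keyed by the (session_id, persona) pair, with a timestamp per bucket. The class below is a hypothetical sketch, not the reference sidecar's implementation; the 900-second TTL is a placeholder, since the real value is not stated here.

```python
import time
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (session_id, persona)


class ConversationMemory:
    def __init__(self, ttl_s: float = 900.0) -> None:  # TTL value is an assumption
        self.ttl_s = ttl_s
        self._buckets: Dict[Key, Tuple[float, List[str]]] = {}

    def history(self, session_id: str, persona: str,
                now: Optional[float] = None) -> List[str]:
        """Return a copy of the bucket's turns, or [] if absent or expired."""
        now = time.monotonic() if now is None else now
        entry = self._buckets.get((session_id, persona))
        if entry is None or now - entry[0] > self.ttl_s:
            return []
        return list(entry[1])

    def append(self, session_id: str, persona: str, turn: str,
               now: Optional[float] = None) -> None:
        """Add a turn and refresh the bucket's timestamp."""
        now = time.monotonic() if now is None else now
        turns = self.history(session_id, persona, now)
        turns.append(turn)
        self._buckets[(session_id, persona)] = (now, turns)
```

Switching persona under the same session_id simply addresses a different key, so the new voice starts with an empty history while the old bucket ages out on its own.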
Pick a TTS provider (optional)
Without TTS configured, the sidecar still ships text + emotion and
the firmware renders them on the toast band. Adding a TTS provider
makes the avatar actually speak the reply over the AW88298. All
providers emit raw 16 kHz mono s16 LE PCM, cached behind a
short-lived /v1/audio/<token> URL the firmware fetches lazily
after the reply text surfaces.
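Raw PCM has no header, so most audio players will not open it directly. As a debugging aid, the sketch below wraps a 16 kHz mono s16 LE buffer in a WAV container using Python's standard wave module; the helper names are hypothetical and the snippet is independent of the sidecar's code.

```python
import io
import wave

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # s16 LE, mono


def pcm_duration_ms(pcm: bytes) -> int:
    """Playback length of a raw 16 kHz mono s16 buffer, in milliseconds."""
    return len(pcm) * 1000 // (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def pcm_to_wav(pcm: bytes) -> bytes:
    """Prepend a WAV header so the buffer opens in ordinary audio tools."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(BYTES_PER_SAMPLE)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return buf.getvalue()
```

At this format one second of audio is 32 000 bytes, so a 4 s clip is about 128 kB before any container overhead.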
Set tts_provider in the sidecar’s .env:
| Provider | Setup cost | Voice quality | Per-call cost |
|---|---|---|---|
| espeak_ng (default) | Install espeak-ng binary (apt install espeak-ng / brew install espeak-ng). No model file, no API key. | Robotic, intentionally so. The desk-toy aesthetic. | Free, local. |
| piper | Install piper binary + download an ONNX voice model + JSON metadata. ~50 MB per voice. | Competent neural. The middle option: much clearer than espeak-ng without an API bill. | Free, local. |
| elevenlabs | Set ELEVENLABS_API_KEY + (optionally) elevenlabs_voice_id. Requires the Starter tier ($5/mo) or higher: the free tier returns MP3 only, and the sidecar asks for raw PCM at 16 kHz so the firmware skips an MP3 decoder. | Human-grade. Indistinguishable from a recording. | Cloud API; per-character billing. |
Example .env:
# espeak-ng (default) — robot voice, zero setup
SIDECAR_BEARER_TOKEN=...
ANTHROPIC_API_KEY=...
tts_provider=espeak_ng
# OR: piper — local neural
tts_provider=piper
piper_model_path=/home/operator/piper-voices/en_US-libritts_r-medium.onnx
piper_speaker_id=0 # optional, for multi-speaker models
# OR: elevenlabs — cloud neural
tts_provider=elevenlabs
ELEVENLABS_API_KEY=sk_...
elevenlabs_voice_id=21m00Tcm4TlvDq8ikWAM # default: Rachel
elevenlabs_model_id=eleven_turbo_v2_5 # default: latency-optimised
The sidecar refuses to start on misconfig (e.g. tts_provider=piper
with no piper_model_path, or tts_provider=elevenlabs with no
API key), so a typo doesn’t silently fall back to text-only.
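The fail-fast behavior can be pictured as a check that runs once at startup. The function below is a hypothetical sketch, not the sidecar's actual validation; it uses only the setting names shown in the example .env, and the set of accepted provider names is assumed from the table above.

```python
from typing import Mapping

# Assumption: these are the only accepted values, per the provider table.
KNOWN_PROVIDERS = {"espeak_ng", "piper", "elevenlabs"}


def check_tts_config(env: Mapping[str, str]) -> None:
    """Raise ValueError on an inconsistent TTS configuration."""
    provider = env.get("tts_provider", "espeak_ng")
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown tts_provider: {provider!r}")
    if provider == "piper" and not env.get("piper_model_path"):
        raise ValueError("tts_provider=piper requires piper_model_path")
    if provider == "elevenlabs" and not env.get("ELEVENLABS_API_KEY"):
        raise ValueError("tts_provider=elevenlabs requires ELEVENLABS_API_KEY")
```

Raising at startup, instead of falling back, means a misspelled provider name surfaces immediately and not as a silent text-only deployment.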
If TTS fails per request (synthesis exception, ElevenLabs returns
401, piper binary missing), the sidecar still ships text + emotion
with audio_url: null. The firmware sees the null, skips the fetch
step, and the reply appears on screen with no audio. Watch the
sidecar logs for TTSError(stage="...") to diagnose.
For a self-hosted Japanese TTS engine the firmware drives directly (no sidecar in the path), see VoiceVox.
Enable the wake word (optional)
Wake-word detection is local-only — microWakeWord v2 streaming inference via TFLite Micro + ESP-NN on the ESP32-S3. Two pieces both have to be in place or the wake task parks at boot.
1. Drop a model on the SD card
Place a TFLite model at /sd/WAKE_WORD.tflite. The firmware reads
it once at boot and never re-reads — config changes via
PUT /settings take effect on the next reboot.
Compatible models: any int8-quantized streaming microWakeWord v2
.tflite from the microWakeWord
library — hey_jarvis,
okay_nabu, alexa, custom-trained, and so on. The 20-operator
set the firmware registers matches what ESPHome’s streaming_model.cpp
exposes, so any model that loads under ESPHome should load here.
To train a model for your own wake phrase, see
tools/kws-trainer/: kws-record
captures samples over UDP from the firmware, kws-build-dataset
turns a labelled directory into a train/val/test manifest, and
kws-eval runs a trained .tflite against a WAV to tune
wake_word_threshold without reflashing.
A unit booted without /sd/WAKE_WORD.tflite still works — the wake
task parks; the sidecar still fires on POST /listen. The model is
only required for the hands-free path.
2. Flip the config flag
behavior: (
    wake_word_enabled: true,
    wake_word_threshold: 100, // int8 cut-point; 100 ≈ 0.89 confidence
    wake_word_arena_kib: 64, // TFLM tensor arena, in KiB
    // ... other behavior flags ...
)
- wake_word_enabled falls back to false if the field is missing or the model isn’t found. Both have to be true for the task to wire up.
- wake_word_threshold is the int8 score above which the firmware treats the model output as a positive detection. Lower values (≈80) increase sensitivity at the cost of false positives; higher values (≈115) tighten it. Tune per model. kws-eval reports the model’s dequantized output (float in [0.0, 1.0]) while the firmware compares against the raw int8. Convert by inverting the model’s output quantization: wake_word_threshold = round(float_score / output_scale) + output_zero_point. For the canonical microWakeWord output quant (scale=1/255, zero_point=-128), a kws-eval peak of 0.89 maps to ~99; 0.95 maps to ~114. Check your model’s exact quant via the TFLite Python API if the canonical values don’t apply.
- wake_word_arena_kib is the size of the TFLM tensor arena in KiB. The microWakeWord v2 family declares 17–26 KiB nominal; 64 leaves headroom for the TFLM planner’s scratch space. Bump if a custom .tflite fails allocate_tensors at boot (the firmware logs a warning and parks).
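The threshold conversion is easy to get wrong by hand, so here is the same formula as a small Python sketch. The function names are hypothetical; the defaults are the canonical quantization values quoted above, and the clamp to the int8 range is an added safety assumption.

```python
def float_to_int8_threshold(score: float,
                            scale: float = 1 / 255,
                            zero_point: int = -128) -> int:
    """Map a dequantized kws-eval score to a raw int8 wake_word_threshold."""
    raw = round(score / scale) + zero_point
    return max(-128, min(127, raw))  # keep the result inside int8


def int8_to_float(threshold: int,
                  scale: float = 1 / 255,
                  zero_point: int = -128) -> float:
    """Inverse mapping: the confidence a raw int8 threshold corresponds to."""
    return (threshold - zero_point) * scale
```

With the canonical values, float_to_int8_threshold(0.89) gives 99 and float_to_int8_threshold(0.95) gives 114, matching the worked figures in the list. Pass your model's own scale and zero_point if they differ.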
A successful boot logs something like:
INFO wake-word: interpreter ready (1 inputs, 1 outputs, arena=65536 B, threshold=100)
When the model fires, the log line is wake-word: fired (score=…).
Verify
Without the wake word
Smoke-test the sidecar leg first — it removes a moving part.
$HTTP_TOKEN below is the device-level bearer token configured as
auth.token in STACKCHAN.RON (distinct from agent_sidecar_token,
which the firmware presents to the sidecar). Drop the
Authorization header entirely if auth.token is empty.
curl -H "Authorization: Bearer $HTTP_TOKEN" \
-H "Content-Type: application/json" \
-d '{"duration_ms": 3000}' \
http://stackchan.local/listen
The avatar should chirp, show the ear decorator, capture 3 s of audio, then post it to the sidecar. The reply text lands in the toast band; if the reply tagged an emotion, the face flips for ~2.5 seconds before returning to autonomous behavior.
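For scripted smoke tests, the same request can be built from Python's standard library. This is a sketch equivalent to the curl call above; build_listen_request is a hypothetical helper, and the base URL is whatever address the device answers on.

```python
import json
import urllib.request


def build_listen_request(base_url: str, token: str = "",
                         duration_ms: int = 3000) -> urllib.request.Request:
    """Build the POST /listen request; omit Authorization when token is empty."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({"duration_ms": duration_ms}).encode("utf-8")
    return urllib.request.Request(f"{base_url}/listen", data=body,
                                  headers=headers, method="POST")
```

Pass the result to urllib.request.urlopen with a timeout of your choosing to actually send it. Building the request separately makes the header logic easy to check without a device on the network.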
With the wake word
Reboot the unit, watch the defmt monitor for wake-word: interpreter
ready, then speak the wake phrase. The path through the firmware is
identical to the route-driven test above — the wake-word detector
fires the same RemoteCommand::StartListen the HTTP route does.
The cooldown after a fire is 5 seconds (capture + 1 s padding), so back-to-back triggers within that window are suppressed silently.
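The suppression window can be modelled as a simple gate. The class below is an illustrative Python sketch, not the firmware's implementation; the 4 000 ms capture and 1 000 ms padding come from the figures on this page, while the exact boundary behavior (whether a trigger at precisely 5 s passes) is an assumption.

```python
from typing import Optional


class WakeCooldown:
    def __init__(self, capture_ms: int = 4000, padding_ms: int = 1000) -> None:
        self.window_ms = capture_ms + padding_ms
        self._last_fire_ms: Optional[int] = None

    def try_fire(self, now_ms: int) -> bool:
        """Return True and arm the cooldown, or False if still inside it."""
        if (self._last_fire_ms is not None
                and now_ms - self._last_fire_ms < self.window_ms):
            return False  # suppressed silently
        self._last_fire_ms = now_ms
        return True
```

A suppressed trigger does not restart the window in this sketch, so repeated wake phrases cannot hold the detector off indefinitely.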
Troubleshooting
| Symptom | First thing to check |
|---|---|
| wake-word: disabled (enabled=…, model=…); idle at boot | Either the config flag is false or /sd/WAKE_WORD.tflite wasn’t found. |
| wake-word: allocate_tensors failed at boot | Bump wake_word_arena_kib; the model needs more arena than the default. |
| wake-word: Interpreter::new failed (bad model or schema mismatch) | The .tflite is malformed or uses operators outside the registered set. Try a stock microWakeWord library model first. |
| Wake fires constantly on ambient noise | Raise wake_word_threshold (try 110 → 120 in steps). |
| Wake never fires on real utterances | Lower wake_word_threshold (try 90 → 80 in steps). |
| sidecar: link down toast | Wi-Fi disconnected between the wake fire and the POST. Check the dashboard’s network status. |
| sidecar: post failed toast | Sidecar unreachable, returned non-2xx, replied with malformed JSON, or the JSON body has no text field. Try curling the sidecar’s /healthz directly and confirm the response includes a top-level "text": "..." key. |
| sidecar: timed out toast | Sidecar took longer than 15 s. Almost always means the LLM call upstream is slow; the firmware bound is fixed. |
| Avatar visibly listens but the reply never renders | Sidecar returned a 2xx but the body has no text field. The firmware drops the exchange and surfaces sidecar: post failed. |
| Audio capture sounds wrong / silent | Use the UDP audio debug stream to verify the microphone path independently of the sidecar. |
The firmware logs every fire and every failure via defmt::info! /
defmt::warn! over the USB-Serial-JTAG monitor. When something
breaks, running just reattach is the fastest way to see what the
device saw without rebooting it.
Security posture
Both legs are LAN-only and operator-trusted:
- The wake model runs entirely on-device; no audio leaves the unit until the wake fires (or POST /listen does).
- The sidecar POST is plain HTTP (no TLS) with optional bearer-token auth. Safe on a trusted LAN; never expose the sidecar to the open internet without a reverse proxy that terminates TLS and re-issues the bearer.
- The captured PCM does not stream — the firmware uploads it in one request at the end of the capture window.
Per-tenant or per-session policy belongs in the sidecar (where the LLM and STT live). The firmware only ferries audio and renders the reply.