VoiceVox TTS

VoiceVox is a self-hostable Japanese TTS engine. The firmware can drive it directly over HTTP to turn dynamic speech text into spoken audio, separate from the sidecar path (which owns STT + LLM). API-compatible engines such as AivisSpeech speak the same wire format and work the same way.

This page covers running the engine and pointing the firmware at it. The wire format (URL builders, the audio-query rate override, and the WAV decoder) lives in stackchan-tts::voicevox; the firmware synthesis task that fetches and enqueues audio lives in stackchan_firmware::voicevox.

When it speaks

A configured engine voices the sidecar reply only when the sidecar shipped text but no audio of its own: "audio_url" is null or absent because synthesis failed or was skipped sidecar-side. In that case the reply text is handed to the synthesis task, which runs the two-step round-trip (audio query, then synthesis) and enqueues the resulting WAV on the audio TX queue at normal priority, the same rank as emotion chirps. When the sidecar does return an audio_url, that audio plays instead and VoiceVox stays quiet; the sidecar’s own TTS wins. Back-to-back replies coalesce: only the newest pending text is synthesised.
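The branch and the coalescing rule can be sketched in a few lines of Rust. This is a minimal illustration, not the firmware's code: SidecarReply, Playback, route_reply, and PendingSpeech are hypothetical names, and staying silent when no engine is configured is an assumption drawn from the "empty URL disables the task" rule.

```rust
/// Hypothetical shape of a sidecar reply; the real firmware types may differ.
pub struct SidecarReply {
    pub text: String,
    /// `None` models both `"audio_url": null` and an absent field.
    pub audio_url: Option<String>,
}

#[derive(Debug, PartialEq)]
pub enum Playback {
    /// The sidecar shipped its own audio, so VoiceVox stays quiet.
    SidecarAudio(String),
    /// No sidecar audio: hand the text to the VoiceVox synthesis task.
    Synthesise(String),
    /// No sidecar audio and no engine configured (assumed behaviour).
    Silent,
}

pub fn route_reply(reply: SidecarReply, voicevox_url: &str) -> Playback {
    match reply.audio_url {
        Some(url) => Playback::SidecarAudio(url),
        None if !voicevox_url.is_empty() => Playback::Synthesise(reply.text),
        None => Playback::Silent,
    }
}

/// Latest-wins slot: a newer text overwrites one that has not been taken yet.
#[derive(Default)]
pub struct PendingSpeech {
    slot: Option<String>,
}

impl PendingSpeech {
    pub fn submit(&mut self, text: String) {
        self.slot = Some(text);
    }

    /// Called by the synthesis task when it is ready for the next utterance.
    pub fn take(&mut self) -> Option<String> {
        self.slot.take()
    }
}
```

A single-slot design like this keeps memory bounded and means a slow engine never builds a backlog of stale replies.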

Run the engine

The upstream Docker image listens on port 50021:

docker run --rm -p 50021:50021 voicevox/voicevox_engine:cpu-latest

Confirm it answers on the LAN IP the avatar will reach (not localhost — the firmware connects from another host):

curl -s http://192.168.1.50:50021/speakers | head -c 200

Point the firmware at it

Set behavior.voicevox_url (and optionally behavior.voicevox_speaker_id) in STACKCHAN.RON:

behavior: (
    voicevox_url: "http://192.168.1.50:50021",
    voicevox_speaker_id: 1,
    // ...other behavior flags...
)

An empty voicevox_url (the default) disables the synthesis task. Both fields also round-trip through PUT /settings; see behavior for the full field table and reboot semantics.

Hostnames are not resolved — use a raw IPv4 literal, same as agent_sidecar_url. DNS support is a future extension.
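A quick way to catch a hostname before flashing a config is to check that the URL's host parses as an IPv4 literal. The helper below is a hypothetical sketch using only the standard library; has_ipv4_literal_host is not a firmware function, and rejecting anything other than http:// is an assumption based on the plain-HTTP transport described on this page.

```rust
use std::net::Ipv4Addr;

/// Returns true when `url` has the shape `http://<ipv4>[:port][/path]`.
/// Hypothetical helper: the firmware's own parsing is not shown on this page.
pub fn has_ipv4_literal_host(url: &str) -> bool {
    // Plain HTTP only; anything else (including https) is rejected here.
    let rest = match url.strip_prefix("http://") {
        Some(rest) => rest,
        None => return false,
    };
    // The authority is everything before the first '/' (or the whole rest).
    let authority = rest.split('/').next().unwrap_or("");
    // Split off an optional `:port` suffix.
    let (host, port) = match authority.rsplit_once(':') {
        Some((host, port)) => (host, Some(port)),
        None => (authority, None),
    };
    let port_ok = port.map_or(true, |p| p.parse::<u16>().is_ok());
    // Parsing as `Ipv4Addr` accepts only dotted-quad literals, never names.
    port_ok && host.parse::<Ipv4Addr>().is_ok()
}
```

With this check, "http://192.168.1.50:50021" passes while "http://voicevox.local:50021" is flagged.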

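voicevox_speaker_id selects the voice. In the stock VOICEVOX catalogue, 1 is Zundamon “あまあま” and 3 is Zundamon “ノーマル” (the normal style); the engine’s GET /speakers lists the full catalogue with per-style IDs, so confirm the ID against the engine you actually run.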

Sample rate: 16 kHz, not 24 kHz

VoiceVox renders at 24 kHz by default, but the CoreS3 I²S output path plays a single fixed rate (16 kHz). The firmware therefore rewrites the outputSamplingRate field in the engine’s audio-query response to 16 000 before requesting synthesis, so no on-device resampler is needed. If the engine returns a different rate the WAV decoder rejects the response rather than playing it back at the wrong pitch.
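Both halves of that contract, the rate override and the rate check on the returned WAV, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in stackchan-tts::voicevox: the function names are hypothetical, the override is a text-level rewrite that assumes outputSamplingRate appears once as an integer field, and the WAV check reads only the fmt chunk.

```rust
/// Replaces the integer value of `outputSamplingRate` in an audio-query body.
/// Returns `None` when the field (or its numeric value) is not found.
pub fn override_sampling_rate(query_json: &str, rate_hz: u32) -> Option<String> {
    const KEY: &str = "\"outputSamplingRate\"";
    let key_end = query_json.find(KEY)? + KEY.len();
    // Skip optional whitespace, the colon, and more optional whitespace.
    let value = query_json[key_end..].trim_start().strip_prefix(':')?.trim_start();
    let digits = value.bytes().take_while(u8::is_ascii_digit).count();
    if digits == 0 {
        return None;
    }
    // `value` is a suffix of `query_json`, so its start offset is recoverable.
    let start = query_json.len() - value.len();
    Some(format!("{}{}{}", &query_json[..start], rate_hz, &query_json[start + digits..]))
}

/// Reads a little-endian u32 at `at`, or `None` if the buffer is too short.
fn le_u32(bytes: &[u8], at: usize) -> Option<u32> {
    let b = bytes.get(at..at.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Walks the RIFF chunks and returns the sample rate from the `fmt ` chunk.
pub fn wav_sample_rate(wav: &[u8]) -> Option<u32> {
    if wav.get(0..4)? != b"RIFF" || wav.get(8..12)? != b"WAVE" {
        return None;
    }
    let mut pos: usize = 12;
    loop {
        let id = wav.get(pos..pos.checked_add(4)?)?;
        let size = le_u32(wav, pos + 4)? as usize;
        if id == b"fmt " {
            // Chunk body starts at pos + 8: u16 format, u16 channels, u32 rate.
            return le_u32(wav, pos + 12);
        }
        // Skip this chunk; bodies are padded to an even number of bytes.
        pos = pos.checked_add(8)?.checked_add(size)?.checked_add(size & 1)?;
    }
}

/// Rejects audio whose rate differs from what the output path plays.
pub fn accept_wav(wav: &[u8], expected_hz: u32) -> Result<(), String> {
    match wav_sample_rate(wav) {
        Some(rate) if rate == expected_hz => Ok(()),
        Some(rate) => Err(format!("expected {} Hz, engine returned {} Hz", expected_hz, rate)),
        None => Err("not a RIFF/WAVE buffer with a fmt chunk".to_string()),
    }
}
```

Rejecting a mismatched rate outright is the cheaper choice on a microcontroller: playing 24 kHz samples through a 16 kHz path would sound slow and low-pitched, and resampling on-device costs CPU and memory.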

To reproduce the firmware’s two-step request by hand:

ENGINE=http://192.168.1.50:50021
# Step 1: audio query (prosody JSON), rewriting the output rate to 16 kHz.
# --get with --data-urlencode percent-encodes the text into the query string;
# -X POST keeps the method the engine expects.
curl -s -X POST "$ENGINE/audio_query?speaker=1" \
  --get --data-urlencode "text=こんにちは" \
  | sed 's/"outputSamplingRate": *[0-9]*/"outputSamplingRate":16000/' \
  > query.json
# Step 2: synthesis -> 16 kHz mono WAV.
curl -s -X POST "$ENGINE/synthesis?speaker=1" \
  -H 'Content-Type: application/json' --data @query.json -o out.wav
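The same two URLs can be built in Rust. The sketch below is a stand-in for the idea of the URL builders, not the helpers in stackchan-tts::voicevox: percent_encode, audio_query_url, and synthesis_url are hypothetical names, and the encoder escapes every byte outside the RFC 3986 unreserved set, which is stricter than a query string strictly requires.

```rust
/// Percent-encodes every byte except RFC 3986 unreserved characters.
pub fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len() * 3);
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            // Everything else, including each UTF-8 byte of Japanese text.
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Step 1 URL: POST with an empty body; the text travels in the query string.
pub fn audio_query_url(base: &str, speaker: u32, text: &str) -> String {
    format!(
        "{}/audio_query?speaker={}&text={}",
        base.trim_end_matches('/'),
        speaker,
        percent_encode(text)
    )
}

/// Step 2 URL: POST with the (rate-adjusted) audio-query JSON as the body.
pub fn synthesis_url(base: &str, speaker: u32) -> String {
    format!("{}/synthesis?speaker={}", base.trim_end_matches('/'), speaker)
}
```

Encoding at the byte level matters for Japanese input: each kana expands to three UTF-8 bytes, so the five-character greeting used above becomes fifteen %XX escapes.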

Security

Like the sidecar, the firmware speaks plain HTTP on the LAN — no TLS, no auth. VoiceVox upstream has no authentication of its own, so keep the engine on a trusted segment and don’t expose port 50021 to the internet.

Related

voice: the end-to-end wake-word + sidecar onboarding flow.
behavior: the voicevox_url / voicevox_speaker_id config fields and reboot semantics.
stackchan-tts: the SpeechBackend surface and the VoiceVox wire-format helpers.

