Plugin OpenClaw Ecosystem

Voice Control

Natural voice conversations with your AI assistant, from anywhere, with zero paid speech APIs

Overview

Voice Control turns your OpenClaw setup into a hands-free AI assistant. It is built on WebRTC and combines MLX-Whisper for local speech recognition, Edge-TTS for free text-to-speech, and self-hosted LiveKit for real-time audio. Generate a one-time link, open it on your phone, and start talking.

Paid STT/TTS APIs: 0
Link Expiry: 1h
Optimized For: Apple Silicon
Audio Transport: WebRTC

Capabilities

Key Features

Every component runs locally or uses a free service, so speech incurs no ongoing API costs.

Local STT via MLX-Whisper

Speech recognition runs entirely on your Apple Silicon Mac using MLX-Whisper with the large-v3-mlx-4bit model. No cloud API, no usage fees, no audio leaves your machine.

Free TTS via Edge-TTS

Text-to-speech is powered by Edge-TTS, which uses Microsoft Edge's free online speech service: high-quality natural voices with no subscription or per-character billing.

WebRTC via LiveKit

Real-time audio streaming over WebRTC, self-hosted with LiveKit. Low-latency duplex audio that works from any modern browser, including iOS Safari.

Works on iPhone

Call your AI from anywhere using your iPhone. Tailscale provides a trusted HTTPS endpoint so iOS Safari connects without certificate warnings.

Tool Calling by Voice

Ask Claude to read files, run shell commands, search your memory store, or list active sessions, all triggered naturally through conversation.

Tailscale Remote Access

Accessible from anywhere on your Tailscale network. One-time call links expire after 1 hour, so every session starts fresh and secure.

Architecture

How It Works

A self-hosted audio pipeline, from your mic to Claude's voice: your speech is transcribed on your own machine, and only text goes out to Claude and Edge-TTS.

Every call goes through the same six-step pipeline. Speech is detected by Silero VAD, transcribed on-device by MLX-Whisper, processed by Claude, then converted back to audio by Edge-TTS, all in real time over WebRTC.

1. Mic / iPhone: audio captured via WebRTC
2. VAD: Silero detects speech boundaries
3. STT: MLX-Whisper transcribes locally
4. Claude: LLM generates response + tool calls
5. Edge-TTS: response synthesized to audio
6. Speaker: audio streamed back via LiveKit
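The six steps above can be sketched as a chain of small stages that each fill in one field of a shared turn record. This is only an illustration of the ordering, not the plugin's code: the `Turn` class, the stage functions, and `run_turn` are hypothetical names, and every stage is a stand-in (zero bytes play the role of silence, decoding bytes plays the role of transcription, and the "LLM" simply echoes).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    """State carried through one spoken exchange."""
    mic_audio: bytes          # step 1: raw audio from the WebRTC track
    speech: bytes = b""       # step 2: audio with silence trimmed
    transcript: str = ""      # step 3: text of what was said
    reply: str = ""           # step 4: the model's answer
    reply_audio: bytes = b""  # step 5: synthesized answer


def vad(turn: Turn) -> Turn:
    # Stand-in for Silero VAD: treat zero bytes at the edges as silence.
    turn.speech = turn.mic_audio.strip(b"\x00")
    return turn


def stt(turn: Turn) -> Turn:
    # Stand-in for MLX-Whisper: pretend the audio bytes are UTF-8 text.
    turn.transcript = turn.speech.decode("utf-8")
    return turn


def llm(turn: Turn) -> Turn:
    # Stand-in for Claude: echo the transcript back.
    turn.reply = f"You said: {turn.transcript}"
    return turn


def tts(turn: Turn) -> Turn:
    # Stand-in for Edge-TTS: pretend the reply text is audio.
    turn.reply_audio = turn.reply.encode("utf-8")
    return turn


# Order matters: each stage reads the field the previous one wrote.
PIPELINE: List[Callable[[Turn], Turn]] = [vad, stt, llm, tts]


def run_turn(mic_audio: bytes) -> bytes:
    """Run one turn and return the audio to stream back (step 6)."""
    turn = Turn(mic_audio=mic_audio)
    for stage in PIPELINE:
        turn = stage(turn)
    return turn.reply_audio
```

Keeping each stage behind the same signature means a stand-in can be swapped for the real component without touching the loop.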

Getting on a call takes seconds. A single script generates a signed JWT with a fresh room name, bundles it into a Tailscale HTTPS URL, and prints a ready-to-open link.

One-Time Link Flow

1. Run ./call.sh: generates a signed JWT + unique room
2. Link delivered via Tailscale DNS (trusted Let's Encrypt cert)
3. Open on iPhone or browser: WebRTC handshake via token server
4. Audio streams through LiveKit to the voice agent
5. Link expires after 1 hour; next call, fresh link
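As a rough idea of what the link-generation step might involve, here is a minimal sketch that signs an HS256 JWT with a one-hour expiry and a random room name, using only the Python standard library. The contents of `call.sh` are not shown on this page, so everything here is an assumption: the function names, the `token` query parameter, the host name in the usage note, and the claim layout (modeled on the `video` grant that LiveKit access tokens are understood to use).

```python
import base64
import hashlib
import hmac
import json
import secrets
import time


def _b64url(data: bytes) -> str:
    """Base64url without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(message: str, secret: str) -> bytes:
    return hmac.new(secret.encode(), message.encode("ascii"), hashlib.sha256).digest()


def mint_call_link(api_key, api_secret, base_url, ttl_seconds=3600, now=None):
    """Return (url, room) for a fresh room whose token expires after ttl_seconds."""
    now = int(time.time()) if now is None else now
    room = f"call-{secrets.token_hex(8)}"  # unique room per link
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,           # assumed: issuer is the API key
        "sub": "caller",          # hypothetical participant identity
        "nbf": now,
        "exp": now + ttl_seconds,  # 1 hour by default
        "video": {"room": room, "roomJoin": True},  # assumed grant layout
    }
    body = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    token = f"{body}.{_b64url(_sign(body, api_secret))}"
    return f"{base_url}/?token={token}", room


def verify_token(token, api_secret, now=None):
    """Check signature and expiry; return the claims or raise ValueError."""
    now = int(time.time()) if now is None else now
    head, payload, signature = token.split(".")
    expected = _sign(f"{head}.{payload}", api_secret)
    if not hmac.compare_digest(expected, _b64url_decode(signature)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload))
    if now >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

For example, `mint_call_link("my-key", "my-secret", "https://mac-mini.example.ts.net")` would return a URL and its room name (the host is a placeholder). Note that this sketch enforces only the expiry; making a link strictly single-use would need server-side state, which is not modeled here.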

STT: MLX-Whisper (local, Apple Silicon)
TTS: Edge-TTS (free, Microsoft)
Transport: LiveKit (self-hosted WebRTC)
Access: Tailscale (zero-config VPN)
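For readers curious how the STT and TTS pieces listed above are typically called from Python, here is a small sketch using the `mlx_whisper` and `edge_tts` packages. It is not taken from the plugin: the function names are hypothetical, the Hugging Face repository name for the 4-bit model and the voice name are assumptions to verify before use, and the third-party imports are deferred so the module loads even where those packages are not installed.

```python
# Assumed repository name for the large-v3 4-bit MLX model; check that it exists.
WHISPER_REPO = "mlx-community/whisper-large-v3-mlx-4bit"

# Assumed voice; `edge-tts --list-voices` prints the available ones.
TTS_VOICE = "en-US-AriaNeural"


def transcribe(audio_path: str) -> str:
    """Transcribe an audio file locally with MLX-Whisper (Apple Silicon only)."""
    import mlx_whisper  # deferred: requires the mlx-whisper package

    result = mlx_whisper.transcribe(audio_path, path_or_hf_repo=WHISPER_REPO)
    return result["text"].strip()


async def speak(text: str, out_path: str) -> None:
    """Synthesize text to an audio file through the Edge-TTS online service."""
    import edge_tts  # deferred: requires the edge-tts package

    await edge_tts.Communicate(text, TTS_VOICE).save(out_path)
```

A one-off round trip could then look like `asyncio.run(speak(transcribe("question.wav"), "reply.mp3"))`, where both file names are placeholders. A live agent would stream audio frames rather than read and write files, so treat this as a way to try the two libraries, not as the call path.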

Coverage

What You Can Do by Voice

The voice agent has full access to OpenClaw tools: anything you can type, you can now say.

Ask Questions

Chat naturally with Claude: ask anything, brainstorm ideas, or get quick answers hands-free.

Run Commands

Execute shell commands on your Mac mini by voice, no keyboard needed.

Search Memory

Query your OpenClaw memory store out loud and get spoken answers back instantly.

Read Files

Ask Claude to read any file on your system and summarize or explain its contents.

List Sessions

Find out what agents are active, what tasks are running, or what sessions exist. Just ask.

Control Agents

Steer sub-agents, check task status, and orchestrate your OpenClaw setup from anywhere.
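Behind capabilities like these, a voice agent usually maps the model's tool calls onto ordinary functions. The sketch below shows one plausible shape for that dispatch step. It is an assumption throughout: the tool names, the `{"name": ..., "arguments": ...}` call format, and the `MAX_SPOKEN_CHARS` cap are hypothetical and are not OpenClaw's actual interface.

```python
import shlex
import subprocess
from pathlib import Path
from typing import Callable, Dict

# Hypothetical cap so a long file or command output is not read aloud in full.
MAX_SPOKEN_CHARS = 500


def read_file(path: str) -> str:
    """Return the text of a file."""
    return Path(path).read_text(encoding="utf-8")


def run_command(command: str) -> str:
    """Run a command without a shell and return its output."""
    done = subprocess.run(
        shlex.split(command), capture_output=True, text=True, timeout=10
    )
    return done.stdout or done.stderr


# Registry of the tools the model is allowed to call by name.
TOOLS: Dict[str, Callable[..., str]] = {
    "read_file": read_file,
    "run_command": run_command,
}


def dispatch(call: dict) -> str:
    """Execute one tool call and return text short enough to speak."""
    tool = TOOLS.get(call["name"])
    if tool is None:
        return f"Unknown tool: {call['name']}"
    output = tool(**call.get("arguments", {}))
    return output[:MAX_SPOKEN_CHARS]
```

For instance, `dispatch({"name": "read_file", "arguments": {"path": "notes.txt"}})` would return up to the first 500 characters of that file. Since spoken commands can be misheard, a real setup would likely want a confirmation step before running anything destructive.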

Coming Soon

Join the Waitlist

Voice Control is in private early access. Get notified when it's available.
