Push-to-talk, WebSocket STT, multi-backend audio capture, and resilience patterns in Claude Code.
Claude Code ships a push-to-talk voice input pipeline that converts speech into prompt text. Hold a key, speak, release — the transcript lands in the prompt input exactly where your cursor was. The system is gated behind two independent guards: a GrowthBook feature flag (VOICE_MODE) and an Anthropic OAuth token. Neither alone is sufficient.
┌──────────────┐ keypress ┌───────────────────┐ PCM chunks ┌──────────────────────┐
│ useVoiceInt- │ ──────────► │ useVoice.ts │ ────────────► │ voiceStreamSTT.ts │
│ egration.tsx │ │ (hold detection) │ │ (WebSocket STT) │
└──────────────┘ └────────┬──────────┘ └──────────┬───────────┘
│ startRecording() │ TranscriptText
┌────────▼──────────┐ ┌──────────▼───────────┐
│ voice.ts │ │ Anthropic voice_stream│
│ (audio backend) │ │ /api/ws/speech_to_ │
│ NAPI / arecord │ │ text/voice_stream │
│ / SoX │ └──────────────────────┘
└───────────────────┘
The five key files:

- voice/voiceModeEnabled.ts — auth + kill-switch checks
- services/voice.ts — audio recording backends (NAPI, arecord, SoX)
- services/voiceStreamSTT.ts — WebSocket client for Anthropic's STT endpoint
- hooks/useVoice.ts — React hook wiring audio → WS → transcript
- hooks/useVoiceIntegration.tsx — prompt-input integration, hold-threshold, interim rendering

Three functions form a layered gate:
export function isVoiceGrowthBookEnabled(): boolean {
// feature('VOICE_MODE') is a compile-time constant (Bun bundler)
// Dead code is eliminated in non-ANT builds — keeps binary size down.
return feature('VOICE_MODE')
? !getFeatureValue_CACHED_MAY_BE_STALE('tengu_amber_quartz_disabled', false)
: false
}
The GrowthBook flag tengu_amber_quartz_disabled defaults to false (not disabled). A missing or stale disk cache reads as "not killed" — so fresh installs work immediately. Flipping the flag to true emergency-disables voice for all users within GrowthBook's next cache refresh cycle.
export function hasVoiceAuth(): boolean {
if (!isAnthropicAuthEnabled()) return false
const tokens = getClaudeAIOAuthTokens()
return Boolean(tokens?.accessToken)
}
getClaudeAIOAuthTokens() spawns security on macOS (~20-50ms cold, cached afterward). The memoize clears on token refresh (~once per hour), so one cold spawn per session is expected. API keys, Bedrock, Vertex, and Foundry all return false here — voice_stream is only available via Claude.ai OAuth.
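The caching behavior is worth making concrete. A minimal sketch of the pattern, assuming a readTokensFromKeychain helper and a refresh hook (both hypothetical names, not the actual implementation):

type OAuthTokens = { accessToken?: string } | null

// Stand-in for the real keychain read (spawns `security` on macOS)
declare function readTokensFromKeychain(): OAuthTokens

let cached: OAuthTokens | undefined // undefined = not read yet; null = read, no tokens

export function getClaudeAIOAuthTokensSketch(): OAuthTokens {
  if (cached === undefined) cached = readTokensFromKeychain() // one cold spawn
  return cached
}

// Called on the OAuth refresh path (~once per hour)
export function onTokenRefresh(): void {
  cached = undefined // next call re-reads the keychain
}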
Enabling voice runs five sequential checks before writing voiceEnabled: true to settings:
The flow bails out immediately in remote environments (CLAUDE_CODE_REMOTE, Homespace), then probes the audio backend chain. Microphone permission is requested up front:
// Fires TCC dialog early — better than a surprise on first hold-to-talk
if (!(await requestMicrophonePermission())) {
const guidance = process.platform === 'darwin'
? 'System Settings → Privacy & Security → Microphone'
: 'your system\'s audio settings'
return { type: 'text', value: `Microphone access denied. Go to ${guidance}.` }
}
The recording layer presents a single startRecording(onData, onEnd, options?) interface over three possible backends, tried in priority order:
| Backend | Platforms | Notes | Status |
|---|---|---|---|
| audio-capture-napi | macOS, Linux (ALSA cards present), Windows | In-process via cpal + CoreAudio/AudioUnit. dlopen blocks ~1s warm, ~8s cold on macOS — loaded lazily on first voice keypress only. | Primary |
| arecord | Linux only | ALSA userspace utility. Probed via 150ms race: if still alive = device opened OK. Handles WSL2+WSLg (Win11) via PulseAudio RDP pipes; fails on WSL1 / Win10-WSL2. | Fallback 1 |
| SoX (rec) | macOS, Linux | External process piping raw PCM. Requires --buffer 1024 to prevent internal buffering delay. Has built-in silence detection; arecord does not. | Fallback 2 |
| Windows (no native) | Windows only | No subprocess fallback on Windows. Native module required. | No fallback |
On WSL1 and headless Linux, arecord is installed but open() fails immediately because there is no ALSA card and no PulseAudio server. hasCommand('arecord') returns true in all these cases. The probe works by spawning arecord with the same arguments as the real recording session and racing a 150ms timer:
// Spawn with the real recording args, then race a 150 ms timer
const child = spawn('arecord', args)
let stderr = ''
child.stderr.on('data', d => { stderr += String(d) })
const probe = await new Promise<{ ok: boolean; stderr: string }>(resolve => {
  const timer = setTimeout(() => {
    child.kill('SIGTERM')
    resolve({ ok: true, stderr: '' }) // still alive at 150 ms = device opened OK
  }, 150)
  child.once('close', code => {
    clearTimeout(timer)
    resolve({ ok: code === 0, stderr: stderr.trim() }) // exited early = open() failed
  })
})
The result is memoized — audio device availability does not change mid-session, and this runs on every voice keypress via checkRecordingAvailability().
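The memoization itself can be as small as caching the promise. A sketch, where probeArecordOnce is a hypothetical wrapper around the race above:

declare function probeArecordOnce(): Promise<{ ok: boolean; stderr: string }>

let arecordProbe: Promise<{ ok: boolean; stderr: string }> | undefined

function checkArecordMemoized() {
  // Caching the promise (not the result) means concurrent keypresses
  // during a cold probe share the same in-flight race.
  arecordProbe ??= probeArecordOnce()
  return arecordProbe
}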
const args = [
'-q', // quiet: no progress output
'--buffer', '1024', // flush in small chunks; without this SoX buffers seconds
'-t', 'raw', // raw PCM, no WAV header
'-r', '16000', // 16kHz sample rate (matches STT endpoint requirement)
'-e', 'signed', // signed PCM
'-b', '16', // 16-bit depth
'-c', '1', // mono
'-', // write to stdout
]
// Silence detection (only when NOT in push-to-talk mode)
if (useSilenceDetection) {
args.push('silence', '1', '0.1', '3%', '1', '2.0', '3%')
// ↑ stop after 2 seconds of audio below 3% threshold
}
Push-to-talk passes { silenceDetection: false } — the user controls start and stop. The native NAPI module also ignores its built-in onEnd in push-to-talk mode.
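The shared surface can be typed roughly like this. A sketch: the silenceDetection option and the PCM format come from the text, while the stop-function return shape is an assumption:

type RecordingOptions = {
  silenceDetection?: boolean // push-to-talk passes false: the user controls stop
}

// One signature adapted over all three backends (NAPI, arecord, SoX).
type StartRecording = (
  onData: (pcm: Buffer) => void, // 16 kHz, 16-bit signed, mono PCM chunks
  onEnd: () => void,             // backend-initiated stop (e.g. SoX silence)
  options?: RecordingOptions,
) => Promise<() => void>         // resolves to a stop() function (assumed)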
The STT client connects to wss://api.anthropic.com/api/ws/speech_to_text/voice_stream. It uses the same OAuth Bearer token as the rest of Claude Code. The URL includes query parameters that configure the STT session:
const params = new URLSearchParams({
encoding: 'linear16', // 16-bit signed PCM
sample_rate: '16000',
channels: '1', // mono
endpointing_ms: '300', // endpoint detection window
  utterance_end_ms: '1000',
language: options?.language ?? 'en',
})
The client's send and receive handling, in brief:

- Each outgoing PCM chunk is copied with Buffer.from() to prevent NAPI shared-ArrayBuffer races.
- A setTimeout(0) hop flushes any queued NAPI callbacks first.
- Interim transcripts arrive with isFinal=false.
- Endpoint detection flips TranscriptText to isFinal=true. After CloseStream, this resolves finalize() fast (~300 ms).
- Failures surface via onError.

The claude.ai Cloudflare zone uses TLS fingerprinting and blocks non-browser clients (JA3 fingerprint mismatch). The api.anthropic.com listener exposes the same private-api pod with the same OAuth auth but is on a CF zone that does not enforce browser-class TLS fingerprinting. Desktop dictation still uses claude.ai because Swift's URLSession has a browser-class JA3 fingerprint and passes the challenge.
// Override via env var for testing/staging
const wsBaseUrl =
process.env.VOICE_STREAM_BASE_URL ||
getOauthConfig().BASE_API_URL
.replace('https://', 'wss://')
.replace('http://', 'ws://')
finalize() returns a Promise<FinalizeSource> that resolves via whichever of four paths fires first:
| Source | Condition | Typical latency |
|---|---|---|
| post_closestream_endpoint | TranscriptEndpoint arrives after CloseStream was sent | ~300 ms |
| no_data_timeout | No TranscriptText arrived after CloseStream (1.5 s) | 1.5 s |
| ws_close | WebSocket close event fires | 3–5 s |
| safety_timeout | Last-resort cap | 5 s |
The no_data_timeout path is the silent-drop signature — if it fires with hadAudioSignal=true, the session hit a known server bug (sticky CE pod returning zero transcripts, ~1% of sessions).
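The four-way resolution reads naturally as a Promise.race. A sketch under the stated timings (helper names and the exact gating are illustrative, not the actual implementation):

type FinalizeSource =
  | 'post_closestream_endpoint'
  | 'no_data_timeout'
  | 'ws_close'
  | 'safety_timeout'

const after = (ms: number) => new Promise<void>(r => setTimeout(r, ms))
const never = new Promise<never>(() => {})

function raceFinalize(
  endpointSeen: Promise<void>, // resolves on TranscriptEndpoint
  closed: Promise<void>,       // resolves on the WS close event
  sawText: () => boolean,      // any TranscriptText since CloseStream?
): Promise<FinalizeSource> {
  return Promise.race<FinalizeSource>([
    endpointSeen.then(() => 'post_closestream_endpoint' as const), // ~300 ms
    after(1_500).then(() => (sawText() ? never : ('no_data_timeout' as const))),
    closed.then(() => 'ws_close' as const),                        // 3-5 s
    after(5_000).then(() => 'safety_timeout' as const),            // last resort
  ])
}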
Terminal key events arrive as a stream: one event on initial press, then auto-repeat events every 30–80 ms while held. There is no "keyup" event in a terminal. The system reconstructs the hold state by timing gaps between events.
// In useVoiceIntegration.tsx
const RAPID_KEY_GAP_MS = 120 // auto-repeat fires every 30-80ms; 120ms covers jitter
const HOLD_THRESHOLD = 5 // 5 rapid presses required before activating voice
const WARMUP_THRESHOLD = 2 // show "keep holding…" feedback at press #2
Modifier-combo bindings (e.g., Ctrl+Space) activate on the first press — no warmup required, because a modifier combo is unambiguously intentional.
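A sketch of how those thresholds might drive the press counter, using the constants defined above (state handling simplified from the real hook):

let pressCount = 0
let lastPressAt = 0

// Returns what the integration layer should do with this Space event.
function classifySpacePress(now: number): 'ignore' | 'warmup' | 'activate' {
  // A gap wider than RAPID_KEY_GAP_MS means a fresh press, not auto-repeat.
  pressCount = now - lastPressAt <= RAPID_KEY_GAP_MS ? pressCount + 1 : 1
  lastPressAt = now
  if (pressCount >= HOLD_THRESHOLD) return 'activate' // 5th rapid event: voice on
  if (pressCount >= WARMUP_THRESHOLD) return 'warmup' // event #2: "keep holding…"
  return 'ignore'
}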
While the user holds Space for warmup, some space characters leak into the text input before stopImmediatePropagation() takes effect (listener registration order is not guaranteed). stripTrailing() removes exactly the leakage count without touching pre-existing spaces at the boundary:
// Strip exactly `maxStrip` trailing `char` chars, leaving `floor` behind
const stripTrailing = (
  maxStrip: number,
  { char = ' ', anchor = false, floor = 0 } = {},
) => {
  // Also counts full-width spaces (U+3000) for CJK IME compatibility
  const scan = char === ' '
    ? normalizeFullWidthSpace(beforeCursor)
    : beforeCursor
  // ... (strip logic elided; computes `stripped`)
  if (anchor) {
    voicePrefixRef.current = stripped    // save text before cursor
    voiceSuffixRef.current = afterCursor // save text after cursor
  }
}
When anchor=true, the call also captures the cursor position for interim transcript injection. The gap space inserted between prefix and suffix ensures the waveform cursor sits on the gap rather than the first suffix letter.
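A sketch of how the anchored prefix/suffix and the gap space might compose the interim view (function name hypothetical; the refs are the ones captured above):

// prefix/suffix come from voicePrefixRef/voiceSuffixRef captured at anchor time
function composeInterim(prefix: string, interim: string, suffix: string) {
  const gap = suffix.length > 0 ? ' ' : '' // waveform cursor sits on this gap
  return {
    text: prefix + interim + gap + suffix,
    cursor: prefix.length + interim.length, // lands on the gap, not suffix[0]
  }
}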
// In useVoice.ts
const RELEASE_TIMEOUT_MS = 200 // gap that signals key release (auto-repeat is 30-80ms)
const REPEAT_FALLBACK_MS = 600 // arm release timer if no auto-repeat seen yet
const FIRST_PRESS_FALLBACK_MS = 2000 // modifier combos: OS initial repeat delay up to ~2s
When no second keypress arrives within 600 ms, the fallback timer arms the release detection. For modifier combos, callers pass 2000 ms to cover the long OS initial repeat delay (macOS slider at "Long" = ~2 s before auto-repeat starts).
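Release detection reduces to a timer that every key event re-arms. A sketch using the constants above (the real hook also juggles the first-press fallbacks):

let releaseTimer: ReturnType<typeof setTimeout> | undefined

function onHeldKeyEvent(onRelease: () => void, timeoutMs = RELEASE_TIMEOUT_MS): void {
  if (releaseTimer) clearTimeout(releaseTimer)
  // Auto-repeat arrives every 30-80 ms; a 200 ms silence means the key is up.
  releaseTimer = setTimeout(onRelease, timeoutMs)
}

// On the first event, pass REPEAT_FALLBACK_MS (600) or FIRST_PRESS_FALLBACK_MS
// (2000, for modifier combos) since no auto-repeat has been observed yet.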
Each recording session moves through three states managed by useVoice: idle → recording → processing.
updateState('recording') is called synchronously before any await in startRecordingSession(). useVoiceIntegration reads voiceState from the store immediately after void startRecordingSession() to gate whether leaked space keypresses should be swallowed. If an await ran first, the guard would see stale 'idle' and let spaces leak.
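The constraint in miniature (ensureConnection is a hypothetical stand-in for the real async work):

declare function updateState(s: 'idle' | 'recording' | 'processing'): void
declare function ensureConnection(): Promise<void>

async function startRecordingSession(): Promise<void> {
  updateState('recording') // synchronous store write, before ANY await
  // useVoiceIntegration reads the store right after `void startRecordingSession()`;
  // it must already see 'recording' to swallow leaked auto-repeat spaces.
  await ensureConnection() // an await above the store write reopens the leak window
  // ...
}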
Approximately 1% of sessions hit a server-side bug (session-sticky CE pod that accepts audio but returns zero transcripts). The symptom: finalize() resolves via no_data_timeout despite real speech. The client detects this pattern and replays the full audio buffer on a fresh WebSocket once.
All six conditions must be true to trigger a replay:
- finalizeSource === 'no_data_timeout'
- hadAudioSignal === true (non-trivial mic signal detected)
- wsConnected === true (WS did open — backend received audio)
- !focusTriggered (not a focus-mode session)
- accumulatedRef.current.trim() === '' (no partial transcript accumulated)
- !silentDropRetriedRef.current (replay only once per session)

if (finalizeSource === 'no_data_timeout' && hadAudioSignal && wsConnected
&& !focusTriggered && focusFlushedChars === 0
&& accumulatedRef.current.trim() === ''
&& !silentDropRetriedRef.current
&& fullAudioRef.current.length > 0) {
silentDropRetriedRef.current = true
await sleep(250) // backoff to clear rapid-reconnect same-pod race
if (isStale()) return
  // Replay the full buffer on a fresh connection, batched into ~32 KB slices
  const SLICE = 32_000
  let slice: Buffer[] = [], sliceBytes = 0
  for (const chunk of fullAudioRef.current) {
    slice.push(chunk); sliceBytes += chunk.length
    if (sliceBytes >= SLICE) {
      conn.send(Buffer.concat(slice))
      slice = []; sliceBytes = 0
    }
  }
  if (slice.length > 0) conn.send(Buffer.concat(slice)) // trailing partial slice
  await conn.finalize()
}
The audio buffer is bounded: fullAudioRef.current skips buffering in focus mode (where sessions can last minutes and the buffer could reach ~20 MB). The 32 KB slice size batches small NAPI chunks into a reasonable WS frame size without exceeding WS message limits.
Focus mode is a "multi-clauding army" workflow: recording starts when the terminal window gains focus and stops when it loses focus. Transcript chunks are flushed immediately (rather than accumulated) so continuous dictation across long sessions stays responsive.
| Behavior | Hold-to-talk | Focus mode |
|---|---|---|
| Trigger | Key hold | Terminal focus gain |
| Stop trigger | Key release (gap > 200 ms) | Terminal focus lost |
| Transcript delivery | Accumulated, injected on stop | Each final flushed immediately; anchor advanced |
| Silence timeout | None | 5 s (FOCUS_SILENCE_TIMEOUT_MS) — tears down session to free WS |
| Silent-drop replay | Yes | No (gated on !focusTriggered) |
| Audio buffer | Full buffer kept for replay | Skipped (dead weight in long sessions) |
// Arms / resets the silence timer after each flushed transcript
function armFocusSilenceTimer(): void {
if (focusSilenceTimerRef.current) clearTimeout(focusSilenceTimerRef.current)
focusSilenceTimerRef.current = setTimeout(() => {
if (stateRef.current === 'recording' && focusTriggeredRef.current) {
silenceTimedOutRef.current = true
finishRecording() // tears down WS gracefully
}
}, FOCUS_SILENCE_TIMEOUT_MS) // 5000 ms
}
normalizeLanguageForSTT() maps the user's settings.language string (which could be "Japanese", "日本語", "ja-JP", etc.) to a BCP-47 code from a hardcoded allowlist that is a subset of the server's speech_to_text_voice_stream_config GrowthBook allowlist. Sending an unsupported code closes the WebSocket with code 1008 "Unsupported language".
// Falls back to 'en' with a fellBackFrom warning if language is unsupported
export function normalizeLanguageForSTT(language?: string): { code: string, fellBackFrom?: string } {
if (!language) return { code: 'en' }
const lower = language.toLowerCase().trim()
if (SUPPORTED_LANGUAGE_CODES.has(lower)) return { code: lower }
const fromName = LANGUAGE_NAME_TO_CODE[lower] // e.g. "japanese" → "ja"
if (fromName) return { code: fromName }
const base = lower.split('-')[0] // "ja-JP" → "ja"
if (SUPPORTED_LANGUAGE_CODES.has(base)) return { code: base }
return { code: 'en', fellBackFrom: language }
}
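Worked examples of the fallback chain, assuming 'ja' is on the supported-codes allowlist:

normalizeLanguageForSTT('ja')       // { code: 'ja' }  exact allowlist hit
normalizeLanguageForSTT('Japanese') // { code: 'ja' }  name-to-code lookup
normalizeLanguageForSTT('ja-JP')    // { code: 'ja' }  BCP-47 base-code fallback
normalizeLanguageForSTT('Swahili')  // { code: 'en', fellBackFrom: 'Swahili' }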
The getVoiceKeyterms() function builds a list of up to 50 domain-specific terms sent as keyterms query parameters. The STT backend applies boosting so that "MCP", "OAuth", "TypeScript", and project-specific vocabulary are correctly recognized.
Keyterms come from three sources, merged into a deduplicated Set:
- claude-code added as a whole term.
- feat/voice-keyterms → ["feat", "voice", "keyterms"].

export function splitIdentifier(name: string): string[] {
return name
.replace(/([a-z])([A-Z])/g, '$1 $2') // camelCase → camel Case
.split(/[-_./\s]+/) // split on separators
.filter(w => w.length > 2 && w.length <= 20) // discard noise
}
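A sketch of the merge and cap (the exact source list and merge rules are assumptions; the 50-term limit is from the text):

const MAX_KEYTERMS = 50

function buildKeyterms(sources: string[]): string[] {
  const terms = new Set<string>() // deduplicated
  for (const name of sources) {
    terms.add(name)                                           // whole term
    for (const part of splitIdentifier(name)) terms.add(part) // split parts
  }
  return [...terms].slice(0, MAX_KEYTERMS)
}

// Each term becomes a repeated query parameter on the WS URL:
// for (const t of buildKeyterms(sources)) params.append('keyterms', t)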
While recording, the prompt input shows a 16-bar waveform. Each new PCM chunk updates the rightmost bar by computing RMS amplitude from the raw 16-bit signed PCM buffer.
const AUDIO_LEVEL_BARS = 16
export function computeLevel(chunk: Buffer): number {
const samples = chunk.length >> 1 // 16-bit = 2 bytes per sample
if (samples === 0) return 0
let sumSq = 0
for (let i = 0; i < chunk.length - 1; i += 2) {
// Read 16-bit signed little-endian sample
const sample = ((chunk[i]! | (chunk[i+1]! << 8)) << 16) >> 16
sumSq += sample * sample
}
const rms = Math.sqrt(sumSq / samples)
const normalized = Math.min(rms / 2000, 1)
return Math.sqrt(normalized) // sqrt curve spreads quieter levels visually
}
sqrt(normalized) spreads quieter levels (0.0–0.5) across more bars, making the visualization responsive to normal conversational speech.
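For intuition (illustrative RMS values): background hiss around RMS 100 normalizes to 0.05 and renders at sqrt(0.05) ≈ 0.22 of full bar height, conversational speech around RMS 1000 renders at sqrt(0.5) ≈ 0.71, and anything at or above RMS 2000 clamps to a full bar.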
Voice state is stored in a custom Store<VoiceState> (not React state) held in a context. This enables useSyncExternalStore-based subscriptions that only re-render when the selected slice changes.
export type VoiceState = {
voiceState: 'idle' | 'recording' | 'processing'
voiceError: string | null
voiceInterimTranscript: string // live preview text shown in prompt
voiceAudioLevels: number[] // 16 bars, 0–1 normalized
voiceWarmingUp: boolean // show "keep holding…" hint
}
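A minimal sketch of the subscription pattern, assuming a Store<T> with getState/subscribe (the real store API may differ):

import { useSyncExternalStore } from 'react'

type Store<T> = {
  getState: () => T
  subscribe: (listener: () => void) => () => void // returns unsubscribe
}

// Re-renders only when the selected slice changes (Object.is on the snapshot),
// so the selector should return a primitive or a stable reference.
function useVoiceSlice<U>(store: Store<VoiceState>, select: (s: VoiceState) => U): U {
  return useSyncExternalStore(store.subscribe, () => select(store.getState()))
}

// e.g. const state = useVoiceSlice(voiceStore, s => s.voiceState)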
export function useVoiceEnabled(): boolean {
const userIntent = useAppState(s => s.settings.voiceEnabled === true)
const authVersion = useAppState(s => s.authVersion)
// authVersion bumps on /login only.
// getClaudeAIOAuthTokens() spawns `security` (~60ms cold) — can't call on every render.
const authed = useMemo(hasVoiceAuth, [authVersion])
return userIntent && authed && isVoiceGrowthBookEnabled()
}
The isVoiceGrowthBookEnabled() call stays outside the memo so a mid-session kill-switch flip takes effect on the next render without waiting for a login event.
The CE proxy can reject rapid reconnects (~1/N_pods same-pod collision), and Deepgram's upstream can fail during its own teardown window. These manifest as errors before any transcript arrives. The system retries once with a 250 ms backoff.
// Only retry if: not fatal (4xx), no transcript seen yet, still recording
if (!opts?.fatal && !sawTranscript && stateRef.current === 'recording') {
if (!retryUsedRef.current) {
retryUsedRef.current = true
connectionRef.current = null // null → audio re-buffers until new onReady
attemptGenRef.current++ // stale conn's trailing close is ignored
setTimeout(() => {
if (stateRef.current === 'recording') attemptConnect(keyterms)
}, 250)
return
}
}
// Fatal errors (4xx) surface the message to the user
4xx responses are marked fatal: true by the unexpected-response handler. Fatal errors are never retried — the same request will get the same rejection.
The key takeaways:

- Both the VOICE_MODE feature flag (compile-time dead-code elimination) and an Anthropic OAuth token are required. The kill-switch defaults to "not killed" so fresh installs work immediately. API keys, Bedrock, Vertex, and Foundry are excluded by design.
- startRecording() and checkRecordingAvailability() walk the same NAPI → arecord → SoX priority order. The memoized probeArecord() result ensures that if the availability check falls through to SoX (broken arecord), the recording call does too.
- Audio captured before the WebSocket is ready accumulates in audioBuffer[] until onReady fires. This eliminates the 1–2 s OAuth+WS connect latency from the user's perceived recording start.
- stripTrailing() cleans up leaked chars without disturbing pre-existing content at the cursor boundary.
- finalize() has four resolution triggers with different latencies. The fast path (post_closestream_endpoint, ~300 ms) is the normal case. no_data_timeout (1.5 s) is the silent-drop detector. Always capturing recordingDurationMs before the finalize() await prevents WebSocket teardown time from inflating the metric.

Questions and answers:

- Why must updateState('recording') run synchronously before any await in startRecordingSession()? Otherwise useVoiceIntegration's hold-detection code would see stale 'idle' and fail to swallow auto-repeat spaces, causing spaces to leak into the prompt input.
- With both arecord and sox installed, what determines which backend is actually used? hasCommand('arecord') only checks PATH. On headless Linux, arecord exists but open() immediately fails with no ALSA card. The 150 ms race detects this: if arecord exits before the timer fires, probe.ok = false and SoX is used instead. This decision is memoized for the session.
- What is the no_data_timeout resolution source of finalize() designed to detect? When no_data_timeout fires with hadAudioSignal=true and wsConnected=true, it means audio reached the backend but no transcript came back — the signature of the ~1% silent-drop bug. The client then replays the full audio buffer on a fresh WebSocket connection.
- Why does connectVoiceStream() target api.anthropic.com rather than claude.ai? The claude.ai Cloudflare zone blocks non-browser TLS fingerprints (JA3 mismatch); api.anthropic.com serves the same backend with the same OAuth auth without that check.
- Why is the replay buffer (fullAudioRef) skipped in focus mode? Because silent-drop replay is gated on !focusTriggered, the buffer serves no purpose in focus mode and is a waste of memory.
- What happens when normalizeLanguageForSTT() receives an unsupported language like "Swahili"? It returns { code: 'en', fellBackFrom: 'Swahili' }. The /voice command handler checks stt.fellBackFrom and appends a note like "Swahili is not a supported dictation language; using English" to the enable confirmation message.