markdown.engineering
Lesson 27

The Voice System

Push-to-talk, WebSocket STT, multi-backend audio capture, and resilience patterns in Claude Code.

1. What the voice system does

Claude Code ships a push-to-talk voice input pipeline that converts speech into prompt text. Hold a key, speak, release — the transcript lands in the prompt input exactly where your cursor was. The system is gated behind two independent guards: a GrowthBook feature flag (VOICE_MODE) and an Anthropic OAuth token. Neither alone is sufficient.

High-level flow
  ┌──────────────┐  keypress   ┌───────────────────┐  PCM chunks   ┌──────────────────────┐
  │ useVoiceInt- │ ──────────► │  useVoice.ts      │ ────────────► │  voiceStreamSTT.ts   │
  │ egration.tsx │             │ (hold detection)  │               │  (WebSocket STT)     │
  └──────────────┘             └────────┬──────────┘               └──────────┬───────────┘
                                        │ startRecording()                    │ TranscriptText
                               ┌────────▼──────────┐               ┌──────────▼───────────┐
                               │  voice.ts         │               │Anthropic voice_stream│
                               │  (audio backend)  │               │  /api/ws/speech_to_  │
                               │  NAPI / arecord   │               │   text/voice_stream  │
                               │      / SoX        │               └──────────────────────┘
                               └───────────────────┘

The four key files: hooks/useVoiceIntegration.tsx (key bindings, hold threshold, prompt injection), hooks/useVoice.ts (the recording session state machine and release detection), services/voice.ts (audio capture backends), and services/voiceStreamSTT.ts (the WebSocket STT client).

2. Feature gating and auth

voice/voiceModeEnabled.ts
commands/voice/voice.ts

Three functions form a layered gate:

isVoiceGrowthBookEnabled() — the kill-switch
export function isVoiceGrowthBookEnabled(): boolean {
  // feature('VOICE_MODE') is a compile-time constant (Bun bundler)
  // Dead code is eliminated in non-ANT builds — keeps binary size down.
  return feature('VOICE_MODE')
    ? !getFeatureValue_CACHED_MAY_BE_STALE('tengu_amber_quartz_disabled', false)
    : false
}

The GrowthBook flag tengu_amber_quartz_disabled defaults to false (not disabled). A missing or stale disk cache reads as "not killed" — so fresh installs work immediately. Flipping the flag to true emergency-disables voice for all users within GrowthBook's next cache refresh cycle.

hasVoiceAuth() — OAuth token check
export function hasVoiceAuth(): boolean {
  if (!isAnthropicAuthEnabled()) return false
  const tokens = getClaudeAIOAuthTokens()
  return Boolean(tokens?.accessToken)
}

getClaudeAIOAuthTokens() spawns security on macOS (~20-50ms cold, cached afterward). The memoized result is cleared on token refresh (~once per hour), so one cold spawn per session is expected. API keys, Bedrock, Vertex, and Foundry all return false here — voice_stream is only available via Claude.ai OAuth.

/voice command preflight checks

Enabling voice runs five sequential checks before writing voiceEnabled: true to settings:

  1. isVoiceModeEnabled() — auth + GB gate.
  2. checkRecordingAvailability() — rejects remote environments (CLAUDE_CODE_REMOTE, Homespace) immediately; then probes the audio backend chain.
  3. isVoiceStreamAvailable() — re-checks OAuth token freshness.
  4. checkVoiceDependencies() — verifies at least one recording tool is usable.
  5. requestMicrophonePermission() — fires the OS TCC dialog now (macOS) rather than on first keypress. Provides platform-specific guidance on denial.
// Fires TCC dialog early — better than a surprise on first hold-to-talk
if (!(await requestMicrophonePermission())) {
  const guidance = process.platform === 'darwin'
    ? 'System Settings → Privacy & Security → Microphone'
    : 'your system\'s audio settings'
  return { type: 'text', value: `Microphone access denied. Go to ${guidance}.` }
}
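
Stitched together, the enable path is roughly the following sketch. It is not the actual command handler: the check signatures, return shapes, and the enableVoice/saveSettings names are assumptions made for illustration.

// Assumed signatures for the five documented preflight checks.
declare function isVoiceModeEnabled(): boolean
declare function checkRecordingAvailability(): Promise<{ ok: boolean; reason?: string }>
declare function isVoiceStreamAvailable(): Promise<boolean>
declare function checkVoiceDependencies(): Promise<boolean>
declare function requestMicrophonePermission(): Promise<boolean>
declare function saveSettings(patch: { voiceEnabled: boolean }): void

// Sketch of the /voice enable flow: run the five checks in order, persist intent last.
async function enableVoice(): Promise<string> {
  if (!isVoiceModeEnabled()) return 'Voice mode is not available for this account.'

  const recording = await checkRecordingAvailability()
  if (!recording.ok) return `No usable recording backend: ${recording.reason ?? 'unknown'}`

  if (!(await isVoiceStreamAvailable())) return 'OAuth token missing or expired. Run /login.'
  if (!(await checkVoiceDependencies())) return 'No recording tool found.'

  if (!(await requestMicrophonePermission())) {
    const guidance = process.platform === 'darwin'
      ? 'System Settings → Privacy & Security → Microphone'
      : "your system's audio settings"
    return `Microphone access denied. Go to ${guidance}.`
  }

  saveSettings({ voiceEnabled: true })   // write voiceEnabled: true only after all checks pass
  return 'Voice mode enabled. Hold the voice key and speak.'
}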

3. Audio recording backends

services/voice.ts

The recording layer presents a single startRecording(onData, onEnd, options?) interface over three possible backends, tried in priority order:

audio-capture-napi (Primary)
macOS, Linux (ALSA cards present), Windows. In-process via cpal + CoreAudio/AudioUnit. dlopen blocks ~1s warm, ~8s cold on macOS — loaded lazily on first voice keypress only.
arecord (Fallback 1)
Linux only. ALSA userspace utility. Probed via a 150ms race: if the process is still alive, the device opened OK. Handles WSL2+WSLg (Win11) via PulseAudio RDP pipes; fails on WSL1 and Win10-WSL2.
SoX rec (Fallback 2)
macOS, Linux. External process piping raw PCM. Requires --buffer 1024 to prevent internal buffering delay. Has built-in silence detection; arecord does not.
Windows without the native module (no fallback)
No subprocess fallback on Windows. The native module is required.
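
The selection logic amounts to a priority walk down that chain. A minimal sketch, assuming probe helpers with simplified return shapes (tryLoadAudioCaptureNapi, probeArecord, and hasCommand are stand-ins for the real probes):

type Backend = 'napi' | 'arecord' | 'sox'

declare function tryLoadAudioCaptureNapi(): Promise<boolean>   // dlopen; can block ~1-8 s, done lazily
declare function probeArecord(): Promise<{ ok: boolean }>      // the 150 ms race described below
declare function hasCommand(cmd: string): Promise<boolean>

// Walk the chain in priority order and return the first usable backend.
async function pickBackend(): Promise<Backend | null> {
  if (await tryLoadAudioCaptureNapi()) return 'napi'
  if (process.platform === 'linux' && (await probeArecord()).ok) return 'arecord'
  if (await hasCommand('rec')) return 'sox'        // SoX's recording alias
  return null                                      // Windows without the native module lands here
}
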
Why arecord needs a runtime probe, not just PATH check

On WSL1 and headless Linux, arecord is installed but open() fails immediately because there is no ALSA card and no PulseAudio server. hasCommand('arecord') returns true in all these cases. The probe works by spawning arecord with the same arguments as the real recording session and racing a 150ms timer:

// Inside probeArecord(): `child` is the freshly spawned arecord process
let stderr = ''
child.stderr.on('data', d => { stderr += d })

const result = await new Promise<{ ok: boolean; stderr: string }>(resolve => {
  const timer = setTimeout(() => {
    child.kill('SIGTERM')
    resolve({ ok: true, stderr: '' })                  // still alive = opened OK
  }, 150)

  child.once('close', code => {
    clearTimeout(timer)
    resolve({ ok: code === 0, stderr: stderr.trim() }) // exited early = failed
  })
})

The result is memoized — audio device availability does not change mid-session, and this runs on every voice keypress via checkRecordingAvailability().

SoX argument details and silence detection
const args = [
  '-q',               // quiet: no progress output
  '--buffer', '1024', // flush in small chunks; without this SoX buffers seconds
  '-t', 'raw',        // raw PCM, no WAV header
  '-r', '16000',      // 16kHz sample rate (matches STT endpoint requirement)
  '-e', 'signed',     // signed PCM
  '-b', '16',         // 16-bit depth
  '-c', '1',          // mono
  '-',                // write to stdout
]
// Silence detection (only when NOT in push-to-talk mode)
if (useSilenceDetection) {
  args.push('silence', '1', '0.1', '3%', '1', '2.0', '3%')
  // ↑ start once audio exceeds 3% for 0.1 s; stop after 2.0 s below 3%
}

Push-to-talk passes { silenceDetection: false } — the user controls start and stop. The native NAPI module also ignores its built-in onEnd in push-to-talk mode.
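
As a concrete illustration of how these arguments drive a recording process, here is a sketch that spawns SoX's rec alias and forwards stdout chunks; the actual wiring in voice.ts (error handling, stderr capture, process cleanup) differs.

import { spawn } from 'node:child_process'

// Sketch: `rec` records from the default input device and writes raw PCM to stdout.
function startSoxRecording(
  args: string[],                          // the argument array built above
  onData: (chunk: Buffer) => void,         // raw 16 kHz / 16-bit / mono PCM chunks
  onEnd: () => void,
) {
  const child = spawn('rec', args)
  child.stdout.on('data', (chunk: Buffer) => onData(chunk))
  child.once('close', () => onEnd())
  return { stop: () => child.kill('SIGTERM') }   // push-to-talk release path
}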

4. WebSocket STT protocol

services/voiceStreamSTT.ts

The STT client connects to wss://api.anthropic.com/api/ws/speech_to_text/voice_stream. It uses the same OAuth Bearer token as the rest of Claude Code. The URL includes query parameters that configure the STT session:

const params = new URLSearchParams({
  encoding:        'linear16',   // 16-bit signed PCM
  sample_rate:     '16000',
  channels:        '1',          // mono
  endpointing_ms:  '300',        // endpoint detection window
  utterance_end_ms:'1000',
  language:        options?.language ?? 'en',
})

Wire protocol messages

KeepAlive
JSON control. Sent immediately on open (prevents server close before audio starts), then every 8 s.
binary frame
Raw PCM audio chunks from the recording backend. Copied via Buffer.from() to prevent NAPI shared-ArrayBuffer races.
CloseStream
JSON control. Signals end of audio. Sent in a setTimeout(0) to flush any queued NAPI callbacks first.
TranscriptText
Interim transcript chunk. May be revised (shorter or longer) by subsequent messages. Emitted to caller as isFinal=false.
TranscriptEndpoint
Signals utterance boundary. Promotes the last TranscriptText to isFinal=true. After CloseStream, resolves finalize() fast (~300 ms).
TranscriptError
STT error (e.g., unsupported language closes with code 1008). Forwarded to caller's onError.
Why api.anthropic.com instead of claude.ai

The claude.ai Cloudflare zone uses TLS fingerprinting and blocks non-browser clients (JA3 fingerprint mismatch). The api.anthropic.com listener exposes the same private-api pod with the same OAuth auth but is on a CF zone that does not enforce browser-class TLS fingerprinting. Desktop dictation still uses claude.ai because Swift's URLSession has a browser-class JA3 fingerprint and passes the challenge.

// Override via env var for testing/staging
const wsBaseUrl =
  process.env.VOICE_STREAM_BASE_URL ||
  getOauthConfig().BASE_API_URL
    .replace('https://', 'wss://')
    .replace('http://', 'ws://')
finalize() — four resolution triggers

finalize() returns a Promise<FinalizeSource> that resolves via whichever of four paths fires first:

post_closestream_endpoint: TranscriptEndpoint arrives after CloseStream was sent (typical latency ~300 ms)
no_data_timeout: no TranscriptText arrived after CloseStream (fires at 1.5 s)
ws_close: the WebSocket close event fires (typically 3–5 s)
safety_timeout: last-resort cap (5 s)

The no_data_timeout path is the silent-drop signature — if it fires with hadAudioSignal=true, the session hit a known server bug (sticky CE pod returning zero transcripts, ~1% of sessions).
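
The race itself is small. A sketch, assuming the caller wires up the endpoint and close events (the real finalize() also tracks hadAudioSignal and other bookkeeping):

type FinalizeSource =
  | 'post_closestream_endpoint'   // TranscriptEndpoint after CloseStream (~300 ms)
  | 'no_data_timeout'             // nothing arrived within 1.5 s
  | 'ws_close'                    // the socket closed first
  | 'safety_timeout'              // last-resort 5 s cap

// Whichever trigger fires first wins; later firings are ignored.
function finalize(events: {
  onEndpoint: (cb: () => void) => void
  onWsClose: (cb: () => void) => void
}): Promise<FinalizeSource> {
  return new Promise(resolve => {
    let settled = false
    const fire = (source: FinalizeSource) => {
      if (!settled) { settled = true; resolve(source) }
    }
    events.onEndpoint(() => fire('post_closestream_endpoint'))
    events.onWsClose(() => fire('ws_close'))
    setTimeout(() => fire('no_data_timeout'), 1_500)
    setTimeout(() => fire('safety_timeout'), 5_000)
  })
}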

5. Hold-to-talk mechanics and hold threshold

hooks/useVoiceIntegration.tsx

Terminal key events arrive as a stream: one event on initial press, then auto-repeat events every 30–80 ms while held. There is no "keyup" event in a terminal. The system reconstructs hold by timing gaps between events.

The hold threshold problem: a bare-character binding like Space could be a normal typed space or the start of a hold. Requiring N rapid consecutive presses before activating voice prevents accidental triggers during normal typing.
// In useVoiceIntegration.tsx
const RAPID_KEY_GAP_MS = 120   // auto-repeat fires every 30-80ms; 120ms covers jitter
const HOLD_THRESHOLD = 5       // 5 rapid presses required before activating voice
const WARMUP_THRESHOLD = 2     // show "keep holding…" feedback at press #2

Modifier-combo bindings (e.g., Ctrl+Space) activate on the first press — no warmup required, because a modifier combo is unambiguously intentional.
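
Reduced to its core, the counting logic looks like the following sketch (the handler name and module-level state are simplified; the real hook also drives the warmup hint and per-binding configuration):

const RAPID_KEY_GAP_MS = 120
const HOLD_THRESHOLD = 5

let pressCount = 0
let lastPressAt = 0

// Called for every key event that matches the voice binding (initial press + auto-repeats).
// Returns true once enough rapid presses have accumulated to treat the key as held.
function onVoiceKeyPress(now: number, isModifierCombo: boolean): boolean {
  if (isModifierCombo) return true          // unambiguous intent: activate on the first press
  pressCount = now - lastPressAt <= RAPID_KEY_GAP_MS ? pressCount + 1 : 1
  lastPressAt = now
  return pressCount >= HOLD_THRESHOLD       // 5 rapid presses → start recording
}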

stripTrailing() — cleaning up leaked spaces

While the user holds Space for warmup, some space characters leak into the text input before stopImmediatePropagation() takes effect (listener registration order is not guaranteed). stripTrailing() removes exactly the leakage count without touching pre-existing spaces at the boundary:

// Strip exactly `maxStrip` trailing `char` chars, leaving `floor` behind
const stripTrailing = (maxStrip, { char = ' ', anchor = false, floor = 0 } = {}) => {
  // Also counts full-width spaces (U+3000) for CJK IME compatibility
  const scan = char === ' '
    ? normalizeFullWidthSpace(beforeCursor)
    : beforeCursor
  // ...
  if (anchor) {
    voicePrefixRef.current = stripped         // save text before cursor
    voiceSuffixRef.current = afterCursor      // save text after cursor
  }
}

When anchor=true, the call also captures the cursor position for interim transcript injection. The gap space inserted between prefix and suffix ensures the waveform cursor sits on the gap rather than the first suffix letter.
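
A self-contained sketch of the stripping rule itself, without the cursor anchoring, to illustrate the maxStrip/floor semantics (stripTrailingChars is a hypothetical stand-in for the real helper):

// Strip up to `maxStrip` trailing `char` characters, but never go below `floor`
// trailing chars, so pre-existing spaces at the boundary survive.
function stripTrailingChars(text: string, maxStrip: number, char = ' ', floor = 0): string {
  let trailing = 0
  while (trailing < text.length && text[text.length - 1 - trailing] === char) trailing++
  const removable = Math.max(0, Math.min(maxStrip, trailing - floor))
  return text.slice(0, text.length - removable)
}

// Example: the user had typed "fix bug " and 3 spaces leaked during warmup.
// stripTrailingChars('fix bug    ', 3, ' ', 1) === 'fix bug '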

Release detection: RELEASE_TIMEOUT_MS and REPEAT_FALLBACK_MS
// In useVoice.ts
const RELEASE_TIMEOUT_MS = 200    // gap that signals key release (auto-repeat is 30-80ms)
const REPEAT_FALLBACK_MS = 600    // arm release timer if no auto-repeat seen yet
const FIRST_PRESS_FALLBACK_MS = 2000 // modifier combos: OS initial repeat delay up to ~2s

When no second keypress arrives within 600 ms, the fallback timer arms the release detection. For modifier combos, callers pass 2000 ms to cover the long OS initial repeat delay (macOS slider at "Long" = ~2 s before auto-repeat starts).
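
The release detection then reduces to a resettable timer. A sketch (the real hook also handles the first-press fallback and modifier-combo timeouts described above):

const RELEASE_TIMEOUT_MS = 200

let releaseTimer: NodeJS.Timeout | undefined

// Called on every auto-repeat event while the key is held. If no further event
// arrives within RELEASE_TIMEOUT_MS, the key is considered released.
function onHeldKeyRepeat(onRelease: () => void): void {
  if (releaseTimer) clearTimeout(releaseTimer)
  releaseTimer = setTimeout(onRelease, RELEASE_TIMEOUT_MS)
}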

6. Session state machine

Each recording session moves through three states managed by useVoice:

idle
No recording active. WS closed. Prompt input normal.
recording
Audio capturing. PCM streaming to WS. Interim transcript shown in prompt.
processing
Key released. Audio stopped. Waiting for finalize() and final transcript.
Critical implementation detail: updateState('recording') is called synchronously before any await in startRecordingSession(). useVoiceIntegration reads voiceState from the store immediately after void startRecordingSession() to gate whether leaked space keypresses should be swallowed. If an await ran first, the guard would see stale 'idle' and let spaces leak.
Full session lifecycle timeline
Hold threshold reached (press #5)
stripTrailing with anchor=true captures prefix/suffix. State → recording synchronously. sessionGenRef++.
checkRecordingAvailability() await
Probes backend chain (memoized after first call).
startRecording() + connectVoiceStream() — parallel
Audio capture starts immediately. PCM chunks buffer in audioBuffer[] until WS opens. Keyterms gathered (git branch, recent files) before WS connect.
onReady fires (WS open)
audioBuffer flushed to WS. Subsequent chunks sent directly. KeepAlive starts (8 s interval).
TranscriptText messages arrive
Interim shown in prompt at cursor position. Non-cumulative new segments auto-finalized (legacy Deepgram only).
Key released (gap > 200 ms)
stopRecording(). State → processing. finalize() called. recordingDurationMs captured before the await.
finalize() resolves
Typically via TranscriptEndpoint (~300 ms) or no_data_timeout (1.5 s).
onTranscript(text) called
Final transcript injected at cursor anchor. State → idle.
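
The buffer-until-ready handoff in the middle of that timeline can be sketched as follows (simplified: attempt generations, audio levels, and error paths are omitted):

const audioBuffer: Buffer[] = []
let connection: { send: (chunk: Buffer) => void } | null = null

// PCM arrives as soon as the mic opens, typically 1-2 s before the WebSocket is ready.
function onAudioChunk(chunk: Buffer): void {
  if (connection) connection.send(chunk)   // steady state: stream directly
  else audioBuffer.push(chunk)             // WS not open yet: buffer
}

// onReady (WS open): flush everything captured so far, then stream live.
function onReady(conn: { send: (chunk: Buffer) => void }): void {
  connection = conn
  for (const chunk of audioBuffer.splice(0)) conn.send(chunk)
}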

7. Silent-drop replay

hooks/useVoice.ts

Approximately 1% of sessions hit a server-side bug (session-sticky CE pod that accepts audio but returns zero transcripts). The symptom: finalize() resolves via no_data_timeout despite real speech. The client detects this pattern and replays the full audio buffer on a fresh WebSocket once.

Silent-drop detection conditions and replay code

All six conditions must be true to trigger a replay:

  1. finalizeSource === 'no_data_timeout'
  2. hadAudioSignal === true (non-trivial mic signal detected)
  3. wsConnected === true (WS did open — backend received audio)
  4. !focusTriggered (not a focus-mode session)
  5. accumulatedRef.current.trim() === '' (no partial transcript accumulated)
  6. !silentDropRetriedRef.current (replay only once per session)
if (finalizeSource === 'no_data_timeout' && hadAudioSignal && wsConnected
    && !focusTriggered && focusFlushedChars === 0
    && accumulatedRef.current.trim() === ''
    && !silentDropRetriedRef.current
    && fullAudioRef.current.length > 0) {
  silentDropRetriedRef.current = true
  await sleep(250)          // backoff to clear rapid-reconnect same-pod race
  if (isStale()) return

  // Replay full buffer in 32 KB slices on a fresh connection
  const SLICE = 32_000
  for (const chunk of replayBuffer) {
    // ... batch into SLICE-sized sends ...
    conn.send(Buffer.concat(slice))
  }
  await conn.finalize()
}

The audio buffer is bounded: fullAudioRef.current skips buffering in focus mode (where sessions can last minutes and the buffer could reach ~20 MB). The 32 KB slice size batches small NAPI chunks into a reasonable WS frame size without exceeding WS message limits.
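
The slicing is plain batching. One way it could be written, shown here as a sketch rather than the in-tree loop:

const SLICE = 32_000   // bytes per replay send: batches small NAPI chunks into one WS frame

// Concatenate buffered PCM chunks into ~32 KB frames for the replay connection.
function* sliceForReplay(chunks: Buffer[]): Generator<Buffer> {
  let batch: Buffer[] = []
  let batchBytes = 0
  for (const chunk of chunks) {
    batch.push(chunk)
    batchBytes += chunk.length
    if (batchBytes >= SLICE) {
      yield Buffer.concat(batch)
      batch = []
      batchBytes = 0
    }
  }
  if (batchBytes > 0) yield Buffer.concat(batch)   // trailing partial slice
}

// Usage in the replay path: for (const slice of sliceForReplay(replayBuffer)) conn.send(slice)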

8. Focus mode

Focus mode is a "multi-clauding army" workflow: recording starts when the terminal window gains focus and stops when it loses focus. Transcript chunks are flushed immediately (rather than accumulated) so continuous dictation across long sessions stays responsive.

Focus mode differences from hold-to-talk
Trigger: key hold (hold-to-talk) vs. terminal focus gain (focus mode).
Stop trigger: key release, gap > 200 ms, vs. terminal focus lost.
Transcript delivery: accumulated and injected on stop vs. each final flushed immediately with the anchor advanced.
Silence timeout: none vs. 5 s (FOCUS_SILENCE_TIMEOUT_MS), which tears down the session to free the WS.
Silent-drop replay: yes vs. no (gated on !focusTriggered).
Audio buffer: full buffer kept for replay vs. skipped (dead weight in long sessions).
// Arms / resets the silence timer after each flushed transcript
function armFocusSilenceTimer(): void {
  if (focusSilenceTimerRef.current) clearTimeout(focusSilenceTimerRef.current)
  focusSilenceTimerRef.current = setTimeout(() => {
    if (stateRef.current === 'recording' && focusTriggeredRef.current) {
      silenceTimedOutRef.current = true
      finishRecording()           // tears down WS gracefully
    }
  }, FOCUS_SILENCE_TIMEOUT_MS)   // 5000 ms
}

9. Language normalization and keyterms

hooks/useVoice.ts   services/voiceKeyterms.ts

Language normalization

normalizeLanguageForSTT() maps the user's settings.language string (which could be "Japanese", "日本語", "ja-JP", etc.) to a BCP-47 code from a hardcoded allowlist that is a subset of the server's speech_to_text_voice_stream_config GrowthBook allowlist. Sending an unsupported code closes the WebSocket with code 1008 "Unsupported language".

// Falls back to 'en' with a fellBackFrom warning if language is unsupported
export function normalizeLanguageForSTT(language?: string): { code: string, fellBackFrom?: string } {
  if (!language) return { code: 'en' }
  const lower = language.toLowerCase().trim()
  if (SUPPORTED_LANGUAGE_CODES.has(lower)) return { code: lower }
  const fromName = LANGUAGE_NAME_TO_CODE[lower]   // e.g. "japanese" → "ja"
  if (fromName) return { code: fromName }
  const base = lower.split('-')[0]              // "ja-JP" → "ja"
  if (SUPPORTED_LANGUAGE_CODES.has(base)) return { code: base }
  return { code: 'en', fellBackFrom: language }
}

Keyterms (STT boosting)

The getVoiceKeyterms() function builds a list of up to 50 domain-specific terms sent as keyterms query parameters. The STT backend applies boosting so that "MCP", "OAuth", "TypeScript", and project-specific vocabulary are correctly recognized.

Keyterm sources and identifier splitting

Keyterms come from three sources, merged into a deduplicated Set:

  1. Global hardcoded terms: MCP, symlink, grep, regex, localhost, TypeScript, OAuth, webhook, gRPC, dotfiles, subagent, worktree. (Claude and Anthropic are already base keyterms on the server.)
  2. Project root basename: e.g., claude-code added as a whole term.
  3. Git branch words: feat/voice-keyterms → ["feat", "voice", "keyterms"].
export function splitIdentifier(name: string): string[] {
  return name
    .replace(/([a-z])([A-Z])/g, '$1 $2')  // camelCase → camel Case
    .split(/[-_./\s]+/)                    // split on separators
    .filter(w => w.length > 2 && w.length <= 20) // discard noise
}
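
Putting the three sources together, the merge could look like the following sketch, reusing splitIdentifier from above (GLOBAL_TERMS and the path/branch handling are illustrative stand-ins for the real lookups):

const GLOBAL_TERMS = ['MCP', 'symlink', 'grep', 'regex', 'localhost', 'TypeScript',
                      'OAuth', 'webhook', 'gRPC', 'dotfiles', 'subagent', 'worktree']

// Merge global terms, the project root basename, and git-branch words; dedupe; cap at 50.
function getVoiceKeyterms(projectRoot: string, gitBranch: string | null): string[] {
  const terms = new Set<string>(GLOBAL_TERMS)
  terms.add(projectRoot.split('/').pop() ?? projectRoot)            // e.g. "claude-code"
  for (const word of gitBranch ? splitIdentifier(gitBranch) : []) terms.add(word)
  return [...terms].slice(0, 50)                                    // 50-term keyterm limit
}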

10. Audio level visualization and RMS computation

While recording, the prompt input shows a 16-bar waveform. Each new PCM chunk updates the rightmost bar by computing RMS amplitude from the raw 16-bit signed PCM buffer.

const AUDIO_LEVEL_BARS = 16

export function computeLevel(chunk: Buffer): number {
  const samples = chunk.length >> 1   // 16-bit = 2 bytes per sample
  if (samples === 0) return 0
  let sumSq = 0
  for (let i = 0; i < chunk.length - 1; i += 2) {
    // Read 16-bit signed little-endian sample
    const sample = ((chunk[i]! | (chunk[i+1]! << 8)) << 16) >> 16
    sumSq += sample * sample
  }
  const rms = Math.sqrt(sumSq / samples)
  const normalized = Math.min(rms / 2000, 1)
  return Math.sqrt(normalized)   // sqrt curve spreads quieter levels visually
}
The sqrt curve is intentional. A linear scale compresses most speech energy into the top 20% of the visual range — the waveform looks flat except for loud peaks. Taking sqrt(normalized) spreads quieter levels (0.0–0.5) across more bars, making the visualization responsive to normal conversational speech.
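
Feeding the level into the 16-bar display is then a rolling window over the most recent levels. A sketch of one way to maintain the array (how the real hook batches its store updates is not shown here):

// Keep a rolling window of the last 16 levels; the newest bar sits on the right.
function pushAudioLevel(levels: number[], chunk: Buffer): number[] {
  const next = [...levels, computeLevel(chunk)]
  return next.slice(-AUDIO_LEVEL_BARS)
}

// Usage: voiceAudioLevels = pushAudioLevel(voiceAudioLevels, pcmChunk)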

11. React context layer

context/voice.tsx   hooks/useVoiceEnabled.ts

Voice state is stored in a custom Store<VoiceState> (not React state) held in a context. This enables useSyncExternalStore-based subscriptions that only re-render when the selected slice changes.

export type VoiceState = {
  voiceState:             'idle' | 'recording' | 'processing'
  voiceError:             string | null
  voiceInterimTranscript: string   // live preview text shown in prompt
  voiceAudioLevels:       number[] // 16 bars, 0–1 normalized
  voiceWarmingUp:         boolean  // show "keep holding…" hint
}
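
A sketch of the subscription pattern this enables, assuming a minimal subscribe/getState shape on the store (the real Store API may differ):

import { useSyncExternalStore } from 'react'

// Assumed minimal store shape for the sketch.
type Store<T> = {
  subscribe: (listener: () => void) => () => void
  getState: () => T
}

// Re-renders only when the selected slice changes identity between snapshots,
// so the selector should return a stable reference for unchanged state.
function useVoiceSelector<U>(store: Store<VoiceState>, select: (s: VoiceState) => U): U {
  return useSyncExternalStore(store.subscribe, () => select(store.getState()))
}

// Example: a component that only cares about the waveform bars.
// const levels = useVoiceSelector(voiceStore, s => s.voiceAudioLevels)
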
useVoiceEnabled — why auth is memoized on authVersion
export function useVoiceEnabled(): boolean {
  const userIntent   = useAppState(s => s.settings.voiceEnabled === true)
  const authVersion  = useAppState(s => s.authVersion)
  // authVersion bumps on /login only.
  // getClaudeAIOAuthTokens() spawns `security` (~60ms cold) — can't call on every render.
  const authed = useMemo(hasVoiceAuth, [authVersion])
  return userIntent && authed && isVoiceGrowthBookEnabled()
}

The isVoiceGrowthBookEnabled() call stays outside the memo so a mid-session kill-switch flip takes effect on the next render without waiting for a login event.

12. Early-error retry

The CE proxy can reject rapid reconnects (~1/N_pods same-pod collision), and Deepgram's upstream can fail during its own teardown window. These manifest as errors before any transcript arrives. The system retries once with a 250 ms backoff.

// Only retry if: not fatal (4xx), no transcript seen yet, still recording
if (!opts?.fatal && !sawTranscript && stateRef.current === 'recording') {
  if (!retryUsedRef.current) {
    retryUsedRef.current = true
    connectionRef.current = null       // null → audio re-buffers until new onReady
    attemptGenRef.current++             // stale conn's trailing close is ignored
    setTimeout(() => {
      if (stateRef.current === 'recording') attemptConnect(keyterms)
    }, 250)
    return
  }
}
// Fatal errors (4xx) surface the message to the user
Fatal vs. transient errors: 4xx HTTP upgrade rejections (Cloudflare bot challenge, auth rejection) are marked fatal: true by the unexpected-response handler. Fatal errors are never retried — the same request will get the same rejection.

Key Takeaways

  1. Voice is double-gated. Both the GrowthBook VOICE_MODE feature flag (compile-time dead-code elimination) and an Anthropic OAuth token are required. The kill-switch defaults to "not killed" so fresh installs work immediately. API keys, Bedrock, Vertex, and Foundry are excluded by design.
  2. The backend fallback chain matches the availability check chain. startRecording() and checkRecordingAvailability() walk the same NAPI → arecord → SoX priority order. The memoized probeArecord() result ensures that if the availability check falls through to SoX (broken arecord), the recording call does too.
  3. Audio starts before the WebSocket opens. PCM chunks buffer in audioBuffer[] until onReady fires. This eliminates the 1–2 s OAuth+WS connect latency from the user's perceived recording start.
  4. State transitions to 'recording' synchronously before any await. Any async work before this transition would let the hold-detection code see stale 'idle' and allow auto-repeat key characters to leak into the text input.
  5. The hold threshold (5 rapid presses) prevents accidental activation for bare-character bindings. Modifier combos bypass the threshold entirely. stripTrailing() cleans up leaked chars without disturbing pre-existing content at the cursor boundary.
  6. The silent-drop replay is a client-side workaround for a server-side bug. The full audio buffer is kept specifically for this one-shot replay. Focus mode skips buffering to avoid multi-MB accumulation in long sessions.
  7. finalize() has four resolution triggers with different latencies. The fast path (post_closestream_endpoint, ~300 ms) is the normal case. no_data_timeout (1.5 s) is the silent-drop detector. Always capturing recordingDurationMs before the finalize() await prevents WebSocket teardown time from inflating the metric.
  8. Focus mode is architecturally different from hold-to-talk. It disables audio buffering, replaces accumulation with immediate flush, and uses a 5-second silence timer instead of key release as the stop trigger.

Quiz

1. Why does updateState('recording') run synchronously before any await in startRecordingSession()?
  • A To make the animation frame update immediately
  • B Because useVoiceIntegration reads voiceState synchronously after void startRecordingSession() to decide whether to swallow auto-repeat spaces
  • C To prevent the release timer from firing before recording starts
  • D Because React's concurrent mode requires state updates before async boundaries
Answer: B. If any await ran first, useVoiceIntegration's hold-detection code would see stale 'idle' and fail to swallow auto-repeat spaces, causing spaces to leak into the prompt input.
2. On headless Linux with both arecord and sox installed, what determines which backend is actually used?
  • A Whichever was installed first in PATH
  • B A compile-time constant
  • C The runtime probeArecord() result — if arecord can't open a device (exits before 150 ms), it falls through to SoX
  • D The user's audioBackend setting in settings.json
Answer: C. hasCommand('arecord') only checks PATH. On headless Linux, arecord exists but open() immediately fails with no ALSA card. The 150 ms race detects this: if arecord exits before the timer fires, probe.ok = false and SoX is used instead. This decision is memoized for the session.
3. What is the no_data_timeout resolution source of finalize() designed to detect?
  • A The user held the key but did not speak
  • B The WebSocket upgrade was rejected by Cloudflare
  • C A session-sticky CE pod that accepted audio but returned zero transcripts (the silent-drop bug)
  • D The OAuth token expired mid-session
Answer: C. When no_data_timeout fires with hadAudioSignal=true and wsConnected=true, it means audio reached the backend but no transcript came back — the signature of the ~1% silent-drop bug. The client then replays the full audio buffer on a fresh WebSocket connection.
4. Why does connectVoiceStream() target api.anthropic.com rather than claude.ai?
  • A Lower latency routing
  • B Different authentication scheme on claude.ai
  • C The claude.ai Cloudflare zone blocks non-browser TLS fingerprints; api.anthropic.com exposes the same pod without that restriction
  • D The voice_stream endpoint is not available on claude.ai
Answer: C. Cloudflare TLS fingerprinting (JA3) on the claude.ai zone challenges Node.js / Bun WebSocket connections. The api.anthropic.com listener exposes the same private-api pod with the same OAuth Bearer auth but is on a CF zone that doesn't enforce browser-class fingerprints.
5. Why is the full audio buffer (fullAudioRef) skipped in focus mode?
  • A Focus mode uses a different STT provider that doesn't support replay
  • B Focus mode sessions can last minutes, making the buffer potentially 20+ MB; and replay is gated on !focusTriggered anyway
  • C The NAPI backend doesn't support buffering in focus mode
  • D Focus mode always gets a TranscriptEndpoint before CloseStream
Answer: B. At 32 KB/s PCM, a 10-minute focus session would accumulate ~20 MB. Since replay is explicitly gated on !focusTriggered, the buffer serves no purpose in focus mode and is a waste of memory.
6. What happens when normalizeLanguageForSTT() receives an unsupported language like "Swahili"?
  • A Voice mode is disabled and the user sees an error
  • B The WebSocket connects but closes with code 1008
  • C It falls back to 'en' and sets fellBackFrom: 'Swahili', which surfaces a warning in the /voice toggle response
  • D It sends the raw string "Swahili" to the STT endpoint
Answer: C. The function returns { code: 'en', fellBackFrom: 'Swahili' }. The /voice command handler checks stt.fellBackFrom and appends a note like "Swahili is not a supported dictation language; using English" to the enable confirmation message.