Audio input

Picking your microphone, tuning the input level, filtering background noise, and setting the language Whisper transcribes.

Choosing a microphone

Settings → Audio → Microphone lists every input device Windows knows about. The default option is “System default” — whatever Windows currently has set as the default recording device. If you have several mics (built-in laptop mic, USB mic, headset, webcam), pick the one you want explicitly so Ditto doesn’t switch when Windows reshuffles defaults.
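As a rough illustration of how an explicit microphone choice differs from “System default”, here is a minimal sketch in TypeScript. It is not Ditto’s actual code: the `SYSTEM_DEFAULT` sentinel, the `DeviceInfo` shape, and both function names are assumptions made for the example. The only real API it relies on is the standard `deviceId` constraint accepted by `navigator.mediaDevices.getUserMedia()`.

```typescript
// Minimal shape of the entries returned by
// navigator.mediaDevices.enumerateDevices() (only the fields used here).
interface DeviceInfo {
  deviceId: string;
  kind: string; // "audioinput", "audiooutput" or "videoinput"
  label: string;
}

// Hypothetical sentinel standing in for the "System default" option.
const SYSTEM_DEFAULT = "system-default";

// Keep only microphones from the full device list.
function listMicrophones(devices: DeviceInfo[]): DeviceInfo[] {
  return devices.filter((d) => d.kind === "audioinput");
}

// Build the `audio` part of a getUserMedia() constraints object.
// An empty object lets the platform pick its current default device;
// `exact` pins the capture to one specific device.
function micConstraints(choice: string): { deviceId?: { exact: string } } {
  if (choice === SYSTEM_DEFAULT) {
    return {};
  }
  return { deviceId: { exact: choice } };
}
```

In a renderer, the result would be passed along the lines of `navigator.mediaDevices.getUserMedia({ audio: micConstraints(id) })`. Pinning with `exact` is what keeps the capture from silently moving to another device when the system default changes.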

Microphone gain

Some mics record too quietly, some too hot. Settings → Audio → Microphone gain lets you scale the input volume in software:

The gain is applied in the renderer before audio gets sent to Whisper, so it directly affects how Whisper hears you. If your transcriptions are inconsistent, try moving the slider in 10% steps and re-recording after each change.
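To make “scaling the input volume in software” concrete, here is a minimal sketch of gain applied to raw samples. It is an illustration only: the source does not say how Ditto implements gain (a Web Audio `GainNode` would be another plausible route), and the function names and the mapping of 100% to unity gain are assumptions.

```typescript
// Assumption: the slider is a percentage where 100% leaves audio unchanged.
function percentToGain(percent: number): number {
  return percent / 100;
}

// Scale PCM samples (nominal range -1..1) by a gain factor.
// Values are clamped so that loud input clips at full scale
// instead of leaving the valid sample range.
function applyGain(samples: Float32Array, gain: number): Float32Array {
  return samples.map((s) => Math.max(-1, Math.min(1, s * gain)));
}
```

The clamp shows why too much gain hurts: once samples hit full scale the waveform is flattened, and that distortion is what Whisper receives.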

Noise filter

Settings → Audio → Noise filter turns on Chromium’s built-in noise suppression. It runs in real time, before recording, on the audio track itself.
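In Chromium, noise suppression is requested through the standard `noiseSuppression` media track constraint. The sketch below shows how a settings toggle might map onto it. The `AudioSettings` shape and both function names are hypothetical, and Ditto’s real wiring may differ.

```typescript
// Hypothetical settings shape; only the noise filter toggle is modelled.
interface AudioSettings {
  noiseFilter: boolean;
}

// Build the `audio` constraints for getUserMedia().
// `noiseSuppression` is a standard MediaTrackConstraints property.
function noiseFilterConstraints(settings: AudioSettings): { noiseSuppression: boolean } {
  return { noiseSuppression: settings.noiseFilter };
}

// Check what the browser actually applied, given the object returned by
// MediaStreamTrack.getSettings(). The field may be absent if unsupported.
function isNoiseFilterActive(trackSettings: { noiseSuppression?: boolean }): boolean {
  return trackSettings.noiseSuppression === true;
}
```

Because the constraint is set on the audio track itself, suppression happens before any samples reach the recorder, which matches the “before recording” behaviour described above.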

It’s good at removing constant background noise:

It’s less effective against sudden noises (a door closing, a dog barking, someone shouting). For those, recording quality and a fixed language hint to Whisper do more than the filter does.

Transcription language

Settings → Audio → Language sets which language Whisper expects to hear. This is the content language — what you say into the mic — not the language of Ditto’s UI. (UI language is a separate setting in Settings → General. See Languages for the difference.)

Two modes:

Ditto supports the same eight languages here as in the UI: English, Spanish, French, German, Italian, Portuguese, Japanese, Chinese.
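Whisper identifies languages by short codes rather than by name. As a sketch, the eight languages listed above could be mapped like this. The codes are the ones Whisper uses for these languages; the map and function names are assumptions for illustration, not Ditto’s internals.

```typescript
// Display name -> Whisper language code for the eight supported languages.
const WHISPER_CODES = new Map<string, string>([
  ["English", "en"],
  ["Spanish", "es"],
  ["French", "fr"],
  ["German", "de"],
  ["Italian", "it"],
  ["Portuguese", "pt"],
  ["Japanese", "ja"],
  ["Chinese", "zh"],
]);

// Returns the Whisper code for a display name,
// or undefined when the language is not in the supported set.
function toWhisperCode(languageName: string): string | undefined {
  return WHISPER_CODES.get(languageName);
}
```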

Translate to English

Settings → Audio → Translate to English changes Whisper’s job. Instead of transcribing what you said in the language you said it, Whisper translates it into English in one step.

Some examples of what this means in practice:

Use cases:

This works best with Medium or Large-v3 — translation is harder than transcription, and smaller models can produce stilted output.
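Whisper exposes this choice as a task, either `transcribe` or `translate`, alongside the spoken language. A minimal sketch of how the two settings might combine into a request follows. The `WhisperRequest` shape and `buildWhisperRequest` are hypothetical names; only the two task values come from Whisper itself.

```typescript
// The two tasks Whisper supports.
type WhisperTask = "transcribe" | "translate";

// Hypothetical request shape passed to whatever runs the model.
interface WhisperRequest {
  task: WhisperTask;
  language?: string; // language being spoken, e.g. "es"; omitted if not fixed
}

// Combine the "Translate to English" toggle with the transcription language.
// Note that `language` always describes the audio, not the output text:
// with task "translate" the output is English regardless.
function buildWhisperRequest(translateToEnglish: boolean, language?: string): WhisperRequest {
  const request: WhisperRequest = {
    task: translateToEnglish ? "translate" : "transcribe",
  };
  if (language !== undefined) {
    request.language = language;
  }
  return request;
}
```

For example, `buildWhisperRequest(true, "es")` describes Spanish speech that should come back as English text, while `buildWhisperRequest(false, "es")` describes the same speech returned as Spanish text.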