Why We Built The Orb - Voice Capture Reimagined
The story behind our global voice capture system, and how we made it possible to press Fn from any app and get perfectly formatted text.
The Problem with Voice Capture
Voice capture tools suffer from a fundamental UX problem: they require you to leave your context. Click a microphone icon in some web app, or switch to a dedicated recording app, and you've broken your flow. Even worse, most voice tools send your audio to the cloud for processing—a privacy nonstarter for many use cases.
We wanted something different. A system that works from anywhere, keeps your data local, and feels instantaneous. Press a key, speak, release, and have text appear. No context switching, no cloud roundtrip, no privacy concerns.
Enter The Orb
The Orb is our global voice capture system. It's deceptively simple: hold the Fn key to record, release to transcribe. But under the hood, it coordinates several subsystems that had to be engineered from scratch.
Global Keyboard Monitoring
The first challenge: detecting the Fn key globally. Most apps can only see keyboard events when they have focus. We needed to capture Fn presses from anywhere—even when Kollabor isn't the active app.
On macOS, this meant diving into native APIs. We use NSEvent.addGlobalMonitorForEventsMatchingMask with the NSFlagsChangedMask to detect modifier flag changes. The Fn key has keyCode 63, but we also check the NSEventModifierFlagFunction (bit 23 in the modifier flags) because macOS may send different keyCodes on release.
We maintain an AtomicBool to track press state and emit Tauri events (fn-key-pressed, fn-key-released) that the frontend listens to. This requires Accessibility permissions, which we prompt for on first launch.
The NSPanel Overlay
When you press Fn, we need to show visual feedback—a pulsing orb that indicates recording is active. But here's the catch: regular NSWindow instances cannot appear above fullscreen apps on macOS. If you're in a fullscreen IDE or presentation, a normal window won't show.
The solution is NSPanel—a special window type designed for floating palettes. We use the tauri-nspanel crate to define our RecordingOrb panel with is_floating_panel: true and can_become_key_window: true. This ensures the orb appears above everything, even fullscreen apps.
Audio Recording Pipeline
Once Fn is pressed, we start capturing audio. We use cpal (Cross-platform Audio Library) to access the default input device and build an input stream. The stream captures audio samples in real-time and writes them to a WAV file using hound.
There's a threading consideration here: cpal::Stream isn't Send + Sync because its callbacks capture non-thread-safe closures. We can't store it directly in Tauri's AppState. Instead, we spawn a dedicated thread that owns the stream and WAV writer, then use channels to signal when to stop. This ensures proper RAII cleanup—the WAV file is finalized before the thread exits.
Local Whisper Transcription
When you release Fn, the recording stops and the audio file is finalized. Now we need to transcribe it. We use OpenAI's Whisper model, but here's the key: it runs entirely locally on your machine via whisper.cpp bindings (the whisper-rs crate).
No cloud API calls. Your voice never leaves your device. We support multiple model sizes (tiny, base, small, medium, large) so you can trade off accuracy for speed. The transcription happens in the background, and when it's complete, the text is emitted to the frontend.
We store transcriptions in ~/.kollabor/transcription-history.json so you can revisit past recordings. Each entry includes the text, duration, timestamp, and which model was used.
AI Enhancement
Raw transcription is useful, but we go further. After Whisper converts speech to text, you can send it through an AI model for enhancement—fixing grammar, restructuring sentences, adding formatting, or extracting action items. This happens through our chat system, which supports multiple providers (Anthropic, OpenAI-compatible, etc.).
The combination is powerful: capture voice with Fn, get instant local transcription, then optionally refine with AI. All without leaving your workflow.
The UX Flow
Put it all together and the experience is seamless:
- You're in any app—browser, IDE, terminal
- Hold Fn key (detected via global NSEvent monitor)
- NSPanel orb appears, pulsing to indicate recording
- Speak your thoughts
- Release Fn key
- Audio stops, WAV finalizes
- Whisper transcribes locally
- Text appears, ready to paste or enhance
No clicks, no context switches, no cloud. Just hold-speak-release.
Why Local Matters
Building local-first wasn't just a privacy choice—it was a latency choice. Cloud transcription adds network roundtrips (hundreds of milliseconds at best). Local Whisper on a modern Mac completes in near real-time. There's no API rate limiting, no per-minute costs, and no risk of a service going down.
For sensitive conversations—meetings, medical discussions, personal notes—local processing means your audio never touches a server. You control your data.
What's Next
The Orb is just the beginning. We're exploring streaming transcription (showing text as you speak), wake word detection, and custom model fine-tuning for domain-specific vocabulary.
Voice capture shouldn't be a separate activity you switch to. It should be woven into your workflow, available instantly without breaking flow. That's what The Orb aims to be.
Try it in Kollabor and let us know what you think.