Katchy for Mac
Download
A field guide · Volume 01 · 2026

How Katchy works.

One hotkey. Four small stages. Three frontier models. A short essay on the interaction model behind a friendly, free, native macOS application that does, quietly, almost anything you can describe out loud.

Local-first · GPT · Claude · Gemini · macOS 14.2+ · Apple Silicon and Intel

A note before we begin

Our apologies in advance: we are only one week into this project, so you may bump into a bug or two. We patch quickly, and we are very, very grateful you came to look. :D

The shape of the interface decides what is possible. Chat got us thinking. Tool use got us building. Agents finally let the model touch the same screen you do. Everything fun lives in that third era, and Katchy is the smallest, calmest, most-Macintosh-shaped window into it we could make.

The rest of this page is the under-the-hood guide. What happens between you pressing a key and an answer arriving in your ear. Why we route to a different frontier model depending on what you asked. What never leaves your Mac. We tried to keep it short.

01b · The thesis

What collaboration
actually needs.

Property 01

Copresence

We share the same object. Katchy looks at the same window, the same paragraph, the same Figma frame you do. It is not guessing at a description, it is reading the pixels you are reading.

In Katchy
ScreenCaptureKit single-frame, scoped to the active display.

Property 02

Contemporality

Feedback while the work is happening, not after. Katchy answers in the moment you stop speaking, while the question is still warm, no submit button, no spinner, no email thread.

In Katchy
Hotkey-up to first token in roughly 80 milliseconds.

Property 03

Simultaneity

We can both be doing things at the same time. You keep typing while an agent renames 47 screenshots. Katchy keeps reasoning while you scroll. Neither has to take a turn.

In Katchy
Off-main-thread agent loop, cancellable with ⌘ . at any moment.

What we read

Three properties, taken almost verbatim from Thinking Machines' essay on interaction models. They argue that real collaboration, with people, with code, with anything, requires all three at once. Most AI systems today are tuned for autonomous operation and miss them entirely. Katchy is built squarely on the interactive case.

01 · The interaction problem

Three eras
of talking
to a computer.

1.0 · Chat

You type, the model types back. You copy, you paste, you swap between tabs. Powerful, but the model only knows what you tell it and can never touch anything you can see.

All thought. No hands. No eyes.

2.0 · Tool use

Models started calling APIs. Read this file. Search this database. Send this email. Wonderful, but you still had to wire every tool together yourself in code.

Real power, but you are the plumbing.

3.0 · Agents

The model watches your screen, holds a plan in mind, takes actions, checks its own work, and only asks you when it really needs you. This is where Katchy lives.


Interlude · the bandwidth problem

“Like resolving disagreements
via email instead of in person.”

- Thinking Machines, on the collaboration bottleneck

Chat is a single thread: until you finish typing, the model perceives nothing; until it finishes writing, the model perceives nothing. The channel is narrow. Voice plus a fresh screenshot is a much wider one, which is why Katchy talks instead of typing.

02 · Anatomy of one request

What happens
in the eighty
milliseconds.

Every request flows through the same four-stage pipeline. Click a stage or just watch: the diagram auto-cycles every few seconds and stops the moment you take over.

Press ⌃ ⌥. Talk like a human.

The moment both modifiers go down, Katchy opens a low-latency audio buffer through CoreAudio. While the chord is held, the waveform streams into a ring buffer; the instant you let go, the recording stops. Nothing is sent until you finish speaking.

  • CoreAudio capture at 16 kHz mono
  • On-device Whisper transcription when supported
  • Voice activity detection trims dead air before upload
  • Buffer is discarded the moment the request finishes
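The hold, trim, discard loop in the bullets above can be sketched in a few lines. This is an illustrative Python model of the behavior, not Katchy's Swift internals; the class name, silence threshold, and buffer size are all made up for the example:

```python
from collections import deque

class PushToTalkBuffer:
    """Illustrative sketch: audio frames stream into a bounded ring
    buffer while the hotkey chord is held; on release, dead air is
    trimmed from both ends and the raw buffer is discarded."""

    def __init__(self, max_frames=16_000 * 30):  # ~30 s at 16 kHz mono
        self.frames = deque(maxlen=max_frames)   # oldest audio falls off the back

    def feed(self, sample: int) -> None:
        """Called for every incoming sample while the chord is down."""
        self.frames.append(sample)

    def release(self, silence_threshold: int = 100) -> list:
        """Chord released: trim leading/trailing silence, then clear
        the buffer immediately so nothing lingers after the request."""
        samples = list(self.frames)
        self.frames.clear()                      # discard raw audio right away
        voiced = [i for i, s in enumerate(samples) if abs(s) > silence_threshold]
        if not voiced:
            return []                            # pure silence: nothing to send
        return samples[voiced[0]:voiced[-1] + 1]
```

Feeding silence, speech, silence and releasing returns only the spoken span, and a second release returns nothing, mirroring the "buffer is discarded" bullet.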

02b · Architecture

Interaction model
in front. Background
model behind.

Front · synchronous

The interaction
model.

Stays present while you talk. Holds the conversation in short-term memory. Tracks whether you are thinking, yielding, or interrupting. Replies in roughly the time it takes you to blink.

  • Push-to-talk audio + the one screenshot.
  • Streamed tokens, never "please wait".
  • Cancellable mid-flight with ⌘ .

Back · asynchronous

The background
model.

Takes the slow, sustained work. A multi-step agent loop with tools: filesystem, AppleScript, Shortcuts, browser. Plans, acts, re-reads its own output, tries again. Reports back when done.

  • Runs off the main thread, never blocks the UI.
  • Shares the conversation context with the front.
  • Final result lands as a quiet menu-bar notification.

This is the two-part architecture Thinking Machines proposes, in miniature. The interaction model gives you the responsiveness of a small model. The background model gives you the planning and tool use of a large one. They share context. You never see the seams.
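A toy version of that split fits in one small class: a synchronous front path that answers at once, a background worker thread that pulls slow tasks off a queue, one shared context list, and a single cancel flag checked on every loop (the ⌘ . idea). Every name here, including the `echo:` stand-in for a model call, is a hypothetical sketch of the pattern, not the app's actual code:

```python
import threading
import queue

class TwoPartAssistant:
    """Sketch of the front/back architecture: synchronous interaction
    path plus an asynchronous agent loop sharing one context."""

    def __init__(self):
        self.context = []                  # conversation memory shared by both halves
        self.tasks = queue.Queue()         # slow work handed to the background
        self.results = []
        self.cancel = threading.Event()    # one flag cancels mid-flight
        self.worker = threading.Thread(target=self._agent_loop, daemon=True)
        self.worker.start()

    def ask(self, prompt: str) -> str:
        """Front path: reply immediately, record in shared context."""
        reply = f"echo: {prompt}"          # stands in for a fast model call
        self.context.append((prompt, reply))
        return reply

    def delegate(self, steps) -> None:
        """Back path: queue a multi-step task for the agent loop."""
        self.tasks.put(steps)

    def _agent_loop(self):
        while not self.cancel.is_set():
            try:
                steps = self.tasks.get(timeout=0.05)
            except queue.Empty:
                continue
            for step in steps:             # plan -> act -> check, step by step
                if self.cancel.is_set():   # cancel flag checked on every loop
                    return
                self.results.append(step())
            self.tasks.task_done()
```

The front never blocks on the back; the back reads the same `context` the front writes, which is the "they share context, you never see the seams" claim in miniature.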

03 · The stack, layer by layer

Four small, boring,
extremely well-named
Apple frameworks.

01 · Listen

Push to talk

Hold Control and Option. macOS captures audio locally through CoreAudio. The waveform is transcribed on-device when possible, then trimmed and sent only if a frontier model is needed.

02 · See

A snapshot of your screen

When the question needs context, ScreenCaptureKit grabs a single, scoped frame. Katchy never streams video, never records continuously, never stores screenshots after the answer.

03 · Think

The right brain for the job

Katchy routes the request to whichever frontier model handles it best. Long PDFs to Claude. Vision-heavy tasks to Gemini. Code and quick edits to GPT. The router picks, you stay still.

04 · Act

Cursor + agents, in your menu bar

A friendly triangle points at the answer when one click is enough. A multi-step agent runs in the background when ten clicks are needed. Both share the same memory, both quit when you do.

04 · The router decision

One sentence in,
the right brain
on it.

A small classifier reads your transcript and the page tokens, then dispatches each request to whichever model fits best. Try a few yourself; the router shows its work.

Try a question

Router decision

Summarise this 60-page PDF I just opened

OpenAI · GPT
Anthropic · Claude
Google · Gemini
Multi-step · Agent

Why this one: Long context, careful reasoning over a structured document.
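A deliberately simple, rule-based stand-in makes the dispatch table above concrete. The real router is a classifier over transcript plus page tokens; these keyword rules are purely illustrative:

```python
def route(transcript: str) -> str:
    """Illustrative keyword router: multi-step verbs to the agent loop,
    vision-heavy asks to Gemini, long documents to Claude, everything
    else (quick edits, code, tight rewrites) to GPT."""
    t = transcript.lower()
    if any(w in t for w in ("rename", "clean up", "triage", "convert these")):
        return "agent"       # multi-step work runs in the background
    if any(w in t for w in ("chart", "slide", "screenshot", "this screen")):
        return "gemini"      # the visual is half the question
    if any(w in t for w in ("pdf", "summarize", "summarise", "document")):
        return "claude"      # long context, careful reasoning
    return "gpt"             # quick edits, code review, tight rewrites
```

Rule order matters: a request like "Rename 47 screenshots" mentions screenshots but is really a multi-step job, so the agent check runs first.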

05 · Three brains, one menu bar

Different questions
deserve different
models.

OpenAI

GPT

Goes here for quick edits, code review, and the kind of structured rewriting where you want it back in a sentence and a half.

  • Tight rewrites
  • Code review
  • Quick edits

Anthropic

Claude

Goes here for long documents, careful reasoning, and anything where you would rather not have a confident wrong answer.

  • Long context
  • Careful reasoning
  • Nuanced writing

Google

Gemini

Goes here for screen-heavy moments, charts, slides, and the cases where the visual is half the question.

  • Vision
  • Charts and slides
  • Fast turnaround

06 · Capabilities

A short menu of things you can just say.

Reading & writing

  • Summarize that 60-page PDF

    Reading

  • Draft a polite no

    Writing

  • Write a quick changelog

    Writing

  • Translate to Spanish

    Writing

  • Resume yesterday's draft

    Writing

  • Generate a weekly recap

    Writing

Files & system

  • Rename 47 screenshots

    Files

  • Clean up your Desktop

    Files

  • Pull data from this CSV

    Numbers

  • Convert these to PNG

    Files

  • Open last screenshot

    Files

  • Pin Spotify to the menu bar

    System

  • What does this command do?

    Shell

  • Convert this to a table

    Numbers

Daily flow

  • Triage your inbox

    Mail

  • Open this in Cursor

    Code

  • Reschedule the standup

    Calendar

  • Tag these photos by face

    Photos

  • Mute Slack for an hour

    Focus

  • What changed in this file?

    Code

  • Add this to Reminders

    Tasks

  • Find that PDF I lost

    Search

And anything else

These are a few from this week. The real list is anything you can say out loud while pointing at your screen. Agents do the rest.

06b · By the numbers

What a calm
agentic app
measures up to.

3

frontier models in one menu bar

0

servers we own, ever

1

hotkey is the whole UI

~ 80 MB

of disk to host all of that

80 ms

from hotkey-up to first token

0

files leave your Mac until you ask

07 · Across your whole Mac

Same loop. Every app.
Every workflow.

08 · Local where it matters

Your screen does
not leave your Mac,
unasked.

01 · Local by default.

Audio is captured into RAM and discarded the moment the request finishes. Screenshots stay in memory. Conversations live in your Application Support folder, not on a server.

02 · Smallest possible payload.

The router clips the audio to just the spoken part, masks the menu bar and dock out of any screenshot, and never sends conversation history the model does not need.

03 · Bring your own keys.

Anthropic, OpenAI, and Google keys live in your macOS Keychain. We never see them. You can pull or rotate them at any time.

04 · Cancellable at every step.

⌘ . stops a request mid-flight. Agents check the cancel flag on every loop. There is no "please wait while we tidy up."

09 · A short reading list

Standing on
four sets of
shoulders.

01 · Metis

Scott's practical-knowledge concept. Stochastic, intuitive, local. Reasoning that is appropriate when uncertainty is high and the right answer depends on the room. Agents need it.

02 · Hayek's knowledge problem

Important knowledge lives in the particular circumstances of time and place. The screen in front of you, right now, is exactly that knowledge. Katchy reads it.

03 · The Bitter Lesson

Sutton. Hand-crafted systems get outpaced by general capability + scale. So we keep the surface boring and well-named, and let the frontier models do the hard part.

04 · Orality

Ong on the participatory nature of oral communication. Voice is closer to natural collaboration than typing into a box. Push-to-talk is not a trick, it is the right interface.

Footnotes

  • The bottleneck

    Frontier models today are optimised for "autonomous, long-running" use. A recent frontier-model card admits that "when used in an interactive, synchronous, hands-on-keyboard pattern, the benefits of the model were less clear." Most real work is interactive. Katchy is built squarely on the interactive case.

  • Bandwidth

    Chat is a single thread: until you finish typing, the model perceives nothing; until it finishes generating, it perceives nothing again. Thinking Machines call this a narrow channel for human-AI collaboration. Voice + a fresh screenshot is a much wider one.

  • Interaction model + background model

    Their proposed architecture has two halves. An interaction model that stays present and synchronous. A background model that takes longer-horizon work asynchronously. Katchy maps cleanly: the menu bar is the interaction model, the agent loop is the background model, and they share context.

One last thing

One hotkey.
The whole interface.

Three minutes to download. One chord to remember. Zero euros, every day from now until the heat death of the universe.

Download Katchy · Back to home

macOS 14.2+ · Apple Silicon and Intel · ~80 MB