Before ChatGPT: We Were Building AI That Does Things
In November 2021, our team sat around a screen and had a breakthrough: what if musicians could control synthesizers with their voice?
Not with pre-programmed commands like Alexa. With natural language. “Make the attack softer.” “Open up the lowpass filter.” “Generate ten good-sounding notes.”
We found GPT-3’s API, saw the potential, and started building. Our meeting notes from November 30, 2021 lay out the vision:
“GPT-3 assistant ‘ditto’ becomes the omnisynth assistant. For voice activated tuning synth parameters. Simple to implement.”
Here’s the thing most people miss about this story: we weren’t predicting chatbots. We were predicting agents — AI that takes actions on your behalf. That distinction matters more than we realized at the time.
The Context: Pre-ChatGPT 2021
It’s hard to remember now, but in late 2021, AI meant image recognition and spam filters to most people. GPT-3 existed, but barely anyone was using it. The general public had no mental model for “large language model as interface.”
We weren’t AI researchers. We were building hardware music synthesizers — a team of engineers who’d turned a senior design project into an LLC. But we kept hitting the same problem: music production software is hard to learn. Too many knobs, too many parameters, too steep a learning curve.
From our June 2021 notes on teaching synthesis:
“They learn the concepts but fail to implement. Incorrect environment setups. Debugging is stepping back to see the big picture.”
Our solution wasn’t a better tutorial. It was a better interface:
“Lay it out clearly step-by-step. Trial-and-error on rails. So they never deviate too far off course. Giving them clean, organized software. Human centric software.”
But traditional UI/UX wasn’t cutting it. You still needed to know what a “lowpass filter” was to find the lowpass filter knob.
July 2021: First AI Mention
In July, our team started talking about neural networks:
“Neural nets: need to figure out model training scheme. OmniAura neural network framework. Correlate text & computer vision. GPT-J.”
GPT-J was an open-source alternative to GPT-3. We didn’t pursue it immediately — we were focused on hardware, patents, and getting our first $25K investment wired. But the seed was planted.
By September, we were already thinking about interconnected systems:
“Done correctly, all our products can talk to each other (like Apple, but open-source Apple).”
We were building toward an ecosystem where everything speaks to everything else. We just didn’t know AI would be the connective tissue.
November 2021: The Breakthrough
By November, we’d discovered OpenAI’s GPT-3 API. And suddenly everything clicked.
What if you didn’t need to learn the interface? What if you could just talk to your synthesizer?
From our November 30, 2021 meeting notes:
“GPT-3 = user speech to structured data. Makes your workflow enjoyable and accessible.”
The use cases wrote themselves:
- “Okay Omni, increase filter to 800 Hz”
- “Make the attack softer”
- “Generate 10 good sounding notes”
- “Click new song”
And the note that still gives me chills:
“This command should spit out structured data to interact with GUI.”
Read that again. In November 2021 — a full year before ChatGPT — we were designing an AI that takes your voice input, translates it to structured commands, and executes actions in software. That’s not a chatbot. That’s an agent.
We named the assistant Ditto. And the vision wasn’t just about music. From the same month:
“Building with frameworks lets us be exponentially faster.”
We were thinking about AI as a force multiplier for all software interaction, not just music production. We just started with what we knew.
December 2021: The Architecture
By December, we had the full architecture mapped out.
The key insight, documented in our December 13 notes:
“Design pattern: using model as interface.”
We weren’t building an AI product. We were using AI as the interface layer to our existing music production tools. The model doesn’t create music — it translates your intent into the actions that create music.
This was fundamentally different from how most people thought about AI in 2021. Everyone else saw AI as the product — chatbots, image generators, recommendation engines. We saw it as infrastructure. A translation layer between human intent and software execution.
From the same meeting:
“Executives in big companies understand AI but don’t know how to apply it. New field emerging: the layer on top of the AI researching. We are the layer on top.”
We even had the business model figured out:
“Sell monthly subscription which provides regular feature updates and access to OMNI AI assistant (this pays for calls to GPT-3). Critical revenue stream while rapidly upgrading software. Users will desire the latest and greatest from us.”
$3/month for GPT-3 access. We knew the API wasn’t free, and we wanted to build a sustainable business around it.
How It Would Have Worked
The technical pipeline we designed was straightforward but novel for 2021:
- User speaks: “Make the attack softer”
- Speech-to-text converts to a written command
- GPT-3 translates to structured JSON: {"parameter": "attack", "action": "decrease", "amount": "moderate"}
- Synthesizer receives the equivalent MIDI CC message
- Sound responds in real-time
Natural language → structured data → software action. No menus. No knobs. Just say what you want and the software does it.
We had the full pipeline mapped out. What we didn’t have was the context window or reliability that would come with later models. GPT-3 in 2021 was brilliant but unpredictable — sometimes you’d get perfect JSON, sometimes you’d get a hallucinated novel.
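The last two steps of that pipeline can be sketched in a few lines. This is a minimal illustration, not our original code: the CC mappings and step sizes are assumptions (CC 73 is the General MIDI assignment for attack time; the “amount” scale is invented for the example).

```python
import json

# Hypothetical mapping from synth parameters to MIDI CC numbers.
# CC 73 is the General MIDI attack-time controller; the rest are assumptions.
PARAM_TO_CC = {"attack": 73, "release": 72, "cutoff": 74}

# Step sizes for the model's fuzzy "amount" values (also an assumption).
AMOUNT_STEPS = {"slight": 8, "moderate": 24, "large": 48}

def command_to_cc(raw_json: str, current_values: dict) -> tuple[int, int]:
    """Translate the model's structured output into a (cc_number, cc_value) pair."""
    cmd = json.loads(raw_json)
    cc = PARAM_TO_CC[cmd["parameter"]]
    step = AMOUNT_STEPS[cmd["amount"]]
    delta = step if cmd["action"] == "increase" else -step
    # MIDI CC values are 7-bit: clamp to 0..127.
    new_value = max(0, min(127, current_values[cmd["parameter"]] + delta))
    return cc, new_value

# "Make the attack softer" -> model output -> CC message
reply = '{"parameter": "attack", "action": "decrease", "amount": "moderate"}'
cc, value = command_to_cc(reply, {"attack": 64})
print(cc, value)  # 73 40
```

Everything hard about the product lived outside this function: getting the model to emit that JSON reliably, and doing it fast enough for live sound.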
Here’s what’s wild: this is exactly the same pattern behind every agentic coding tool in 2026. You say “refactor this function,” the AI translates that into structured edits, and the code changes. Same pipeline. Different domain.
The Three Waves
Looking back, there are three distinct waves of AI interfaces. Our 2021 vision didn’t map to the wave most people assume.
Wave 1 — Conversational AI (November 2022): ChatGPT launches and becomes the fastest-growing consumer app in history — 100 million users in two months. Everyone discovers you can talk to AI and get useful answers. It’s brilliant, but it’s still just chat. You ask, it answers. The AI talks to you.
Wave 2 — Copilot AI (2023–2024): GitHub Copilot, Cursor’s autocomplete, ChatGPT plugins. AI sits alongside you, suggesting code, writing drafts, completing your thoughts. The AI works with you.
Wave 3 — Agentic AI (Late 2025–2026): AI that independently plans, executes, and completes multi-step tasks. Claude Code now authors 4% of all GitHub commits. Cursor hit $1B ARR in 24 months — the fastest B2B SaaS ramp in history. Devin operates autonomously in its own sandboxed environment. AI inference costs dropped 92% in three years. The AI works for you.
Our 2021 vision — voice commands translated into structured actions that control software — that’s Wave 3. We weren’t predicting chatbots. We were predicting the agentic layer that took four more years to arrive.
We just had the wrong application. And we were early. Very early.
What We Got Right (And Wrong)
What we got right:
- AI as interface layer, not product — We said “we are the layer on top.” That turned out to be the whole game. Every successful AI company in 2026 is a layer on top.
- Voice/natural language → structured action — “Make the attack softer” → JSON → software action. This is exactly how every AI coding agent works today. “Refactor this function” → AST edits → code changes. Same pattern, different domain.
- The model isn’t the product — We were selling access to the experience, not the model. Today, inference costs have dropped 92% in three years. The model is a commodity. The layer on top is what people pay for.
- Subscription model for API costs — We planned $3/month to cover GPT-3 API costs. Today, every AI SaaS charges for inference. We had the business model right before the market existed.
What we got wrong:
- We thought it would be niche — We were building for music producers. We didn’t see that everyone would want this for everything.
- We were four years early on agentic AI — The models weren’t reliable enough in 2021. GPT-3 couldn’t consistently output structured data at the quality we needed.
- We tried to skip a step — ChatGPT proved that conversational AI was the gateway to agentic AI. We tried to jump straight to AI that takes action without the world first accepting AI that just talks.
Why We Didn’t Ship
We saw it coming. The meeting minutes prove it. November 2021 — voice-controlled music production with GPT-3. Full architecture. Business model. Use cases.
But we didn’t ship. And it’s worth being honest about why.
GPT-3 wasn’t reliable enough. Real-time music production demands millisecond responses, not 2-3 second API calls with inconsistent output. The “simple to implement” note from our meeting was optimistic. GPT-3 was powerful but its outputs were unpredictable — fine for creative writing, not fine for structured JSON that controls hardware.
We were a hardware team stretching too thin. We were simultaneously designing PCBs, filing patents, preparing for CES, and exploring AI integration. From our CES 2022 learnings:
“Manufacturing. We are just a bunch of developers.”
Something had to give. The AI piece was the most speculative, so it kept getting deprioritized.
Life happened. In 2022, I landed my first professional dev job. Between the learning curve of a new career and OmniAura’s hardware challenges, active development went on pause. The brutally honest year-end assessment from December 2021 had already set the tone:
“Have sold nothing, made no money, purchased parts, travel expenses.”
We had clarity about where technology was going. We didn’t have the bandwidth to chase it — yet.
Post-CES, the clarity crystallized:
“No one is thinking like we are. We have a serious opportunity. Time to be serious.”
We had the opportunity. We just needed a few more years to understand what shape it would take.
2022–2023: The Quiet Years (That Weren’t Actually Quiet)
Here’s the part of the story people don’t know about.
While I was heads-down at my first dev job learning the craft, my co-founder Omar never stopped building. He took the Ditto vision from our 2021 meeting notes and started shipping — not as a music assistant, but as a full smart home AI.
By mid-2023, Omar had a working system running on a Raspberry Pi:
- Voice-activated smart home control via Home Assistant — lights, locks, cameras, all controllable through natural language
- An NLP server with custom intent recognition, named entity extraction, and a LangChain memory agent with long-term knowledge graphs
- Wake word detection — you could say “Hey Ditto” and it would listen, just like we’d imagined in 2021
Then in October 2023, Omar built something genuinely ahead of its time: an Image RAG agent that gave Ditto eyes. The system combined a lightweight image captioning model with an LLM — the LLM could ask questions about images, reason over the answers, and respond coherently. He’d essentially built a vision-language model (VLM) by composing smaller models together.
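The composition is simple to sketch: the LLM never sees pixels, only the captioner’s text. This is an illustrative stub, not Omar’s code — both model calls are faked here, and the function names are hypothetical:

```python
# Caption-then-reason composition: a small captioner supplies text,
# and an LLM reasons over that text. Both model calls are stubbed.

def caption_image(image_path: str) -> str:
    # Stand-in for a lightweight image-captioning model.
    return "a person standing at the front door holding a package"

def ask_llm(prompt: str) -> str:
    # Stand-in for the LLM call.
    if "package" in prompt:
        return "It looks like a delivery has arrived at the front door."
    return "I don't see anything notable."

def answer_about_image(image_path: str, question: str) -> str:
    """Compose captioner + LLM: the LLM reasons over the caption, not the image."""
    caption = caption_image(image_path)
    prompt = (
        f"An image is described as: '{caption}'.\n"
        f"Question: {question}\n"
        f"Answer based only on the description."
    )
    return ask_llm(prompt)

print(answer_about_image("front_door.jpg", "What's happening outside?"))
```

The real system swapped those stubs for actual models, but the architecture was exactly this: two cheap components composed into something that behaved like a VLM.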
Here’s the timeline that matters: Omar’s vision server went live on October 29, 2023. OpenAI didn’t release the GPT-4V API until November 6, 2023 — eight days later.
By November 2023, the system had facial recognition for the security cameras. By December, the knowledge graph agent was building Neo4j graphs from conversations in real-time — a memory system that grew as you talked to it.
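The conversation-to-graph step can be sketched as: pull (subject, relation, object) triples from an utterance, then write them to Neo4j with idempotent MERGE statements so repeated mentions don’t duplicate nodes. The extraction here is stubbed (the real system used an LLM for it), and the node label and example triple are assumptions:

```python
# Hedged sketch of turning conversational triples into Neo4j Cypher.

def extract_triples(utterance: str) -> list[tuple[str, str, str]]:
    # Stand-in for LLM-based triple extraction.
    return [("Omar", "WORKS_ON", "Ditto")]

def triple_to_cypher(subj: str, rel: str, obj: str) -> str:
    """Build an idempotent MERGE so re-mentioning an entity reuses its node."""
    return (
        f"MERGE (a:Entity {{name: '{subj}'}}) "
        f"MERGE (b:Entity {{name: '{obj}'}}) "
        f"MERGE (a)-[:{rel}]->(b)"
    )

for s, r, o in extract_triples("Omar is working on Ditto this week"):
    print(triple_to_cypher(s, r, o))
```

Run against every utterance, this is how a memory graph grows as you talk to it.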
The full Ditto Stack — NLP server, vision server, smart home integration, web UI — all served from a local Raspberry Pi. No cloud dependency. No subscription. Just a small computer that understood your home and remembered your conversations.
This was the 2021 vision alive and evolving. Not for synthesizers anymore, but for something bigger: an AI that lives in your home, sees what you see, and actually does things when you ask.
The legacy assistant repo captures what this era looked like: Google TTS/STT, Keras, NLTK, SpaCy, GPT-3/4, HuggingFace models, Home Assistant, Spotify integration, and Teensy-powered LED light strips. All the threads from 2021 — voice control, smart home, lights, AI — woven together on a Pi.
What We’re Building Now
The vision evolved. We’re no longer building voice control for synthesizers. We’re building something that makes more sense in 2026: AI that actually remembers you.
The insight is the same: AI as interface layer. But instead of translating “make it softer” into synthesizer parameters, we’re building AI systems that remember your context, learn your patterns, and coordinate action across your entire digital life.
Ditto — the assistant we named in November 2021 — became something much bigger. An agentic memory system that understands your conversations, knows your projects, and provides the context that makes AI assistants genuinely useful instead of just impressive demos.
And the multi-agent orchestration we’ve been building? That’s the “layer on top” we described in December 2021 — just applied to a much bigger problem than music production.
We saw the interface layer problem in 2021. It took us a few years to understand that the real application wasn’t controlling synthesizers. It was giving AI the memory and autonomy to work for you, not just talk to you.
The Artifacts
If you’re curious, here’s the full timeline — from meeting minutes to GitHub commits:
- July 21, 2021: First mention of GPT-J and neural networks
- November 30, 2021: GPT-3 assistant “Ditto” designed for voice-controlled software
- December 9, 2021: Business model ($3/month subscription to cover API costs)
- December 13, 2021: “Model as interface” design pattern documented
- January 2022: CES — hardware validated, but “we are just a bunch of developers”
- 2022: Development pauses — Peyton starts first dev job, Omar keeps building
- November 30, 2022: ChatGPT launches — conversational AI goes mainstream
- September 2023: NLP server ships — intent recognition, LangChain memory agent, knowledge graphs
- October 29, 2023: Vision server goes live — Image RAG before GPT-4V API (Nov 6)
- November 2023: Facial recognition for security cameras, knowledge graph pipeline
- Late 2025: Agentic AI arrives — Claude Code, Cursor, Devin prove AI can do things, not just talk about them
Four years from our first design to full industry vindication.
We weren’t just early on chatbots. We were early on agents.
We saw the future clearly. We just didn’t see the path it would take to get here.
Related Reading
- Before ChatGPT, There Was Ditto — The full origin story
- Introducing Ditto — What we’re building today
- Understanding AI Memory Systems — How Ditto remembers
- Connect Ditto to Any AI Assistant — MCP integration guide
Sources
- ChatGPT — Wikipedia — Launch timeline and adoption stats
- Eight Trends Defining How Software Gets Built in 2026 — Claude Blog — Claude Code commit statistics
- 2026 Agentic Coding Trends Report — Anthropic — Inference cost reductions, developer adoption
- Best AI Coding Agents for 2026 — Faros AI — Cursor ARR, agentic tool landscape
- AI Agents Arrived in 2025 — TechXplore — The shift from research to everyday tools
- GPT-4V System Card — OpenAI — GPT-4 Vision timeline
- GPT-3 — Wikipedia — 2020-2021 API landscape
- Ditto NLP Server (archived) — LangChain memory agent, knowledge graphs
- Ditto Vision Server (archived) — Image RAG, pre-GPT-4V
- Ditto Legacy Assistant (archived) — Full smart home AI stack
- Ditto Stack (archived) — Docker-based local deployment