
How we generate 100+ SDKs, and why AI still only comes second


We had 10 APIs and 11 languages to maintain. We built automation. Then LLMs showed up, and "automation" suddenly meant something else entirely.

For years, that word described our pipeline at Algolia: OpenAPI specs go in, client libraries come out, same result every time. Then a different kind arrived. One that doesn't read specs but reads intent; doesn't guarantee output but guesses at it. Same word. Completely different mechanism. We're now running both, stacked on top of each other.

This is about what happens when you let "automation" mean something new.

What automation used to mean

Before anyone had opinions about LLMs, "automation" meant something boring and specific. You write a machine-readable description of an API. You feed that description to a code generator. You get client libraries. Same input, same output, every time.

Algolia is an API-first platform. Search is the core, but we also run recommendations, analytics, personalization, A/B testing, and several other products, each with its own API. Ten of them, at last count. We support 11 programming languages: Java, Kotlin, Dart, Scala, TypeScript, Python, Ruby, PHP, Swift, Go, and C#. On top of that, we maintain framework integrations: Rails for Ruby, Django for Python, Laravel and Symfony for PHP.

No intelligence. No inference. That's what the word meant for a long time, and honestly, it got us further than most people would expect.

How many engineers does it take to change a lightbulb?

Here's the math. Eleven languages times ten APIs, plus four framework integrations, gives you well over 100 packages to maintain. "Maintain" is doing a lot of work in that sentence. Each package has to be built, tested, kept consistent with the others, have its bugs fixed, be released, and be documented. Multiply any small task by 100 and it stops being small.


So how many people does that take?

We tried the obvious answers first. A dedicated team per language: one person owns Python, another owns Java, and so on. Sounds reasonable until you realize each person needs to understand all ten APIs. The cognitive load is enormous. And when no new API features ship for a month, your Ruby developer has nothing to do. People are expensive to have sitting idle.

We tried the inverse. Embed an engineer in each API team and let them update the clients directly. Now the cognitive load flips: API engineers know their features, not the quirks of eleven different languages. Context switching breeds bugs. Nobody becomes an expert in any client. And who owns cross-cutting concerns like retry logic?

Both approaches collapse under the same pressure: the combinatorics are too large for any staffing model that puts humans in individual cells of the matrix.

The answer is three. Three engineers maintain all 100+ packages. But they don't write client code by hand. They maintain the pipeline that generates it. The pipeline is the punchline.

Those three engineers inherited a system built by the people before them and the open-source tools it depends on. You don't rewrite foundations that work. The most important one was waiting for us.

Enter OpenAPI

OpenAPI is the most widely used API description standard in the world. You write YAML or JSON blueprints that describe a RESTful API: endpoints, parameters, authentication, schemas, responses. A machine-readable contract for what your API accepts and returns.

We wrote one spec per API, ten in total. A lot of manual work: introspecting every endpoint, writing out every parameter and response type. The OpenAPI standard made it tractable, if time-consuming. Once those specs existed, they became the single source of truth for everything that followed.
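To make the shape of those specs concrete, here is a minimal OpenAPI 3 fragment for a search-style endpoint. The endpoint path and schema are illustrative, not copied from Algolia's actual specs:

```yaml
# Hypothetical OpenAPI 3 fragment; names and schemas are illustrative.
openapi: 3.0.3
info:
  title: Search API
  version: 1.0.0
paths:
  /1/indexes/{indexName}/query:
    post:
      operationId: search
      parameters:
        - name: indexName
          in: path
          required: true
          schema:
            type: string
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                query:
                  type: string
      responses:
        '200':
          description: Search results
          content:
            application/json:
              schema:
                type: object
                properties:
                  hits:
                    type: array
                    items:
                      type: object
```

Everything downstream, from generated clients to generated tests, is derived from files like this one.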

The pipeline has layers. Our custom CLI, built on Docker so the dev environment is identical everywhere, validates and lints each spec against the OpenAPI standard. Then it compiles our code generators, written in Java, which extend the base generators that OpenAPI provides. Those base generators are deliberately generic; you're expected to add your own specifics. We did. On top of the generators sit Mustache templates, a second overlay that handles language-level details like casing conventions and parameter naming.

We write code that writes code. That phrase sounds cute until you've spent a month debugging a Mustache template at two in the morning, but the result is worth it. The generated client code has full documentation comments. Every method matches what the actual API expects. Find a bug in the generator, fix it once, fix it everywhere. That kind of leverage is hard to give up once you've had it.
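The idea of template-driven generation can be sketched in a few lines. The real pipeline uses Java generators and Mustache templates; this toy uses Python's stdlib `string.Template` as a stand-in, and the method name, parameters, and path are illustrative:

```python
from string import Template

# Toy sketch of "code that writes code": spec-derived values poured into
# a template yield a client method. Deterministic: same spec in, same
# code out. The method body (locals()) is a placeholder, not real
# transport logic.
method_template = Template('''\
def $method_name(self, $params) -> dict:
    """$summary"""
    return self._transport.request("$http_method", "$path", locals())
''')

spec_derived = {
    "method_name": "search",
    "params": "query: str",
    "summary": "Search an index.",
    "http_method": "POST",
    "path": "/1/indexes/{indexName}/query",
}

generated = method_template.substitute(spec_derived)
print(generated)
```

A real Mustache overlay adds loops, partials, and per-language casing rules on top of this basic substitution, but the leverage is the same: fix the template once, and every regenerated client picks up the fix.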

There's a step people don't think about: formatting. Normally your IDE handles whitespace and style automatically. Generated code doesn't have that luxury. Without a formatting pass, you can't meaningfully diff between two generations, and your changelogs become noise.
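A small illustration of why that pass matters. Canonicalizing both sides of a diff (here via Python's `ast.unparse`; the real pipeline runs each language's own formatter) makes whitespace-only differences disappear, so remaining diffs reflect real changes. The snippets being compared are made up:

```python
import ast

# Two generator runs emitting semantically identical code with different
# whitespace. A raw textual diff would be noise; a formatting pass makes
# the comparison honest.
run_a = "def search(query):\n    return transport.post('/query', query)\n"
run_b = "def search( query ):\n  return transport.post( '/query',query )\n"

def canonical(src: str) -> str:
    # Parse to an AST and unparse back: a crude but deterministic
    # canonical form (Python 3.9+).
    return ast.unparse(ast.parse(src))

print(run_a == run_b)                        # textually different
print(canonical(run_a) == canonical(run_b))  # semantically identical
```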

Tests follow the same model. We wrote JSON specifications for our tests, built generators specifically for test code, and now generate tests the same way we generate clients. Tests are code too; there's no reason the pipeline should stop at production source files. We run close to a thousand tests across all languages. All passing. Some unit, some end-to-end hitting live Algolia APIs.
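A test specification might look something like the following. This is a hypothetical shape to show the idea, not Algolia's actual test schema:

```json
{
  "testName": "search with a simple query",
  "method": "search",
  "parameters": {
    "indexName": "products",
    "query": "phone"
  },
  "request": {
    "path": "/1/indexes/products/query",
    "method": "POST",
    "body": { "query": "phone" }
  }
}
```

One JSON file like this fans out into eleven language-specific test files, each asserting that the generated client produces exactly that request.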

The release step generates documentation that auto-updates when anything changes on the API side. It generates code samples, the little snippets you see in docs pages and dashboards. Then it publishes to each language's package manager: npm, RubyGems, PyPI, Maven, CocoaPods, and the rest. Some languages need multiple builds; Kotlin and Swift each ship for multiple platforms.

The whole pipeline runs in about twenty minutes. Spec change to published packages: twenty minutes. 

Everything is open source: github.com/algolia/api-clients-automation.

The 80/20 line

Here's the part I think is actually the most interesting, even though it sounds like an admission of defeat: about 80% of our client code is generated. The other 20% is written by hand.

That 20% exists because some things can't be derived from a spec. Python's async model, Java's generic types, our retry strategy: design decisions shaped by how each language's ecosystem actually works. No spec tells you that.

You could look at that split and think "why not push to 95/5?" We thought about it. The complexity cost is steep. The deeper you push automation into language-specific territory, the more your generators need to know about each language's idioms, concurrency model, and type system. The generated code gets worse, not better. The templates become unreadable. You trade clean hand-written code for ugly generated code that nobody can debug.

Knowing where to stop automating is harder than knowing where to start. We found our line at 80/20 and stayed there. Deliberately.

There's no point in trying to automate everything. You have to assume some part will always need a human. At least for now. I said those words in a talk in late 2025. That phrase is aging in interesting ways. I still stand by it, even if the word "automation" keeps shifting what "for now" means.

Automation, redefined

Then LLMs happened, and the word "automation" developed a second meaning overnight. Our kind reads a spec and produces code deterministically. The new kind reads a sentence in English and produces code probabilistically. They sound like the same thing. They're not.

Classical automation is a factory. AI automation is an intern. The factory can't improvise. The intern can't be trusted unsupervised. The question is: what can the intern do that the factory can't?

An assistant

Here's a problem we didn't anticipate. Developers started asking LLMs for Algolia code, and the LLMs answered confidently with code that was wrong. Not always obviously wrong. Wrong in subtle ways: a deprecated method name here, a parameter that used to be optional but isn't anymore there. The LLMs were trained on old documentation, Stack Overflow threads from three years ago, blog posts written against previous major versions. A developer asks "how do I search with facets in Python" and gets back something that looks plausible, compiles, and does the wrong thing at runtime.

The fix is simpler than you'd think. MCP, the Model Context Protocol, is an open standard (originated at Anthropic) that lets LLMs call external tools during a conversation. We built an Algolia MCP server. When a developer asks an LLM for Algolia code, the LLM calls our server, which generates a snippet from the same specs and templates that our pipeline uses. The LLM doesn't write the code. It asks our generator to write it. Then it adapts the snippet to the developer's context, wraps it in an explanation, and delivers it.
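On the wire, an MCP server advertises its tools as JSON with a name, a description, and a JSON Schema for inputs. The field names here follow the MCP spec; the tool itself is a hypothetical sketch of what a snippet-generation tool could look like, not our server's actual interface:

```json
{
  "name": "generate_code_snippet",
  "description": "Generate an up-to-date Algolia code snippet from the current OpenAPI specs.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "language": {
        "type": "string",
        "description": "Target language, e.g. python"
      },
      "operation": {
        "type": "string",
        "description": "API operation, e.g. search"
      }
    },
    "required": ["language", "operation"]
  }
}
```

The LLM sees this description, decides when the tool is relevant, and calls it with structured arguments; the snippet that comes back is template-generated, not hallucinated.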

This works with any MCP-compatible tool: Claude, Cursor, GitHub Copilot, Zed, and others. We deliberately chose not to fine-tune a model. Fine-tuning is expensive, drifts out of date between retraining cycles, and can still hallucinate. Template-based generation from a spec can't.

At this level, AI has zero autonomy. It's a delivery mechanism. Classical automation does all the real work; the LLM is a friendlier interface to our existing infrastructure. Useful, but contained.

An agent

Give the same system a bit more rope and things get interesting. We're building this part now. Not shipped yet, but the architecture makes it realistic.

Triage is the first obvious application. A bug report arrives on GitHub or through a support ticket. An LLM reads the report, checks the described behavior against the OpenAPI spec, and classifies it: client bug, API bug, or misunderstanding? It routes the issue to the right team and drafts a first response. Most of this currently eats an engineer's morning.

Resolution goes further. If a customer says "method X returns the wrong type in the Java client," the agent can check the spec for method X, check what the Java generator produces for that method, identify the discrepancy, modify the spec or the generator template, regenerate, run the generated tests, and open a PR with the proposed fix.

The reason this isn't hand-waving is how our architecture works. The changes are YAML diffs in spec files, not hand-edits scattered across eleven codebases. Regeneration is deterministic, so you can verify that a fix does what you think it does just by regenerating and diffing. The tests are themselves generated, so running them after a spec change tells you whether you broke something.
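The verification step can be sketched in miniature. Because generation is a pure function of the spec, diffing two generations shows exactly the intended fix and nothing else. The `generate` function and spec shape below are illustrative:

```python
import difflib

# Sketch: generation as a pure function of the spec. A spec-level fix is
# verifiable by regenerating and diffing; the diff is exactly the change
# you intended, across every file derived from that spec.
def generate(spec: dict) -> str:
    lines = [f"def {m['name']}(self) -> {m['returns']}: ..."
             for m in spec["methods"]]
    return "\n".join(lines)

spec_before = {"methods": [{"name": "search", "returns": "str"}]}   # reported bug
spec_after  = {"methods": [{"name": "search", "returns": "dict"}]}  # proposed fix

diff = list(difflib.unified_diff(
    generate(spec_before).splitlines(),
    generate(spec_after).splitlines(),
    lineterm=""))
print("\n".join(diff))
```

In the real pipeline the diff spans generated files in eleven languages, but the property is the same: a reviewer can confirm the agent's fix by regenerating, not by trusting it.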

Everything the agent produces is tentative, and that word matters more than any other in this article. The agent proposes. It does not merge. It opens a pull request, writes a description of what it found and what it changed, and waits. A human reviews the diff, checks that the spec change is intentional, regenerates to confirm, and merges or rejects. A coworker who does research, writes a first draft, and puts it on your desk. Not a coworker who ships to production while you're at lunch.

A replacement?

No.

The architecture is the clearest proof of why not. The pipeline is deterministic. That's the entire value proposition. Replace it with an LLM and you trade guaranteed correctness for probabilistic correctness. That's a downgrade. Nobody wants to wonder whether this week's Python release is subtly different from last week's because the model was feeling creative.

What AI can replace is the manual work that surrounds the pipeline. Triaging a bug report. Investigating whether a reported behavior matches the spec. Drafting a first-pass fix. Writing the PR description. Delivering the right snippet to a developer who's stuck. All of that is overhead. Important overhead, but overhead.

The pipeline produces correct code. AI helps us find problems faster, propose fixes sooner, get the correct code into developers' hands more efficiently. The engineering stays deterministic. The coordination around it is where probability earns its keep.

If AI isn't here to replace the pipeline, what is it here to do?

The trust gradient

Not all automation deserves the same level of trust. A formatting pass is deterministic; let it run unattended. A generated test is deterministic; add it to CI and forget about it. An MCP snippet is almost deterministic: the code itself comes from the same templates as your clients, but an LLM chose when to surface it and how to frame it. Triage involves real classification judgment. The LLM might get it wrong. PR drafting involves proposing code changes. The LLM will sometimes propose the wrong change.

I think about this as a gradient. On the left: full trust, no review, deploy automatically. On the right: limited trust, human review mandatory, every output treated as a draft. Most of the interesting work happens in the middle, and the position depends on what happens when the automation is wrong.

Two symmetric mistakes to avoid. The first: trusting AI the way you trust a linter. It's not deterministic. It doesn't follow fixed rules. Accepting its output uncritically will burn you, usually at the worst possible time. The second: refusing to use AI because it's not a linter. Insisting on deterministic guarantees for every task means rejecting tools that are genuinely useful where determinism isn't available.

The right question isn't "can I trust this AI?" It's "what happens when it's wrong, and how fast do I find out?" A misclassified bug report costs you fifteen minutes of re-routing. Cheap. Let the AI triage. A bad spec change ships a broken client to 100+ packages. Catastrophic. Keep the human in the loop. The consequence of failure, not the probability of failure, determines where the line goes.

Why the needle won't move on the 80/20 line

You might expect AI to push our 80/20 split to 90/10 or 95/5. I don't think it will.

The 20% that's hand-written requires understanding you can't derive from a specification. Python's async/await model isn't a translation problem; it's a design problem. You have to know that Python developers expect a synchronous client and an asynchronous client as separate classes, that the async client should use aiohttp and not requests, that certain methods need different signatures in async context. A spec tells you what the API accepts. It doesn't tell you how a language's community expects to interact with it.
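The sync/async split can be sketched as two classes exposing the same surface. All names here are illustrative, not Algolia's actual API; a real async client would sit on aiohttp, so this sketch injects a fake transport to stay stdlib-only:

```python
import asyncio

# Hypothetical sketch of the hand-written 20%: the same generated method
# surface exposed as separate sync and async clients, a design decision
# no spec encodes.

class SearchClient:
    """Blocking client: what Python users expect by default."""
    def __init__(self, transport):
        self._transport = transport

    def search(self, query: str) -> dict:
        return self._transport.request(
            "POST", "/1/indexes/idx/query", {"query": query})

class AsyncSearchClient:
    """Same signature, but awaitable."""
    def __init__(self, transport):
        self._transport = transport

    async def search(self, query: str) -> dict:
        return await self._transport.request(
            "POST", "/1/indexes/idx/query", {"query": query})

# Fake transports standing in for real HTTP layers (requests / aiohttp):
class FakeTransport:
    def request(self, method, path, body):
        return {"hits": [], "query": body["query"]}

class FakeAsyncTransport:
    async def request(self, method, path, body):
        return {"hits": [], "query": body["query"]}

print(SearchClient(FakeTransport()).search("phone"))
print(asyncio.run(AsyncSearchClient(FakeAsyncTransport()).search("phone")))
```

The generator can emit the request-building logic in both classes; deciding that there should be two classes at all, and what each signature owes its callers, is the part a human writes.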

The same goes for retry logic, generic type parameters, framework-specific adapters. These are decisions, not translations. An LLM can help you make them faster: starting with a 70%-right draft of an async Python client is genuinely better than starting with a blank file. But someone still has to know that the draft is 70% right and which 30% to fix.

The 80/20 split was never a failure to automate further. It was a recognition of where the nature of the work changes. AI won't move that boundary. It will change the cost of the work on both sides. Generated code gets generated faster: better tooling, faster generators. Hand-written code gets written faster: AI-assisted drafting, better autocomplete. The ratio stays. The throughput improves. The same three people cover more ground.

The crescendo

Here's what I keep coming back to.

An LLM operating on nothing, given a vague prompt and no context, produces mediocre output. Everybody's seen this. "Write me an API client in Python" gets you something that looks like a client and works like a demo. An LLM operating on top of a verified, structured spec produces something materially better, because the spec constrains what the LLM can get wrong. The foundation limits the chaos. Probabilistic automation is stronger when it has a deterministic base to stand on.

Zoom out one level. A human decides the architecture. A human draws the 80/20 line. A human reviews the pull request and decides whether the proposed spec change actually fixes the bug. AI handles the volume: scanning a hundred GitHub issues overnight, drafting responses, proposing spec patches, regenerating client code to test them. The human provides judgment. The AI provides throughput. Neither is sufficient alone.

Zoom out one more level and you see what we're building: a second automation pipeline on top of a first one. The first is classical. Specs feed generators, generators produce client code, client code gets tested and released. Deterministic, verified, boring in the best way. The second will be AI-driven. Reading bug reports, checking them against specs, proposing fixes, opening tentative PRs. Probabilistic, reviewed, useful in a way that would have sounded like science fiction to the team that wrote the first spec file in YAML five years ago.

The full stack: specs, then generators, then code, then AI triage, then AI fix, then tentative PR, then human review. Each layer trusts the one below it. The AI trusts the specs because the specs are the source of truth. The human trusts the AI because the AI's work is reviewable. And the code that ships is still deterministic, still generated from the same pipeline, still correct by construction.

Three engineers, eleven languages, ten APIs, 100+ packages. That hasn't changed. What those three engineers can reach for has. But only because we were willing to let "automation" mean something new.

You can also watch the presentation I gave on this topic at last year's DevBit.
