Under the Hood

Model routing: how we use 15+ LLMs without losing our minds

Tim Jordan · March 16, 2026 · 6 min read

Here’s something that surprised me when we started building this: the difference between the best model and the right model is enormous. Claude Opus is brilliant at complex reasoning, but it’s also wildly expensive for classifying whether a message is a question or a command; a local 30-billion-parameter model running on our GPU does that classification just as well, at zero marginal cost and with full data privacy.

The industry talks about AI models like there’s a leaderboard and you should use whatever’s on top. That’s wrong. What you actually need is a routing system that matches every task to the model best suited for it. So that’s what we built.

The model directory

We maintain a directory of 53 active models. Each one has a detailed profile covering how fast it responds, which provider serves it, where it excels, and where it falls down. The capabilities aren’t vague marketing claims: they’re scored across 8 dimensions on a 0–100 scale, from tool calling to reasoning to creative writing, and those scores come from benchmarks, not intuition. Every model also carries a confidence rating on its scores, because a result from a controlled test suite carries more weight than a result from someone’s blog post.
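To make that concrete, here is a minimal sketch of what such a profile might look like. The field names, the `effective_score` helper, and the example values are illustrative assumptions, not our actual schema:

```python
from dataclasses import dataclass

# Hypothetical model profile; fields and values are illustrative only.
@dataclass
class ModelProfile:
    model_id: str
    provider: str
    capabilities: dict[str, int]   # dimension -> 0-100 benchmark score
    confidence: dict[str, float]   # dimension -> trust in that score (0-1)
    input_cost_per_mtok: float     # USD per million input tokens
    output_cost_per_mtok: float    # USD per million output tokens

def effective_score(profile: ModelProfile, dimension: str) -> float:
    """Weight a raw capability score by how much we trust its source."""
    raw = profile.capabilities.get(dimension, 0)
    trust = profile.confidence.get(dimension, 0.5)
    return raw * trust

opus = ModelProfile(
    model_id="claude-opus-4-6",
    provider="anthropic",
    capabilities={"reasoning": 97, "instruction_following": 95},
    confidence={"reasoning": 0.9, "instruction_following": 0.9},
    input_cost_per_mtok=15.0,
    output_cost_per_mtok=75.0,
)
```

A routing decision then compares `effective_score` (and cost) across candidates instead of comparing vibes.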

This matters because routing decisions based on vibes don’t hold up. “Claude is good at reasoning” is true but useless; “Claude Opus 4.6 scores 97 on reasoning, 95 on instruction following, and costs $15 per million input tokens” is something you can actually route on.

Shortcuts instead of model names

Nobody wants to think about model names when they’re building features, so we created a shortcut system: 25 shortcuts that map to the right model for a given task type. “Smart” gets you a strong general reasoning model. “Fast” gets you a lightweight model for quick classification. “Workhorse” gets you a local model running on our own hardware for zero-cost, fully private processing. The developer never touches a model name; they pick a shortcut that describes the job.

The shortcuts abstract away the specific model behind them. If Anthropic releases a new model tomorrow that outperforms Sonnet on reasoning, we update the “smart” shortcut to point to the new model, and every part of the system that uses “smart” automatically gets the upgrade with no code changes anywhere.

This decoupling was one of the better decisions we made early on. The AI model landscape changes every few weeks. If your codebase is full of hardcoded model names, every improvement in the field means a code migration; if your codebase uses capability-based shortcuts, every improvement is a config change.
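The indirection is simple enough to sketch. The shortcut names come from the post; the model IDs behind them are made-up placeholders, not our real mappings:

```python
# Hypothetical shortcut table; the model IDs are illustrative placeholders.
SHORTCUTS: dict[str, str] = {
    "smart": "claude-sonnet-4-5",     # strong general reasoning
    "fast": "small-flash-model",      # lightweight, quick classification
    "workhorse": "local-30b-model",   # local hardware, zero-cost, private
}

def resolve(model_or_shortcut: str) -> str:
    """Map a shortcut to its current model; pass raw model IDs through."""
    return SHORTCUTS.get(model_or_shortcut, model_or_shortcut)

# Upgrading every "smart" call site is a config change, not a migration:
SHORTCUTS["smart"] = "claude-opus-4-6"
```

Call sites only ever say `resolve("smart")`, so the upgrade above reaches all of them with no code changes.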

One-hop failover

Models go down, providers have outages, and rate limits hit, so we designed for this from the start. Every model in the directory has a backup model and a secondary backup: if the primary fails, the system falls to the backup; if that fails, it falls to the secondary; if that fails, it surfaces a hard error.

We deliberately limited this to one hop. The temptation was to build a deep fallback chain (try this, then this, then this, then this), but cascading fallbacks create unpredictable latency and make it nearly impossible to reason about which model is actually serving your request at any given time. One hop keeps the system predictable.

There’s also a circuit breaker that tracks consecutive failures per provider. If Anthropic’s API fails 3 times in a row, the system opens the circuit and routes around it until the provider recovers, which prevents a failing provider from turning every request into a slow failure.
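The failover and circuit-breaker logic together fit in a small sketch. Class names, the threshold constant, and the call signature are assumptions for illustration, not our implementation:

```python
from collections import defaultdict
from typing import Callable

FAILURE_THRESHOLD = 3  # consecutive failures before a circuit opens

class CircuitBreaker:
    """Tracks consecutive failures per provider (illustrative sketch)."""

    def __init__(self) -> None:
        self.failures: dict[str, int] = defaultdict(int)

    def is_open(self, provider: str) -> bool:
        return self.failures[provider] >= FAILURE_THRESHOLD

    def record(self, provider: str, ok: bool) -> None:
        self.failures[provider] = 0 if ok else self.failures[provider] + 1

def call_with_failover(chain: list[tuple[str, str]],
                       call: Callable[[str], str],
                       breaker: CircuitBreaker) -> str:
    """Try primary, then backup, then secondary; then raise a hard error.

    `chain` is [(model_id, provider), ...] with at most three entries,
    so the fallback depth stays bounded and predictable.
    """
    last_error: Exception | None = None
    for model_id, provider in chain:
        if breaker.is_open(provider):
            continue  # route around a provider whose circuit is open
        try:
            result = call(model_id)
            breaker.record(provider, ok=True)
            return result
        except Exception as err:
            breaker.record(provider, ok=False)
            last_error = err
    raise RuntimeError("all models in the chain failed") from last_error
```

Because the chain is a fixed list rather than a recursive fallback, worst-case latency is bounded by three attempts.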

The three entry points

From the code side, using the routing system is simple because there are three functions: one for streaming responses, where you get chunks as they arrive; one for async complete responses, where you get the full thing when it’s done; and one for synchronous calls that block until complete. All three take the same inputs: a model shortcut or model ID, the messages, and any configuration.

We also locked in a rule early on: nobody in the codebase is allowed to import a provider SDK directly. No `import anthropic`, no `import openai`; every LLM call goes through the routing system. That felt like overkill at first, but it turned out to be critical. When we added a new provider, we didn’t have to hunt through the codebase to find direct API calls, because there were none.
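The surface of those three entry points might look like this. The names, type hints, and stub bodies are illustrative assumptions, not the actual API:

```python
import asyncio
from typing import Any, AsyncIterator

Message = dict[str, str]  # e.g. {"role": "user", "content": "..."}

async def stream(model: str, messages: list[Message],
                 **config: Any) -> AsyncIterator[str]:
    """Yield response chunks as they arrive from the routed model."""
    for chunk in ("stub ", "response"):  # placeholder chunks for the sketch
        yield chunk

async def complete(model: str, messages: list[Message],
                   **config: Any) -> str:
    """Return the full response once the routed model finishes."""
    return "".join([c async for c in stream(model, messages, **config)])

def complete_sync(model: str, messages: list[Message],
                  **config: Any) -> str:
    """Block until the routed model returns the full response."""
    return asyncio.run(complete(model, messages, **config))
```

All three accept a shortcut or a raw model ID in the `model` argument, so call sites never care which concrete model serves the request.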

The cost dimension

Running 53 models across 7 providers gets expensive fast if you’re not paying attention. Every model in the directory has granular pricing data, down to the cost per thousand tokens in and out, and the routing system tracks spending per model, per agent, per tenant.

This isn’t just for billing; it shapes routing decisions. For a task that a $2/million-token model can handle, routing it to a $15/million-token model because it’s “better” is wasteful. And the local models on our GPU server cost nothing beyond electricity, so for privacy-sensitive tasks or high-volume processing, routing to local models is both cheaper and more secure.

We also run free-tier daily limits: some models offer free tiers with rate limits, and the system tracks usage against those limits, switching to paid routing when free capacity is exhausted. That sounds minor, but it adds up to real savings on development and testing workloads.
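The per-request arithmetic is simple but worth seeing. This is a back-of-the-envelope sketch; the function name and the example prices (other than the $15 and $2 per-million figures from the text) are illustrative:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_per_ktok: float, out_per_ktok: float) -> float:
    """Cost of one request, given per-thousand-token prices in USD."""
    return (input_tokens / 1000) * in_per_ktok \
         + (output_tokens / 1000) * out_per_ktok

# $15/Mtok input is $0.015 per thousand input tokens; $2/Mtok is $0.002.
# Output prices here are made-up examples.
premium = request_cost(2000, 500, in_per_ktok=0.015, out_per_ktok=0.075)
budget = request_cost(2000, 500, in_per_ktok=0.002, out_per_ktok=0.008)
```

On this small task the premium model costs roughly 8x the budget one; multiplied across millions of classification calls, that gap is the whole argument for cost-aware routing.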

What I got wrong

I originally thought we’d need 5 or 6 models: a reasoning model, a fast model, a coding model, a local model, maybe a specialized one for search. The actual number is 53 and growing, because each one exists for a specific task profile where it’s the best choice. The embedding model that turns text into vectors is tiny and runs locally. The extraction model that pulls entities from text for the knowledge graph is also local. The reasoning model for complex agent cognition is large and expensive, while the model that handles simple classification is small and cheap. None of these should be the same model, and pretending they could be would either waste money or degrade quality.

Building the routing system felt like over-engineering at the time, but it’s the piece of infrastructure I’m most glad we built early.
