Infrastructure · January 2, 2026 · 8 min read

Building Reliable Tool Execution at Scale

When we set out to build Aphelion, we knew that reliability would be the foundation everything else depends on. An AI agent that fails to execute a payment or send a message at a critical moment isn't just inconvenient—it breaks trust with end users and creates real business problems.

Today, our execution layer processes millions of tool calls daily with a 99.9% success rate. This post walks through the key architectural decisions that got us there.

The Challenge

Tool execution in the real world is messy. APIs go down. Rate limits get hit. Network partitions happen. OAuth tokens expire mid-request. Each of the 50+ providers we integrate with has different failure modes, different retry semantics, and different ways of communicating errors.

Our job is to abstract all of this complexity away from developers while still giving them control when they need it.

Retry Logic That Actually Works

The naive approach to retries—exponential backoff with jitter—works fine for simple cases. But tool execution requires more nuance. Some failures are retryable (network timeouts, 503 errors). Others aren't (invalid parameters, insufficient permissions). And some require human intervention (expired OAuth tokens for user-scoped access).

We built a classification system that categorizes every possible failure mode for each provider. When an execution fails, the system determines: Is this retryable? If so, what's the optimal retry strategy? Should we notify the agent? Should we pause and wait for user action?
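To make that concrete, here's a minimal sketch of the idea. The failure classes, the provider rules, and the decide() helper below are illustrative stand-ins, not our actual classification engine:

```python
from dataclasses import dataclass
from enum import Enum, auto


class FailureClass(Enum):
    RETRYABLE = auto()           # transient: timeouts, 503s, rate limits
    NON_RETRYABLE = auto()       # permanent: bad parameters, missing permissions
    NEEDS_USER_ACTION = auto()   # e.g. an expired user-scoped OAuth token


@dataclass
class RetryDecision:
    should_retry: bool
    backoff_seconds: float = 0.0
    notify_agent: bool = False
    pause_for_user: bool = False


# Illustrative per-provider rules keyed by (provider, error_code).
CLASSIFICATION_RULES = {
    ("stripe", "rate_limit"): FailureClass.RETRYABLE,
    ("stripe", "invalid_request"): FailureClass.NON_RETRYABLE,
    ("slack", "token_expired"): FailureClass.NEEDS_USER_ACTION,
}


def decide(provider: str, error_code: str, attempt: int) -> RetryDecision:
    """Classify a failure and pick the retry strategy that goes with it."""
    failure = CLASSIFICATION_RULES.get((provider, error_code), FailureClass.NON_RETRYABLE)
    if failure is FailureClass.RETRYABLE:
        # Exponential backoff with a cap; jitter omitted for brevity.
        return RetryDecision(should_retry=True, backoff_seconds=min(2 ** attempt, 60))
    if failure is FailureClass.NEEDS_USER_ACTION:
        return RetryDecision(should_retry=False, notify_agent=True, pause_for_user=True)
    return RetryDecision(should_retry=False, notify_agent=True)
```

The important property is that the retry strategy is a function of the classification, not of the raw error string, so supporting a new provider mostly means adding rules rather than new retry code.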

This classification is continuously refined based on real failure data. When we see a new error pattern, we analyze it and update the classification rules.

Circuit Breakers

When a provider starts failing, you don't want to keep hammering it with requests. That just makes things worse for everyone—the provider, your system, and your users who are waiting for responses that won't come.

We implement circuit breakers at multiple levels: per-provider, per-tool, and per-account. If Stripe's API starts returning 500 errors, we'll stop sending traffic to Stripe (but keep other providers running). If a specific tool is misconfigured and failing, we'll isolate just that tool. If a particular account hits rate limits, we'll pause just their requests.
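As a rough sketch, the breakers can simply be keyed by scope, so the provider-level, tool-level, and account-level breakers all have to agree before a request goes out. The thresholds and key formats here are illustrative, not our production values:

```python
import time
from collections import defaultdict


class CircuitBreaker:
    """Opens after too many recent failures; half-opens after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let a probe request through once the cooldown has passed.
        return time.monotonic() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


# One breaker per scope: provider, tool, and account are tracked independently.
breakers = defaultdict(CircuitBreaker)


def can_execute(provider: str, tool: str, account_id: str) -> bool:
    return all(
        breakers[key].allow_request()
        for key in (f"provider:{provider}", f"tool:{tool}", f"account:{account_id}")
    )
```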

The circuit breaker state is shared across our distributed infrastructure using a consensus protocol. When one node detects a failure pattern, all nodes learn about it within milliseconds.
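We won't go into the consensus protocol here, but the effect is roughly what this sketch shows: a node that opens a breaker publishes an event, and every other node applies it to its local breaker table (the breakers dict from the previous sketch). Redis pub/sub appears below purely as a hypothetical stand-in for that transport; it is not what we run:

```python
import json
import time

import redis  # hypothetical transport; the real system uses a consensus protocol

r = redis.Redis()
CHANNEL = "breaker-events"  # hypothetical channel name


def broadcast_open(scope_key: str) -> None:
    """Announce to every node that a breaker has opened for this scope."""
    r.publish(CHANNEL, json.dumps({"key": scope_key, "state": "open"}))


def apply_remote_updates(breakers) -> None:
    """Mirror remote breaker transitions into the local breaker table."""
    pubsub = r.pubsub()
    pubsub.subscribe(CHANNEL)
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        event = json.loads(message["data"])
        breaker = breakers[event["key"]]
        breaker.opened_at = time.monotonic() if event["state"] == "open" else None
```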

Graceful Degradation

Not all tool executions are equally critical. A Slack notification that fails to send can often be retried later or skipped entirely. A payment that fails to process needs immediate attention and possibly human intervention.

We let developers specify criticality levels for their tool calls. Critical executions get more aggressive retries, faster failover to backup providers (where available), and immediate alerting on failure. Non-critical executions get queued for later retry and don't block the agent's workflow.
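Conceptually, that comes down to a criticality hint on each call that maps to an execution policy. The levels and policy values below are illustrative, not our real defaults:

```python
from dataclasses import dataclass
from enum import Enum


class Criticality(Enum):
    CRITICAL = "critical"
    NORMAL = "normal"
    BEST_EFFORT = "best_effort"


@dataclass
class ExecutionPolicy:
    max_retries: int
    alert_on_failure: bool
    blocks_agent: bool              # does the agent wait for this call to finish?
    allow_provider_failover: bool   # fail over to a backup provider where available


# Illustrative mapping from criticality to how the execution layer behaves.
POLICIES = {
    Criticality.CRITICAL: ExecutionPolicy(
        max_retries=8, alert_on_failure=True, blocks_agent=True, allow_provider_failover=True
    ),
    Criticality.NORMAL: ExecutionPolicy(
        max_retries=3, alert_on_failure=False, blocks_agent=True, allow_provider_failover=False
    ),
    Criticality.BEST_EFFORT: ExecutionPolicy(
        max_retries=1, alert_on_failure=False, blocks_agent=False, allow_provider_failover=False
    ),
}


def policy_for(call_criticality: Criticality) -> ExecutionPolicy:
    return POLICIES[call_criticality]
```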

Observability

You can't fix what you can't see. Every tool execution generates detailed telemetry: latency percentiles, error rates, retry counts, queue depths. This data flows into our monitoring system in real time.
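The shape of that telemetry is roughly one structured record per execution. The field names and the print()-based sink below are illustrative:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExecutionTelemetry:
    """One record per tool execution, emitted to the monitoring pipeline."""
    execution_id: str
    provider: str
    tool: str
    node: str
    latency_ms: float
    retries: int
    queue_depth_at_dispatch: int
    outcome: str  # "success", "retryable_failure", "permanent_failure", ...


def emit(record: ExecutionTelemetry) -> None:
    # Stand-in sink: in production this would stream to a metrics backend.
    print(json.dumps(asdict(record)))


emit(ExecutionTelemetry(
    execution_id="exec_123",
    provider="stripe",
    tool="create_payment_intent",
    node="worker-7",
    latency_ms=412.6,
    retries=1,
    queue_depth_at_dispatch=3,
    outcome="success",
))
```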

When something goes wrong, we can trace the exact path a request took through our system. Which node handled it? How long did each step take? What was the response from the provider? This makes debugging production issues tractable instead of a guessing game.
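Under the hood this is standard distributed tracing: a parent span per execution with child spans for each step. The sketch below uses OpenTelemetry-style spans and stubbed helpers purely for illustration; it isn't our actual instrumentation:

```python
from opentelemetry import trace

tracer = trace.get_tracer("aphelion.execution")  # illustrative instrumentation name


def resolve_credentials(provider: str) -> dict:
    return {"token": "..."}  # stub standing in for the credential store


def call_provider(provider: str, tool: str, args: dict, creds: dict):
    return 200, {"ok": True}  # stub standing in for the outbound provider call


def execute_tool(execution_id: str, provider: str, tool: str, args: dict):
    # Parent span per execution; child spans for each step the request takes.
    with tracer.start_as_current_span("tool.execute") as span:
        span.set_attribute("execution.id", execution_id)
        span.set_attribute("provider", provider)
        span.set_attribute("tool", tool)

        with tracer.start_as_current_span("auth.resolve_credentials"):
            creds = resolve_credentials(provider)

        with tracer.start_as_current_span("provider.request") as provider_span:
            status, body = call_provider(provider, tool, args, creds)
            provider_span.set_attribute("http.status_code", status)

        return body
```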

What's Next

We're working on predictive reliability—using historical data to anticipate failures before they happen. If we know that a provider tends to have elevated error rates on Monday mornings, we can proactively adjust our retry strategies and notify developers who depend heavily on that provider.
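This is still exploratory, but the general shape might be a lookup that widens retry backoff when a provider's historical error rate for the current time window is elevated. Everything in this sketch, including the provider name and thresholds, is hypothetical:

```python
from datetime import datetime, timezone

# Hypothetical historical error rates, keyed by (provider, weekday, hour) in UTC.
HISTORICAL_ERROR_RATE = {
    ("acme_crm", 0, 9): 0.12,   # Monday, 09:00 UTC: elevated
    ("acme_crm", 2, 14): 0.01,  # Wednesday, 14:00 UTC: normal
}

ELEVATED_THRESHOLD = 0.05


def backoff_multiplier(provider: str, now: datetime | None = None) -> float:
    """Widen retry backoff when we expect the provider to be struggling."""
    now = now or datetime.now(timezone.utc)
    expected = HISTORICAL_ERROR_RATE.get((provider, now.weekday(), now.hour), 0.0)
    return 4.0 if expected > ELEVATED_THRESHOLD else 1.0


# Monday 09:30 UTC falls in the elevated window, so retries back off 4x longer.
print(backoff_multiplier("acme_crm", datetime(2026, 1, 5, 9, 30, tzinfo=timezone.utc)))
```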

We're also exploring multi-provider redundancy, where tools with equivalent functionality across different providers can automatically fail over. If your email sending fails through SendGrid, we could route through Postmark instead—transparently and instantly.
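In rough terms, that's an ordered chain of adapters with equivalent functionality. The adapter functions below are hypothetical stubs, not real SendGrid or Postmark integrations:

```python
from typing import Callable


def send_via_sendgrid(message: dict) -> bool:
    raise ConnectionError("simulated outage")  # stub for the real adapter


def send_via_postmark(message: dict) -> bool:
    return True  # stub for the real adapter


# Ordered failover chain for the "send_email" capability.
EMAIL_PROVIDERS: list[tuple[str, Callable[[dict], bool]]] = [
    ("sendgrid", send_via_sendgrid),
    ("postmark", send_via_postmark),
]


def send_email_with_failover(message: dict) -> str:
    """Try each equivalent provider in order; return the one that succeeded."""
    last_error = None
    for name, adapter in EMAIL_PROVIDERS:
        try:
            if adapter(message):
                return name
        except Exception as exc:  # in practice: only failures classified as retryable
            last_error = exc
    raise RuntimeError("all email providers failed") from last_error


print(send_email_with_failover({"to": "user@example.com", "subject": "hi"}))  # -> postmark
```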

Reliability isn't a feature you ship once and forget. It's an ongoing practice of measurement, improvement, and adaptation. We're committed to making Aphelion the most reliable execution layer for AI agents.
