Inference Quotas – Matt Herzog

At LiveKit, we make it easier to build voice agents by unifying the model stack behind a single platform. Instead of asking developers to create separate accounts, manage API keys, and pay multiple providers independently, LiveKit handles those provider relationships directly. Developers pay us, we manage the underlying model infrastructure, and the agent framework ties everything together into a seamless voice experience.

That simplification created a new product challenge.

Voice agents rely on several model types working together in parallel: typically an LLM, a speech-to-text model, and a text-to-speech model. Each provider behind those models has its own concurrency constraints, meaning limits on how many agents can connect at the same time. On top of that, LiveKit's pricing model includes concurrency limits by model type, and developers can configure fallbacks across providers. For example, a team might use ElevenLabs as their primary TTS provider and Cartesia as a fallback if the primary provider is unavailable or if usage exceeds the allocated limit.
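To make the fallback behavior concrete, here is a minimal sketch of choosing a provider under per-model concurrency limits. It is an illustration only, not LiveKit's implementation: `ModelQuota`, `pick_provider`, and the example limits are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ModelQuota:
    """Hypothetical record of one model's concurrency allocation."""
    name: str
    limit: int              # maximum concurrent sessions allowed
    in_use: int = 0         # sessions currently active
    available: bool = True  # False if the provider is unreachable


def pick_provider(chain: list[ModelQuota]) -> ModelQuota | None:
    """Return the first model in the chain that is up and under its limit."""
    for model in chain:
        if model.available and model.in_use < model.limit:
            return model
    return None  # every option is down or at capacity


# Assumed numbers: the primary is at its limit, so the fallback is chosen.
primary = ModelQuota("elevenlabs", limit=2, in_use=2)
fallback = ModelQuota("cartesia", limit=5, in_use=1)
chosen = pick_provider([primary, fallback])
```

The ordered chain mirrors the primary-then-fallback setup described above; a real system would also have to update `in_use` atomically as sessions start and end.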

As inference became more central to how customers built on LiveKit, the existing quota experience no longer provided enough visibility or control.

The problem

We needed to help developers answer three core questions:

How should I plan and assign concurrency across the models my application depends on?
How can I forecast what I'll need based on historical usage?
If I'm approaching a limit or something goes wrong, how do I see that quickly?

The existing experience was a simple table listing quota types, current limits, peak usage over the last seven days, and a way to request an increase. That worked at a high level, but it became much less useful as we introduced more providers and more granular concurrency limits at the individual model level.

The old model treated quotas as static configuration. In reality, developers needed a system that supported capacity planning, monitoring, and quick decision-making.

The solution

I redesigned the quota experience around how developers actually reason about inference in production.

Instead of showing quotas as a flat list, the new design breaks each model type into its own section, then lists individual models as rows within that group. This makes it much easier to understand limits in context and compare them against real usage. For each model, developers can immediately see their current limit, their recent peak usage, and a clear path to request an increase.

The redesigned quota experience

When a model's usage approaches or exceeds its limit, the interface makes that immediately visible. Rows highlight at-risk models so developers can act before they hit the limit, and we would also notify them by email.

At-risk model highlighting: the warning state shown when model limits are at risk
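The at-risk state could be derived from a simple comparison of recent peak usage against the assigned limit. The sketch below assumes an 80% warning threshold; that ratio, the function name `quota_status`, and the status labels are illustrative assumptions, not values from the actual product.

```python
def quota_status(peak: int, limit: int, warn_ratio: float = 0.8) -> str:
    """Classify a model row from its recent peak concurrency and its limit.

    warn_ratio is an assumed threshold: a peak at or above this fraction of
    the limit counts as "approaching" the limit.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    if peak > limit:
        return "exceeded"
    if peak >= warn_ratio * limit:
        return "at_risk"
    return "ok"


# Hypothetical rows as (model name, peak over the last seven days, limit).
rows = [("model-a", 7, 10), ("model-b", 9, 10), ("model-c", 12, 10)]
statuses = {name: quota_status(peak, limit) for name, peak, limit in rows}
```

Keeping the threshold as a parameter lets the same check drive both the row highlight and any email notification, each with its own sensitivity if needed.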

To support forecasting, I introduced a time-series graph that shows historical concurrency for each model over the last seven days. This gives developers more than a single peak number. They can now understand whether a spike was a one-off event, part of a repeating trend, or a sign that they are consistently running close to their limit.
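One rough way to express the distinction between a one-off spike, a repeating trend, and consistently running close to the limit is to count how many days in the window had a peak near the limit. This is a sketch under stated assumptions: the daily-peak input, the 80% "near" ratio, and the name `classify_week` are hypothetical, and the chart itself leaves this judgment to the developer.

```python
from __future__ import annotations


def classify_week(daily_peaks: list[int], limit: int, near_ratio: float = 0.8) -> str:
    """Summarize a window of daily peak concurrency relative to a limit."""
    if not daily_peaks:
        raise ValueError("daily_peaks must not be empty")
    # Count the days whose peak came within near_ratio of the limit.
    near_days = sum(1 for peak in daily_peaks if peak >= near_ratio * limit)
    if near_days == 0:
        return "healthy headroom"
    if near_days == 1:
        return "one-off spike"
    if near_days == len(daily_peaks):
        return "consistently near limit"
    return "repeating trend"


# Assumed seven-day series against a limit of 10 concurrent sessions.
spike = classify_week([2, 3, 9, 2, 3, 2, 2], limit=10)
trend = classify_week([9, 2, 9, 2, 9, 2, 9], limit=10)
```

A single peak number cannot separate these cases, which is why the series matters: both example weeks above have the same seven-day peak of 9.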

The most important part of the design was tying the table and chart together through interaction.

Selecting a model row highlights the corresponding line in the chart, making it easy to focus on a single provider or configuration. The chart also visualizes usage against the assigned limit, which helps developers quickly judge risk. Hover states reveal the exact concurrency usage at a given moment in time, allowing teams to inspect when they were actually near capacity versus when they had healthy headroom.

Chart and table interaction

This transformed the experience from a static admin table into a more exploratory operational tool.

Additional workflows

I also considered the case where a team needs capacity for a model they are not currently using. From a secondary menu, developers can request additional concurrency for a specific model or add a custom model request. That workflow includes the desired increase and the reason for the request, making it easier to support both planned growth and more bespoke use cases.

Concurrency increase request dialog

Bonus: Connection tiles

Even though this project focused on inference and models, I also wanted to bring the look and feel of our other connection UI in line with the mindset of developers coming to this experience. I introduced a similar treatment that highlights peaks and potential service disruptions.

Updated connection tile treatment, with peak and disruption indicators

Outcome

This project gives developers a much clearer mental model of how inference capacity works across providers. It helps them plan ahead, monitor real-world usage, and respond faster when they begin to approach their limits.

The project is currently in development and expected to ship soon.
