LLM Router
The LLM Router is PRX’s model selection engine: 2,808 lines of Rust that decide which provider and model handles each incoming request. It weighs quality, cost, latency, and capability to choose a model for each request in real time.
Routing Flow

    Incoming Request
      │
      ├─ 1. Intent Classification
      │     Categorize the request (code, chat, analysis, translation, etc.)
      │
      ├─ 2. Model Selection (Scorer)
      │     Score all candidate models and rank them
      │
      ├─ 3. Reliability Fallback
      │     If selected model is unavailable, fall through the chain
      │
      ├─ 4. Automix
      │     Start with a cheaper model; upgrade if confidence is low
      │
      └─ 5. Record Outcome
            Log result for Elo updates and future routing decisions

Scoring Formula
Each candidate model receives a composite score:

    score = alpha * similarity + beta * capability + gamma * elo - delta * cost - epsilon * latency

| Factor | Weight | Source |
|---|---|---|
| `similarity` | `alpha` | KNN semantic distance between request and model’s best-performing past requests |
| `capability` | `beta` | Static capability matrix (coding, math, reasoning, multilingual, vision, etc.) |
| `elo` | `gamma` | Dynamic Elo rating updated after each completed request |
| `cost` | `delta` | Per-token price (input + output) |
| `latency` | `epsilon` | Rolling average response time for this model |

Weights are configurable and can be tuned per channel or per user to prioritize quality over cost or vice versa.
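The scoring formula can be sketched in a few lines of Rust. This is an illustration, not the Router's actual code: the `Weights` and `Factors` types and the `rank` helper are hypothetical names, and the sketch assumes every factor (including Elo, cost, and latency) has already been normalized to the range 0 to 1, which the formula above does not specify.

```rust
/// Weights from the scoring formula (field names follow the formula).
#[derive(Debug, Clone, Copy)]
pub struct Weights {
    pub alpha: f64,   // semantic similarity
    pub beta: f64,    // capability match
    pub gamma: f64,   // Elo rating
    pub delta: f64,   // cost penalty
    pub epsilon: f64, // latency penalty
}

/// Per-model inputs. Assumption: each value is pre-normalized to [0, 1].
#[derive(Debug, Clone, Copy)]
pub struct Factors {
    pub similarity: f64,
    pub capability: f64,
    pub elo: f64,
    pub cost: f64,
    pub latency: f64,
}

/// score = alpha*similarity + beta*capability + gamma*elo - delta*cost - epsilon*latency
pub fn composite_score(w: &Weights, f: &Factors) -> f64 {
    w.alpha * f.similarity + w.beta * f.capability + w.gamma * f.elo
        - w.delta * f.cost
        - w.epsilon * f.latency
}

/// Scores every candidate and returns (name, score) pairs sorted best-first.
pub fn rank<'a>(w: &Weights, candidates: &[(&'a str, Factors)]) -> Vec<(&'a str, f64)> {
    let mut scored: Vec<(&'a str, f64)> = candidates
        .iter()
        .map(|(name, f)| (*name, composite_score(w, f)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```

Raising `delta` relative to `beta` in this sketch shifts the ranking toward cheaper models, which is the kind of per-channel or per-user tuning described above.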
Components
Intent Classification
The classifier maps each request to one or more intent categories:

| Intent | Description | Preferred Capabilities |
|---|---|---|
| `code` | Write, debug, or review code | Strong coding benchmarks |
| `chat` | Casual conversation | Low latency, cheap |
| `analysis` | Data analysis, complex reasoning | High reasoning capability |
| `translation` | Language translation | Multilingual strength |
| `vision` | Image understanding | Vision model required |
| `math` | Mathematical problem solving | Math/reasoning benchmarks |
| `creative` | Writing, brainstorming | Creative fluency |
| `tool_use` | Agentic workflows with tool calls | Native tool calling, instruction following |

Classification is fast: it relies on keyword heuristics and makes a lightweight model call only when the heuristics are ambiguous.
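A minimal sketch of the keyword stage, assuming a small hand-written keyword table. The `Intent` enum mirrors the table above, but the `KEYWORDS` entries and the `classify_by_keywords` function are hypothetical; the real heuristics are not documented here. Returning `None` marks the ambiguous case where a lightweight model call would take over.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Intent {
    Code,
    Chat,
    Analysis,
    Translation,
    Vision,
    Math,
    Creative,
    ToolUse,
}

/// Hypothetical keyword table, for illustration only.
const KEYWORDS: &[(Intent, &[&str])] = &[
    (Intent::Code, &["function", "compile", "bug", "refactor", "stack trace"]),
    (Intent::Translation, &["translate", "in french", "in spanish"]),
    (Intent::Math, &["integral", "equation", "prove", "derivative"]),
    (Intent::Vision, &["image", "screenshot", "photo"]),
];

/// Returns `Some(intent)` when exactly one category matches.
/// Returns `None` when no category or several categories match,
/// which is the ambiguous case left to a model-based classifier.
pub fn classify_by_keywords(request: &str) -> Option<Intent> {
    let text = request.to_lowercase();
    let mut hits = KEYWORDS
        .iter()
        .filter(|(_, words)| words.iter().any(|w| text.contains(*w)))
        .map(|(intent, _)| *intent);
    match (hits.next(), hits.next()) {
        (Some(intent), None) => Some(intent),
        _ => None,
    }
}
```

Substring matching keeps the common path cheap, at the price of false hits such as "bug" inside "debug"; a production classifier would likely tokenize first.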
Capability Matching
Each model has a static capability profile that rates it across dimensions:

    claude-sonnet-4:  coding=0.95  reasoning=0.93  creative=0.90  speed=0.80  cost=0.70
    gpt-4o:           coding=0.90  reasoning=0.88  creative=0.85  speed=0.85  cost=0.75
    gemini-2.5-pro:   coding=0.88  reasoning=0.90  creative=0.82  speed=0.82  cost=0.80
    llama3.1-70b:     coding=0.75  reasoning=0.70  creative=0.72  speed=0.90  cost=0.95

The Router multiplies the intent-relevant capability scores by beta to produce the capability component of the final score.
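As a rough illustration of the capability component, the sketch below looks up the dimensions relevant to an intent and scales them by `beta`. The `capability_component` function is hypothetical, and two details are assumptions rather than documented behavior: multiple relevant dimensions are averaged, and a dimension missing from the profile counts as 0.0.

```rust
use std::collections::HashMap;

/// Capability component of the score for one model.
/// Assumptions: relevant dimensions are averaged, and a dimension
/// absent from the profile contributes 0.0.
pub fn capability_component(
    profile: &HashMap<&str, f64>,
    relevant: &[&str],
    beta: f64,
) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let sum: f64 = relevant
        .iter()
        .map(|dim| profile.get(dim).copied().unwrap_or(0.0))
        .sum();
    beta * sum / relevant.len() as f64
}
```

With the `claude-sonnet-4` profile above and `beta = 0.30`, a request whose only relevant dimension is `coding` would contribute 0.30 × 0.95 = 0.285 to the score under these assumptions.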
Elo Rating
Every model maintains an Elo rating that updates after each request. When a request succeeds (user accepts the response, no retry needed), the model gains Elo. When a request fails or is retried on a different model, the model loses Elo.
This creates a self-correcting feedback loop: models that perform well in practice rise in ranking, regardless of their static benchmarks.
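The update rule is not spelled out here, so the following is a sketch using the standard Elo formula. The K-factor of 32 and the choice to rate each model against a fixed 1500 baseline are assumptions made for illustration, and `expected_score` and `update_elo` are hypothetical names.

```rust
/// Standard Elo expected score for `rating` against `opponent`.
pub fn expected_score(rating: f64, opponent: f64) -> f64 {
    1.0 / (1.0 + 10f64.powf((opponent - rating) / 400.0))
}

/// Applies one Elo update after a request completes.
/// Assumptions: K = 32, and the "opponent" is a fixed 1500 baseline.
/// A success counts as an actual score of 1.0, a failure or retry as 0.0.
pub fn update_elo(rating: f64, succeeded: bool) -> f64 {
    const K: f64 = 32.0;
    const BASELINE: f64 = 1500.0;
    let actual = if succeeded { 1.0 } else { 0.0 };
    rating + K * (actual - expected_score(rating, BASELINE))
}
```

Under these assumptions a model at 1500 gains 16 points for a success and loses 16 for a failure, while a model already rated well above the baseline gains less per success, which keeps ratings from growing without bound.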
KNN Semantic Routing
The Router maintains an embedding index of past requests and their outcomes. For each new request, it finds the K nearest past requests (by embedding similarity) and checks which models performed best on similar inputs.
This is especially valuable for specialized domains — if a particular model consistently handles SQL questions well in your environment, the Router learns to prefer it for SQL-related requests.
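One way to picture the lookup is a brute-force nearest-neighbor search over stored embeddings. The sketch below is an assumption-heavy illustration: `PastRequest`, `cosine_similarity`, and `knn_similarity` are hypothetical names, cosine similarity stands in for whatever metric the index really uses, and the per-model signal is taken to be the model's success rate among the K nearest neighbors.

```rust
/// A logged request: its embedding, the model that served it, and the outcome.
pub struct PastRequest {
    pub embedding: Vec<f32>,
    pub model: String,
    pub succeeded: bool,
}

/// Cosine similarity of two vectors; 0.0 if either has zero length.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Success rate of `model` among the `k` past requests nearest to `query`.
/// Returns 0.0 when the model served none of those neighbors, which also
/// covers an empty history.
pub fn knn_similarity(query: &[f32], history: &[PastRequest], model: &str, k: usize) -> f32 {
    let mut scored: Vec<(f32, &PastRequest)> = history
        .iter()
        .map(|p| (cosine_similarity(query, &p.embedding), p))
        .collect();
    // Most similar first.
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    let served: Vec<&PastRequest> = scored
        .iter()
        .take(k)
        .map(|(_, p)| *p)
        .filter(|p| p.model == model)
        .collect();
    if served.is_empty() {
        return 0.0;
    }
    let wins = served.iter().filter(|p| p.succeeded).count();
    wins as f32 / served.len() as f32
}
```

A linear scan like this is only reasonable for small histories; a real index would presumably use an approximate nearest-neighbor structure.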
Automix
Automix is a cost optimization strategy:
- Route the request to a cheaper model first (e.g., Haiku, GPT-4o-mini)
- Evaluate the response confidence (based on model self-assessment, response coherence, and output quality signals)
- If confidence falls below a threshold, re-route to a premium model (e.g., Opus, o3)
- Return the premium response to the user
This saves cost on simple requests while maintaining quality on hard ones. The confidence threshold is tunable.
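The escalation logic can be sketched as a single function. Here `Response`, `automix`, and the `cheap` and `premium` callbacks are hypothetical stand-ins for real provider calls, and the confidence value is assumed to arrive already computed as a number between 0 and 1.

```rust
/// A model response together with a confidence estimate in [0, 1].
pub struct Response {
    pub text: String,
    pub confidence: f64,
}

/// Tries the cheap model first and escalates to the premium model
/// only when the cheap response's confidence is below `threshold`.
/// `cheap` and `premium` are hypothetical stand-ins for provider calls.
pub fn automix<C, P>(prompt: &str, threshold: f64, cheap: C, premium: P) -> Response
where
    C: Fn(&str) -> Response,
    P: Fn(&str) -> Response,
{
    let first = cheap(prompt);
    if first.confidence >= threshold {
        first
    } else {
        premium(prompt)
    }
}
```

Note that an escalated request pays for both calls, so the threshold trades extra spend on hard requests against savings on easy ones.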

    Request ──→ Cheap Model ──→ Confidence Check
                                      │
                          ≥ threshold → Return cheap response
                          < threshold → Re-route to premium model → Return premium response

History
The Router logs every routing decision and its outcome:
- Which model was selected and why (score breakdown)
- Whether the request succeeded or was retried
- Response latency and token counts
- User feedback signals (if available)
This history feeds the Elo system and the KNN index, and it provides observability into routing behavior.
Cold-Start Guards
When PRX starts fresh with no history, the Router falls back to sensible defaults:
- Elo ratings initialize to 1500 for all models
- KNN index is empty, so `similarity` contributes zero to the score
- Capability matching and cost/latency become the dominant factors
- A configurable `default_model` is used when scores are tied
As the system accumulates history, the dynamic components (Elo, KNN) gradually take over from static heuristics.
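The tie-breaking rule from the cold-start defaults can be sketched as follows. The `pick` function and `INITIAL_ELO` constant are hypothetical names, and treating scores within 1e-9 of each other as tied is an assumption; the sketch also falls back to the default model when there are no candidates at all.

```rust
/// Starting Elo for every model, per the cold-start defaults.
pub const INITIAL_ELO: f64 = 1500.0;

/// Picks the top-scoring model, falling back to `default_model` when the
/// top score is shared by several models or when there are no candidates.
/// Assumption: scores within 1e-9 of the best are treated as tied.
pub fn pick<'a>(scores: &[(&'a str, f64)], default_model: &'a str) -> &'a str {
    let best = scores
        .iter()
        .map(|(_, s)| *s)
        .fold(f64::NEG_INFINITY, f64::max);
    let tied: Vec<&'a str> = scores
        .iter()
        .filter(|(_, s)| (*s - best).abs() < 1e-9)
        .map(|(m, _)| *m)
        .collect();
    match tied.as_slice() {
        [only] => *only,
        _ => default_model,
    }
}
```

Ties are most likely right after startup, when every model has the same Elo and zero similarity and two models happen to share capability and cost figures.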
Configuration

    [router]
    default_model = "anthropic/claude-sonnet-4-20250514"

    # Scoring weights
    alpha = 0.25    # Semantic similarity
    beta = 0.30     # Capability match
    gamma = 0.20    # Elo rating
    delta = 0.15    # Cost penalty
    epsilon = 0.10  # Latency penalty

    # Automix
    automix_enabled = true
    automix_cheap_model = "anthropic/claude-haiku-4-20250414"
    automix_premium_model = "anthropic/claude-sonnet-4-20250514"
    automix_confidence_threshold = 0.7

    # Fallback chain
    fallback_chain = [
      "anthropic/claude-sonnet-4-20250514",
      "openai/gpt-4o",
      "google/gemini-2.5-pro",
    ]
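
To illustrate how a `fallback_chain` like the one configured above might be walked during the reliability fallback step, here is a minimal sketch. The `first_available` function and its `is_available` callback are hypothetical; how the Router actually detects an unavailable model is not described here.

```rust
/// Returns the first model in `chain` that `is_available` accepts,
/// or `None` if every model in the chain is unavailable.
/// `is_available` is a hypothetical health check supplied by the caller.
pub fn first_available<'a>(
    chain: &[&'a str],
    is_available: impl Fn(&str) -> bool,
) -> Option<&'a str> {
    chain.iter().copied().find(|model| is_available(*model))
}
```

Because the chain is tried in order, the position of a model in `fallback_chain` acts as its priority when the first choice is down.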