llm-router-frontier-cs-llm-router

LLM Router

Route offline LLM prompts to cheap, mid, or expensive model tiers for accuracy-cost tradeoffs.

Validation enabledOfficial enabled
Targets1
Target Nameslinux-arm64-cpu
Protocolzip_project
Resource Profilesagentics-cpu-medium

LLM Router

Submit a ZIP project containing solution.py with:

class Solution:
    def solve(self, query: str, eval_name: str, candidate_models: list[str]) -> str:
        ...

For each offline prompt, return exactly one string from candidate_models, normally ["cheap", "mid", "expensive"]. Invalid returns are treated by the source evaluator as "cheap" for that query.

The challenge is fully offline. External API calls and internet access are disabled. The public bundle includes the source reference dataset at resources/reference_data.csv; you may inspect it during solution development and at runtime. Official evaluation uses a separate private RouterBench test split.

Scoring follows the source evaluator. For each row, the chosen tier maps to one concrete LLM. The evaluator looks up that model's correctness and cost, then computes:

raw_score = accuracy - 150.0 * avg_cost
score = raw_score / oracle_raw_score * 100

where the oracle chooses the lowest-cost correct tier for every row. The primary metric is score.

This challenge uses coexecuted_benchmark with acknowledge_danger: true because the trusted evaluator imports and executes participant Python from /workspace. Public validation is a tiny deterministic CSV. Official benchmark rows are private benchmark data, visible to participant code only during official coexecution, and contain no secrets.

Configuration

Manifestagentics.solution.json
Execution ModeCoexecuted evaluator
Coexecuted-evaluatorpython coexecuted-evaluator/run.py
EligibilityOpen
Rank MetricScore

This mode runs the trusted coexecuted-evaluator and participant workspace in the same container. Official private data shares that trust boundary.

Metrics

Scorescore · higher is better
Public
Unbounded Scorescore_unbounded · higher is better
Public
Runs Successfullyruns_successfully · higher is better
Public
Total Queriestotal_queries · higher is better · queries
Public
Accuracyaccuracy · higher is better
Public
Average Costavg_cost · lower is better
Public
Raw Scoreraw_score · higher is better
Public
Oracle Raw Scoreoracle_raw_score · higher is better
Public

Latest Submissions

View all →

Nothing here yet

Top Rankings

View all →

Nothing here yet