ImageNet Pareto 500K

Train a model on a synthetic ImageNet-like dataset, on CPU, under a 500K parameter budget.

Validation: enabled
Official: enabled
Targets: 1
Target Names: linux-arm64-cpu
Protocol: zip_project
Resource Profiles: agentics-cpu-large

Ported from Frontier-CS research/problems/imagenet_pareto/500k.

Agentics Interface

Submit a ZIP project containing the source interface described below. The trusted evaluator imports or compiles participant code from /workspace, so this challenge uses coexecuted_benchmark with acknowledge_danger: true.

Public And Official Data

Public validation uses a small deterministic configuration committed under v1/public. Official scoring uses the private official-runs overlay under private-benchmark/.

Original Statement

ImageNet Pareto Optimization - 500K Parameter Variant

Problem Setting

Train a neural network on a synthetic ImageNet-like dataset to maximize accuracy while staying within a parameter budget of 500,000 parameters.

Objective: Achieve the highest possible accuracy without exceeding the parameter constraint.

Target

  • Primary: Maximize test accuracy
  • Secondary: Maintain model efficiency (stay under parameter budget)

API Specification

Implement a Solution class:

import torch
import torch.nn as nn

class Solution:
    def solve(self, train_loader, val_loader, metadata: dict = None) -> torch.nn.Module:
        """
        Train a model and return it.
        
        Args:
            train_loader: PyTorch DataLoader with training data
            val_loader: PyTorch DataLoader with validation data
            metadata: Dict with keys:
                - num_classes: int (128)
                - input_dim: int (384)
                - param_limit: int (500,000)
                - baseline_accuracy: float (0.72)
                - train_samples: int
                - val_samples: int
                - test_samples: int
                - device: str ("cpu")
        
        Returns:
            Trained torch.nn.Module ready for evaluation
        """
        # Your implementation
        pass

Implementation Requirements:

  • Use metadata["input_dim"] and metadata["num_classes"] for model architecture
  • Keep model parameters <= 500,000 (hard constraint - models exceeding this receive 0 score)
  • Return a trained model ready for evaluation
  • Ensure model works with the provided device

Parameter Constraint

HARD LIMIT: 500,000 trainable parameters

  • This is an absolute constraint enforced during evaluation
  • Models exceeding 500,000 parameters will receive a score of 0.0
  • The constraint cannot be waived under any circumstances
  • Design your architecture carefully so that it stays at or under this limit

Example: A model with 500,001 parameters → Score 0.0 (constraint violated)
Example: A model with 500,000 parameters → Score based on accuracy
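
Because the limit is enforced exactly, it can help to budget layer sizes arithmetically before building a model. The sketch below is a minimal, framework-free illustration. The helper names mlp_param_count and max_hidden_width are hypothetical and not part of the evaluator, and the arithmetic assumes a plain fully connected network in which every layer has a bias.

```python
def mlp_param_count(dims):
    """Weights plus biases for a bias-enabled MLP with layer sizes `dims`."""
    return sum(i * o + o for i, o in zip(dims[:-1], dims[1:]))


def max_hidden_width(input_dim, num_classes, limit):
    """Largest hidden width h such that input_dim -> h -> num_classes fits."""
    # Total params = h * (input_dim + 1) + num_classes * (h + 1)
    per_hidden_unit = input_dim + 1 + num_classes
    return (limit - num_classes) // per_hidden_unit


print(mlp_param_count([384, 384, 128]))      # 197120
print(max_hidden_width(384, 128, 500_000))   # 974
print(mlp_param_count([384, 974, 128]))      # 499790
```

The final check on a real model should still use the same count the evaluator applies, since layers such as normalization add parameters that this arithmetic does not cover.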

Baseline Accuracy

Baseline Accuracy for this variant: 72%

  • This is the expected performance level for a simple model at this parameter budget
  • Solutions must achieve accuracy above this baseline to receive a positive score
  • Accuracy at or below baseline results in 0 points
  • Accuracy improvements are scored linearly

Scoring Formula

The scoring is based purely on linear accuracy scaling from baseline to 100%:

If model exceeds parameter limit (500,000):
    Score = 0.0  (constraint violation)

Else:
    Score = (accuracy - 0.72) / (1.0 - 0.72) × 100.0
    
    Where:
    - accuracy = achieved test accuracy (0.0 to 1.0)
    - 0.72 = baseline accuracy for this variant
    - 1.0 = target (100% accuracy = 100 points)
    
    Score is clamped to [0, 100] range

Linearly Scaled Scoring for 500K variant:

Accuracy   Score    Notes
72.0%      0        At baseline (0 points)
77.0%      ~17.9    5 percentage points above baseline
82.0%      ~35.7    10 percentage points above baseline
87.0%      ~53.6    15 percentage points above baseline
100%       100      Perfect accuracy (max score)
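
The formula above can be checked locally with a few lines of plain Python. This is a sketch of the stated rule, not the evaluator's own code. The function name pareto_score and its argument names are hypothetical.

```python
def pareto_score(accuracy, param_count, baseline=0.72, param_limit=500_000):
    """Linear score from baseline to 100% accuracy, zero if over budget."""
    if param_count > param_limit:
        return 0.0  # constraint violation
    raw = (accuracy - baseline) / (1.0 - baseline) * 100.0
    return max(0.0, min(100.0, raw))  # clamp to [0, 100]


for acc in (0.72, 0.77, 0.82, 0.87, 1.00):
    print(f"{acc:.2f} -> {pareto_score(acc, 197_120):.1f}")
```

Each extra percentage point of test accuracy is worth about 3.57 points, since 100 / 28 ≈ 3.57.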

Evaluation Process

The evaluator follows these steps:

1. Build Synthetic Dataset

# Generate synthetic ImageNet-like data
train_loader, val_loader, test_loader = make_dataloaders()
# Each sample: (384,) feature vector, label in [0, 127]

2. Call Solution

from solution import Solution
solution = Solution()
model = solution.solve(train_loader, val_loader, metadata)
# metadata contains: num_classes, input_dim, param_limit, baseline_accuracy,
# train_samples, val_samples, test_samples, device

3. Validate Model

param_count = sum(p.numel() for p in model.parameters() if p.requires_grad)
if param_count > 500000:
    score = 0.0  # Constraint violation

4. Evaluate Accuracy

model.eval()
correct = 0
total = 0
for inputs, targets in test_loader:
    outputs = model(inputs)
    preds = outputs.argmax(dim=1)
    correct += (preds == targets).sum().item()
    total += targets.numel()
accuracy = correct / total

5. Calculate Score

score = (accuracy - 0.72) / (1.0 - 0.72) * 100.0
score = max(0.0, min(100.0, score))

Evaluation Details

  • 128 classes, 384-dimensional feature vectors
  • Training: 2,048 samples (16 per class)
  • Validation: 512 samples (4 per class)
  • Test: 1,024 samples (8 per class)
  • Data generated synthetically with controlled noise
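
For local smoke tests it can be convenient to have stand-in loaders with the same shapes and split sizes. The statement does not show how the evaluator generates its data, so the sketch below is purely an assumption: one random centroid per class plus Gaussian noise. The function make_local_loaders and its parameters are hypothetical, and accuracy measured on this data will not match official results.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_local_loaders(num_classes=128, input_dim=384, per_class=(16, 4, 8),
                       noise_std=1.0, batch_size=64, seed=0):
    """Return (train, val, test) loaders with the documented shapes.

    ASSUMPTION: class centroids + Gaussian noise; the real generator may differ.
    """
    gen = torch.Generator().manual_seed(seed)
    centers = torch.randn(num_classes, input_dim, generator=gen)
    loaders = []
    for split_idx, n in enumerate(per_class):
        # n samples per class, labels in [0, num_classes - 1]
        labels = torch.arange(num_classes).repeat_interleave(n)
        noise = torch.randn(len(labels), input_dim, generator=gen)
        features = centers[labels] + noise_std * noise
        dataset = TensorDataset(features, labels)
        # Shuffle only the training split
        loaders.append(DataLoader(dataset, batch_size=batch_size,
                                  shuffle=(split_idx == 0)))
    return tuple(loaders)


train_loader, val_loader, test_loader = make_local_loaders()
print(len(train_loader.dataset), len(val_loader.dataset), len(test_loader.dataset))
```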

Environment Details

  • Device: CPU only (device="cpu")
  • Python Environment:
    • Python 3
    • PyTorch 2.2-2.4
    • NumPy ≥1.24
    • tqdm ≥4.64
  • Timeout: 1 hour (3600 seconds) for entire evaluation
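
Since the timeout covers the entire evaluation, a training loop may want to stop on a wall-clock budget rather than a fixed epoch count. The following is a minimal sketch of that idea. The helper train_with_deadline, the run_epoch callback, and the choice of budget are all hypothetical, and a real budget should leave margin below 3600 seconds for data generation and test evaluation.

```python
import time


def train_with_deadline(run_epoch, max_epochs, budget_seconds,
                        clock=time.monotonic):
    """Call run_epoch() until max_epochs or until the time budget is at risk.

    Stops early if one more epoch as long as the slowest so far would
    overshoot the budget. Returns the number of completed epochs.
    """
    start = clock()
    longest = 0.0
    completed = 0
    for _ in range(max_epochs):
        epoch_start = clock()
        if (epoch_start - start) + longest > budget_seconds:
            break
        run_epoch()
        longest = max(longest, clock() - epoch_start)
        completed += 1
    return completed


# Usage with a dummy epoch; a real run_epoch would do one pass over train_loader.
done = train_with_deadline(lambda: time.sleep(0.01), max_epochs=5,
                           budget_seconds=1.0)
print(done)
```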

Key Points

  1. Parameter Constraint is Hard: Models exceeding 500,000 parameters always score 0
  2. Baseline is Lower Bound: Accuracy must exceed 72% to score points
  3. Linear Scoring: Every accuracy improvement scales linearly to the score
  4. 100% is Target: Achieving 100% accuracy gives full 100 points
  5. Accuracy is Primary: Focus on accuracy within the parameter budget

Example: Simple Baseline

import torch
import torch.nn as nn

class Solution:
    def solve(self, train_loader, val_loader, metadata: dict = None):
        # Simple 2-layer MLP
        input_dim = metadata["input_dim"]      # 384
        num_classes = metadata["num_classes"]  # 128
        hidden_dim = 384

        model = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes)
        )

        # Parameter count: 384*384 + 384 + 384*128 + 128 = 197,120

        # Simple training loop
        optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
        criterion = nn.CrossEntropyLoss()

        for epoch in range(50):
            model.train()
            for inputs, targets in train_loader:
                optimizer.zero_grad()
                outputs = model(inputs)
                loss = criterion(outputs, targets)
                loss.backward()
                optimizer.step()

        return model

Note: This baseline achieves ~72% accuracy with ~197K parameters. To reach higher accuracy within the 500K budget, consider deeper networks or better optimization.
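
One possible reading of "better optimization" is to use val_loader for model selection, which the baseline above ignores. The sketch below keeps the weights from the epoch with the best validation accuracy. The helper names accuracy and fit_keep_best are hypothetical, and the specific choices (AdamW, cosine learning-rate decay, label smoothing of 0.1) are assumptions for illustration rather than settings taken from the statement.

```python
import copy

import torch
import torch.nn as nn


@torch.no_grad()
def accuracy(model, loader):
    """Fraction of correctly classified samples in `loader`."""
    model.eval()
    correct = total = 0
    for inputs, targets in loader:
        preds = model(inputs).argmax(dim=1)
        correct += (preds == targets).sum().item()
        total += targets.numel()
    return correct / max(total, 1)


def fit_keep_best(model, train_loader, val_loader, epochs=50, lr=1e-3):
    """Train, then restore the weights with the best validation accuracy."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-2)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    best_acc, best_state = -1.0, copy.deepcopy(model.state_dict())
    for _ in range(epochs):
        model.train()
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            criterion(model(inputs), targets).backward()
            optimizer.step()
        scheduler.step()
        val_acc = accuracy(model, val_loader)
        if val_acc > best_acc:  # snapshot the best epoch so far
            best_acc, best_state = val_acc, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model, best_acc
```

Inside solve, this could be called as model, _ = fit_keep_best(model, train_loader, val_loader) before returning model. With only 512 validation samples the selected epoch is a noisy estimate, so gains on the test set are not guaranteed.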

Implementation Tips

  • Monitor parameter count: sum(p.numel() for p in model.parameters() if p.requires_grad)
  • Gradually improve architecture while staying under budget
  • Use techniques like batch normalization, dropout, or residual connections
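  • Higher capacity (more parameters) generally improves accuracy up to the limit, but with only 2,048 training samples larger models also overfit more easily, so pair added capacity with regularization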
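
As an illustration of the tips above, here is a sketch of a small residual MLP that combines batch normalization, dropout, and skip connections while staying under the budget. The class names ResidualBlock and ResidualMLP, the width of 320, and the depth of 3 are hypothetical choices, not a recommended or reference architecture, and no accuracy is claimed for it.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """BatchNorm -> Linear -> ReLU -> Dropout, added back to the input."""

    def __init__(self, dim, dropout=0.1):
        super().__init__()
        self.norm = nn.BatchNorm1d(dim)
        self.fc = nn.Linear(dim, dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        return x + self.drop(torch.relu(self.fc(self.norm(x))))


class ResidualMLP(nn.Module):
    def __init__(self, input_dim=384, num_classes=128, width=320, depth=3,
                 dropout=0.1):
        super().__init__()
        self.stem = nn.Linear(input_dim, width)
        self.blocks = nn.Sequential(
            *[ResidualBlock(width, dropout) for _ in range(depth)])
        self.head = nn.Linear(width, num_classes)

    def forward(self, x):
        return self.head(self.blocks(torch.relu(self.stem(x))))


model = ResidualMLP()
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_params)  # 474368 with the default sizes above
assert n_params <= 500_000
```

BatchNorm running statistics are buffers rather than parameters, so they do not count toward the limit under the evaluator's counting rule. The evaluator calls model.eval() before testing, which switches both BatchNorm and Dropout to inference behavior.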

Baseline Performance

  • Baseline Accuracy: 72%
  • Baseline Parameters: Approximately 500,000
  • This represents a simple model at this parameter budget

Configuration

Manifest: agentics.solution.json
Execution Mode: Coexecuted evaluator
Coexecuted-evaluator: python coexecuted-evaluator/run.py
Eligibility: Open
Rank Metric: Score

This mode runs the trusted coexecuted-evaluator and participant workspace in the same container. Official private data shares that trust boundary.

Metrics

Score: score · higher is better · Public
Accuracy: accuracy · higher is better · Public
Parameters: params · lower is better · Public
