Benchmarks

In-the-Wild Distribution Shifts (WILDS)

A benchmark suite of real-world datasets designed to evaluate model robustness under distribution shift.

Overview

WILDS is a collection of datasets curated to test how models perform when the test distribution differs from the training distribution. It focuses on distribution shifts that commonly occur in the wild, including temporal changes, geographic shifts, domain changes, and subpopulation shift. It emphasizes out-of-distribution evaluation by design, requiring models to generalize across predefined environments rather than relying on random train/test splits.

The suite spans multiple modalities and tasks (e.g., image, text, code, and molecular graph prediction). Each dataset provides explicit metadata that defines domains or groups, so evaluations can measure both average performance and worst-group behavior under shift.
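
As a concrete illustration, the reference wilds Python package (pip install wilds) wraps each dataset behind a common interface whose splits are predefined by environment. The sketch below follows the package's documented usage for Camelyon17 (tumor detection from histopathology patches); exact arguments and defaults may vary across package versions.

# Minimal sketch, assuming the wilds package and its documented loader API.
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader

# Download/load the dataset; splits are predefined by domain (here: hospital),
# not drawn at random from a single pool.
dataset = get_dataset(dataset="camelyon17", download=True)

train_data = dataset.get_subset("train", transform=transforms.ToTensor())
train_loader = get_train_loader("standard", train_data, batch_size=32)

for x, y, metadata in train_loader:
    pass  # x: image batch, y: labels, metadata: per-example domain information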

Dataset Specification

WILDS is not a single dataset but a suite of datasets, each with its own schema, label space, and shift definition. Common properties include:

  • Predefined train/validation/test splits by domain or time.
  • Metadata fields that specify groups (e.g., hospital, year, region, source).
  • Task-specific labels and evaluation protocols.

Several WILDS datasets also include unlabeled splits intended for semi-supervised learning and domain adaptation, enabling evaluation of methods that leverage additional unlabeled data from shifted test domains.
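
As a sketch of these conventions, based on the grouper and unlabeled-data interfaces documented in the wilds package (field and split names such as "hospital" and "train_unlabeled" are dataset-specific assumptions and worth verifying against the installed release):

from wilds import get_dataset
from wilds.common.grouper import CombinatorialGrouper

# Labeled dataset with predefined, hospital-based splits.
dataset = get_dataset(dataset="camelyon17", download=True)

# A grouper maps each example's metadata row to a group id; here the group is
# the source hospital, which is also how the official splits are defined.
grouper = CombinatorialGrouper(dataset, ["hospital"])
group_ids = grouper.metadata_to_group(dataset.metadata_array)

# Unlabeled data lives behind a separate flag and its own split names
# (assumed here to be "train_unlabeled" for Camelyon17; names differ per dataset).
unlabeled = get_dataset(dataset="camelyon17", unlabeled=True, download=True)
unlabeled_train = unlabeled.get_subset("train_unlabeled")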

Input Format

Inputs vary by dataset. In general, WILDS datasets provide input features, a label, and a metadata group identifier for shift-aware evaluation.

{
  "input": "<microscope image patch tensor>",  
  "label": 1,                     // 1 = tumor present, 0 = tumor absent
  "metadata": {
    "hospital_id": "hospital_003", // group/domain identifier
    "slide_id": "WSI_1298"
  }
}
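
In the wilds package, the same record is exposed as an (x, y, metadata) tuple rather than a JSON object. A minimal sketch follows; the metadata field names shown are assumptions for Camelyon17.

from wilds import get_dataset

dataset = get_dataset(dataset="camelyon17", download=True)

x, y, metadata = dataset[0]       # patch image, binary tumor label, metadata row
print(dataset.metadata_fields)    # e.g. ["hospital", "slide", "y"] for Camelyon17
print(int(y), metadata)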

Output Format

Outputs depend on the task (e.g., classification, regression, or detection). A typical classification output is a predicted class or score per example.

{
  "prediction_score": 0.83  // probability of tumor present
}

During evaluation, prediction scores are aggregated by metadata groups (e.g., hospital_id) to compute average and worst-group performance.
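
The sketch below illustrates that aggregation on hypothetical arrays of per-example correctness and group ids; it is plain NumPy, not the benchmark's official evaluation code.

import numpy as np

# Hypothetical evaluation arrays: one entry per test example.
correct = np.array([1, 0, 1, 1, 0, 1, 1, 1])     # 1 if the prediction was right
group_ids = np.array([0, 0, 0, 1, 1, 1, 2, 2])   # e.g. index of hospital_id

average_acc = correct.mean()
per_group_acc = {g: correct[group_ids == g].mean() for g in np.unique(group_ids)}
worst_group_acc = min(per_group_acc.values())

print(f"average={average_acc:.3f}, worst-group={worst_group_acc:.3f}")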

Metrics

  • Metrics are dataset-specific; common choices include accuracy, F1, AUROC, and mean squared error.
  • WILDS emphasizes robustness under distribution shift by reporting both average performance over the full test set and worst-group performance across the predefined subpopulations or environments (see the evaluation sketch after this list).
  • Optional: per-group metrics for diagnosing failures.
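
Each dataset in the wilds package also ships its own eval method that performs this dataset-specific aggregation. The sketch below assumes the documented interface (predictions, labels, and metadata in; a results dict and printable summary out) and uses placeholder predictions in place of a trained model.

import torch
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_eval_loader

dataset = get_dataset(dataset="camelyon17", download=True)
test_data = dataset.get_subset("test", transform=transforms.ToTensor())
test_loader = get_eval_loader("standard", test_data, batch_size=32)

all_y_pred, all_y_true, all_metadata = [], [], []
for x, y, metadata in test_loader:
    y_pred = torch.zeros_like(y)   # placeholder predictions; a real model goes here
    all_y_pred.append(y_pred)
    all_y_true.append(y)
    all_metadata.append(metadata)

results, results_str = dataset.eval(
    torch.cat(all_y_pred), torch.cat(all_y_true), torch.cat(all_metadata)
)
print(results_str)   # average and group-level metrics, as defined for the dataset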

Known Limitations

  • Datasets differ significantly in schema and evaluation, which can complicate cross-dataset comparisons.
  • Group definitions depend on available metadata and may not capture all real-world shifts.
  • Some shifts are subtle, requiring large sample sizes to observe robust differences.
  • WILDS evaluates robustness under shift but does not assess calibration, clinical or domain appropriateness, or decision utility, which must be evaluated separately.

Versioning and Provenance

WILDS results depend on the dataset version, split definitions, and group metadata used for evaluation. For reproducibility, document the dataset release, preprocessing pipeline, and any changes to group definitions or sampling.
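
For instance, with the wilds package a specific release can be pinned and recorded; the version keyword and the version/split_dict attributes below are assumptions based on the package's loader and should be checked against the installed release.

import json
from wilds import get_dataset

# Pin an explicit dataset release rather than whatever the installed package defaults to.
dataset = get_dataset(dataset="camelyon17", version="1.0", download=True)

# Record provenance details alongside any reported results.
provenance = {
    "dataset": "camelyon17",
    "version": dataset.version,
    "splits": list(dataset.split_dict.keys()),
    "grouping": ["hospital"],   # group definition used for worst-group metrics
}
print(json.dumps(provenance, indent=2))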

References

WILDS Benchmark: https://wilds.stanford.edu/

GitHub: https://github.com/p-lambda/wilds

Related Benchmarks