New · We Release AutoLab: Benchmarking Frontier Models on Autonomous Research

The infrastructure where AI learns to do real work.

Agents need more than static data. They need environments, evaluation, and expert guidance.

Diagnosis · Evaluation · Expert Data · RL Environments

Leading AI labs use our open-source infrastructure to make AI systems work in the real world.

Benchmarks & Infrastructure AI Can't Game

Product

Bake AI Infrastructures

From failure diagnosis to real-world training

BakeLens

Evaluation & Diagnosis

Know if your agents will fail.

Proof

Expert-Level Data

Fix them with human expert signals.

RL & Interactive Learning Environments

Test before you deploy.

Domain Coverage

Critical Decisions

Finance & Risk-Sensitive Decision Making
Safety, Alignment, Red-Teaming

Verifiable Tasks

Coding & Software Engineering Agents
STEM & Multimodal Reasoning
Auto Research

Human Judgment

Writing, Judgment, EQ
Aesthetic & Art

BakeLab Research

The science behind reliable AI.

AutoLab: Can Models Begin to Participate in the Loops That Drive Scientific and Engineering Progress?

Auto Research Benchmark · Blog Post

Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

Visual Aesthetics & Evaluation · Blog Post

PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory

Personalization & Agentic Memory · Preprint

Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

Agent Safety · ICLR 2026

CoDA: Agentic Systems for Collaborative Data Visualization

Agentic System · ICLR 2026
