Pressure-testing my own AutoML benchmark

I tried to write a benchmark for an agentic AutoML platform and the useful result was negative: the obvious metric was wrong, and there is no unified benchmark runner yet. These are the design notes that talked me out of shipping a number I could not defend, from a ~137k line TypeScript codebase with ~2,208 tests.

The metric that looked right and was not

The first idea was to score the preprocessing agent by how closely its output matched a human-written reference, using set overlap of the chosen transformations. The natural measure is Jaccard similarity over the two sets of steps $A$ (agent) and $H$ (human):

J(A, H) = \frac{|A \cap H|}{|A \cup H|}

It is clean, it is a single number between 0 and 1, and it is the wrong thing to optimize. Two preprocessing pipelines can disagree on every step and produce an equally good model: drop a column or impute it, one-hot or target-encode, standardize or leave it to a tree. Jaccard scores those as failures because the strings differ, while the thing I actually care about, the quality of the model that comes out the far end, is unchanged. A high-Jaccard agent is an agent that imitates one human's habits, not one that builds good pipelines.

What the metric should have been

The honest target is outcome, not process: held-out model quality under a fixed budget, with the human pipeline as one baseline rather than the ground truth. Something closer to

\text{score}(A) = \text{quality}\big(\text{model}(A)\big) - \text{quality}\big(\text{model}(H)\big)

measured on a holdout the agent never saw, across several datasets, at a capped time-to-model so a slow agent cannot buy quality with compute. The notes reach that conclusion and then state the harder truth: that runner does not exist in the repo yet. The quality gate in the benchmark catalog is literally set to tbd, and no run artifacts are committed. So I do not publish a quality number, because there is not one I can stand behind.

What I can measure, and do

The benchmark story is unfinished; the system around it is not. These are the numbers I report, every one counted from the code rather than narrated:

test suite: ~2,208 test cases across about 246 files (1,229 backend, 908 frontend, 71 landing), counted from the it() and test() declarations.
source size: ~137k lines of TypeScript across backend and frontend, excluding tests.
MCP registry: 20+ tools exposed to the agent over the official Model Context Protocol SDK, with human approval gates between phases.
migrations: 23 sequential SQL migrations recording the schema as it grew (auth, notebooks, workflows, experiments, pgvector embeddings, deployments).

The part I will make a hard claim about

Agent-generated Python never runs in-process. Every cell executes inside a hardened Docker container, and the flag set is in one auditable function covered by tests rather than scattered across the codebase:

--network none          // no egress by default
--read-only             // read-only root filesystem
--user sandbox          // non-root
--memory, --cpus        // resource caps
--tmpfs /tmp:nosuid     // no setuid binaries in scratch
-v datasets:ro          // datasets mounted read-only
--add-host ...          // SSRF block

backend/src/services/container/dockerBuilder.ts: every sandbox flag, in one tested place

I will defend that claim because dockerBuilder.test.ts asserts it. Network off, read-only root, non-root user, memory and CPU caps, a nosuid tmpfs, read-only dataset mounts, and an SSRF block. The difference between this and the benchmark numbers is the difference between a thing the tests prove and a thing the tests do not yet exist for.

The lesson I am keeping

The expo notes are a record of rejecting my own first design before it shipped a misleading number. That is the discipline I want the rest of the portfolio judged by: a metric is only worth publishing if it measures the outcome you claim and you can reproduce it. When it cannot, the honest move is to say so and report what you can actually count.

The architecture, the sandbox, and the design notes are at github.com/ShreeChaturvedi/auto-ml.

The metric that looked right and was not

A

(agent) and

H

(human):

J(A, H) = \frac{|A \cap H|}{|A \cup H|}

Jaccard

What the metric should have been

The honest target is outcome, not process: held-out model quality under a fixed budget, with the human pipeline as one baseline rather than the ground truth. Something closer to

\text{score}(A) = \text{quality}\big(\text{model}(A)\big) - \text{quality}\big(\text{model}(H)\big)

outcome

What I can measure, and do

The benchmark story is unfinished; the system around it is not. These are the numbers I report, every one counted from the code rather than narrated:

test suite

~2,208 test cases across about 246 files (1,229 backend, 908 frontend, 71 landing), counted from the it() and test() declarations.

source size

~137k lines of TypeScript across backend and frontend, excluding tests.

MCP registry

20+ tools exposed to the agent over the official Model Context Protocol SDK, with human approval gates between phases.

migrations

23 sequential SQL migrations recording the schema as it grew (auth, notebooks, workflows, experiments, pgvector embeddings, deployments).

The part I will make a hard claim about

--network none          // no egress by default
--read-only             // read-only root filesystem
--user sandbox          // non-root
--memory, --cpus        // resource caps
--tmpfs /tmp:nosuid     // no setuid binaries in scratch
-v datasets:ro          // datasets mounted read-only
--add-host ...          // SSRF block

backend/src/services/container/dockerBuilder.ts: every sandbox flag, in one tested place

The lesson I am keeping

The architecture, the sandbox, and the design notes are at github.com/ShreeChaturvedi/auto-ml.

The metric that looked right and was not

What the metric should have been

What I can measure, and do

The part I will make a hard claim about

The lesson I am keeping

Onemoment

The metric that looked right and was not

What the metric should have been

What I can measure, and do

The part I will make a hard claim about

The lesson I am keeping