C++23 · header-only · MIT · sha 5e209d4

[ sys ]my-stlconfirmed in repo

A standard librarythat self-hosts.

I rebuilt 28 STL containers from scratch in header-only C++23, each layered on the ones beneath it the way a real standard library is. map rides a red-black tree. lru_cache is a list plus a hash map. Every speed claim on this page is a derived artifact, read straight from the committed benchmark CSV, with the honest losses left in.

0.230×
flat_set build+find vs std::set. Contiguous storage, 4.4× faster on a sorted-build / shuffled-lookup workload.: 28containers
Across 5 categories, self-hosting where it earns its keep. ~6,808 LOC of headers.: 55tests
Catch2 across 24 files. CI on 3 OSes with -Werror.

[ problem ]

Most STL clones are a pile of parts.

Reimplementing the STL is a rite of passage, so the internet is full of them. Almost all share two flaws. The containers stand alone, twenty copies of the same growth and node-allocation logic, none of them composing. And the speed is asserted, never shown: a README says fast and you take it on faith.

The harder version is harder on both axes. Build a library where the higher-level containers are actually implemented on top of lower-level ones, the way a real standard library stacks map on a tree and priority_queue on a heap. Then make every speed claim reproducible: a number on the page should trace to a committed run, not get typed in by hand. Get both right and the project stops being a toy and starts being evidence that I know how a standard library is put together, and how to prove it goes fast.

[ approach ]

Layer the containers. Generate the numbers.

Two decisions carry the project: containers self-host on each other, and every benchmark number is a build artifact, not a claim.

Everything is header-only. Each container is a .hpp declaration paired with a .tpp of template definitions, wired through CMake paths. That makes the self-hosting visible: when map wants a balanced tree it includes the rb-tree header instead of copying 670 lines of rotation logic. The dependency graph in §3 is read straight out of those include directives, not drawn by hand.

self-hosting

Built on each other

Higher-level containers reuse lower-level storage instead of duplicating it. The red-black tree backs both map and set; lru_cache is a list plus an unordered_map; the hash map itself chains a forward_list over a vector of buckets.

reproducible

Numbers, not adjectives

A microbenchmark harness compares each container to its closest std:: equivalent. A Python pipeline turns raw runs into a committed CSV/JSON summary plus the light and dark SVG charts, so the README table is generated from the data and cannot drift from it.

[ architecture ]

The self-hosting graph.

Every edge below is a real #include read out of the headers. Storage primitives sit at the base; everything above them is composed, not re-implemented. Four foundation containers (vector, rb-tree, list, forward_list) carry the whole structure.

foundation (raw storage)composed on a my-stl containerinclude edge (verified)

verified include graph · every arrow is a real #include @ sha 5e209d4

[ tradeoffs ]

What I gave up, on purpose.

The benchmark mix is deliberately honest. The wins are where the data structure earns them; the losses stay on the page, because pretending a from-scratch node-based tree beats libc++ would be the tell of someone who does not measure.

Flat containers trade insert cost for cache locality, and win big
flat_map and flat_set store sorted keys in a contiguous vector and binary-search with lower_bound. That makes insert and erase O(n), stated plainly in the container docs. The payoff on a sorted-build / shuffled-lookup workload is 0.378× and 0.230× of node-based std::map and std::set (2.6× to 4.4× faster), because the CPU loves contiguous memory.
The tree-based containers are 12 to 18% slower than the system STL
map lands at 1.122×, set at 1.136×, and unordered_map at 1.183× of the system library. That is the expected gap for a from-scratch red-black tree against a decade-tuned standard library, and it is reported as a loss rather than massaged into a win. Matching a mature node allocator was never the goal; understanding the structure was.
Header-only, paid for in compile time
Splitting each container into .hpp + .tpp keeps the library drop-in with no link step, at the cost of pushing template instantiation into every translation unit. For a learning library meant to be read and reused, that trade is the right one: the self-hosting graph stays legible and the API stays std::-shaped.

[ benchmark ]

my-stl vs the system STL.

Median ns/op ratio, my-stl over std::. Lower is better; the amber line is the 1.0× parity baseline. Bars left of it are wins, bars right of it are the honest losses. All ten numbers are read from the committed benchmark CSV.

§ benchmarks · n=200,000 ops · 5 iters median · 1 warmup · Apple M2 · clang 17 · -O3

Median ns/op ratio, my-stl over std■ win·■ loss·⋮ 1.0× parity·lower is better

flat_set build + find
0.230×
flat_map build + find
0.378×
deque push_back + pop_front
0.595×
vector push_back (no reserve)
0.640×
vector push_back (reserve)
0.996×
small_vector push_back
1.064×
stable_vector push_back
1.119×
map build + find
1.122×
set build + find
1.136×
unordered_map emplace (reserve)
1.183×

source: docs/benchmarks/bench_summary.csv @ sha 5e209d4 · baseline = system C++ standard library, per-container std:: equivalent · ratio = my-stl ÷ std

The honest read: four containers beat the system STL on their target workload (0.230×, 0.378×, 0.595×, 0.640×), five trail it by single-digit-to-18%, and one ties at 0.996×. That spread is the point. A library that claimed to beat std:: across the board would be lying; this one tells you exactly where contiguous storage pays off and where a decade of allocator tuning still wins.

[ interactive · playground ]

Run the my-stl vs std comparison yourself.

The playground rebuilds this chart from the same committed CSV and lets you sort by container, filter to wins or losses, and read each row against the 1.0× baseline. Same numbers, your hands on the dials.

Open the playground

repository · sha 5e209d4 · MIT

Read the source on GitHub Benchmark methodology rb-tree (670 LOC)

A standard librarythat self-hosts.

0.230×

flat_set build+find vs std::set. Contiguous storage, 4.4× faster on a sorted-build / shuffled-lookup workload.

28containers

Across 5 categories, self-hosting where it earns its keep. ~6,808 LOC of headers.

55tests

Catch2 across 24 files. CI on 3 OSes with -Werror.

Built on each other

Numbers, not adjectives

Flat containers trade insert cost for cache locality, and win big

The tree-based containers are 12 to 18% slower than the system STL

Header-only, paid for in compile time

Run the my-stl vs std comparison yourself.

Onemoment

Built on each other

Numbers, not adjectives

Flat containers trade insert cost for cache locality, and win big

The tree-based containers are 12 to 18% slower than the system STL

Header-only, paid for in compile time

Run the my-stl vs std comparison yourself.