C++23 · header-only · MIT · sha 5e209d4
[ sys ]my-stlconfirmed in repo
A standard librarythat self-hosts.
I rebuilt 28 STL containers from scratch in header-only C++23, each layered on the ones beneath it the way a real standard library is. map rides a red-black tree. lru_cache is a list plus a hash map. Every speed claim on this page is a derived artifact, read straight from the committed benchmark CSV, with the honest losses left in.
- 0.230×
- flat_set build+find vs std::set. Contiguous storage, 4.4× faster on a sorted-build / shuffled-lookup workload.
- 28containers
- Across 5 categories, self-hosting where it earns its keep. ~6,808 LOC of headers.
- 55tests
- Catch2 across 24 files. CI on 3 OSes with
-Werror.
Most STL clones are a pile of parts.
Reimplementing the STL is a rite of passage, so the internet is full of them. Almost all share two flaws. The containers stand alone, twenty copies of the same growth and node-allocation logic, none of them composing. And the speed is asserted, never shown: a README says fast and you take it on faith.
The harder version is harder on both axes. Build a library where the higher-level containers are actually implemented on top of lower-level ones, the way a real standard library stacks map on a tree and priority_queue on a heap. Then make every speed claim reproducible: a number on the page should trace to a committed run, not get typed in by hand. Get both right and the project stops being a toy and starts being evidence that I know how a standard library is put together, and how to prove it goes fast.
Layer the containers. Generate the numbers.
Two decisions carry the project: containers self-host on each other, and every benchmark number is a build artifact, not a claim.
Everything is header-only. Each container is a .hpp declaration paired with a .tpp of template definitions, wired through CMake paths. That makes the self-hosting visible: when map wants a balanced tree it includes the rb-tree header instead of copying 670 lines of rotation logic. The dependency graph in §3 is read straight out of those include directives, not drawn by hand.
Built on each other
Higher-level containers reuse lower-level storage instead of duplicating it. The red-black tree backs both map and set; lru_cache is a list plus an unordered_map; the hash map itself chains a forward_list over a vector of buckets.
Numbers, not adjectives
A microbenchmark harness compares each container to its closest std:: equivalent. A Python pipeline turns raw runs into a committed CSV/JSON summary plus the light and dark SVG charts, so the README table is generated from the data and cannot drift from it.
The self-hosting graph.
Every edge below is a real #include read out of the headers. Storage primitives sit at the base; everything above them is composed, not re-implemented. Four foundation containers (vector, rb-tree, list, forward_list) carry the whole structure.
verified include graph · every arrow is a real #include @ sha 5e209d4
What I gave up, on purpose.
The benchmark mix is deliberately honest. The wins are where the data structure earns them; the losses stay on the page, because pretending a from-scratch node-based tree beats libc++ would be the tell of someone who does not measure.
Flat containers trade insert cost for cache locality, and win big
flat_mapandflat_setstore sorted keys in a contiguousvectorand binary-search withlower_bound. That makes insert and erase O(n), stated plainly in the container docs. The payoff on a sorted-build / shuffled-lookup workload is 0.378× and 0.230× of node-basedstd::mapandstd::set(2.6× to 4.4× faster), because the CPU loves contiguous memory.The tree-based containers are 12 to 18% slower than the system STL
maplands at 1.122×,setat 1.136×, andunordered_mapat 1.183× of the system library. That is the expected gap for a from-scratch red-black tree against a decade-tuned standard library, and it is reported as a loss rather than massaged into a win. Matching a mature node allocator was never the goal; understanding the structure was.Header-only, paid for in compile time
Splitting each container into.hpp+.tppkeeps the library drop-in with no link step, at the cost of pushing template instantiation into every translation unit. For a learning library meant to be read and reused, that trade is the right one: the self-hosting graph stays legible and the API staysstd::-shaped.
my-stl vs the system STL.
Median ns/op ratio, my-stl over std::. Lower is better; the amber line is the 1.0× parity baseline. Bars left of it are wins, bars right of it are the honest losses. All ten numbers are read from the committed benchmark CSV.
- flat_set build + find
- flat_map build + find
- deque push_back + pop_front
- vector push_back (no reserve)
- vector push_back (reserve)
- small_vector push_back
- stable_vector push_back
- map build + find
- set build + find
- unordered_map emplace (reserve)
std:: across the board would be lying; this one tells you exactly where contiguous storage pays off and where a decade of allocator tuning still wins.[ interactive · playground ]
Run the my-stl vs std comparison yourself.
The playground rebuilds this chart from the same committed CSV and lets you sort by container, filter to wins or losses, and read each row against the 1.0× baseline. Same numbers, your hands on the dials.