[ ai ]llmathsha c0f450edocumented-only

A proof assistantthat decides when to compute.

I paired a fine-tuned DeepSeek-Math 7B with FAISS retrieval over NaturalProofs and a SymPy tool registry, then let one autonomous ReAct loop decide when to reach for each. A documented ablation puts it at 77.6% on TheoremQA, up from a 41.2% prompted baseline.

autonomous ReAct on TheoremQA: 77.6%
autonomous ReAct on TheoremQA
prompted baseline, no tools: 41.2%
prompted baseline, no tools
fewer tokens per correct answer: ~36%
fewer tokens per correct answer
~4k LOC, one clean package: 9pkgs
~4k LOC, one clean package

▮ the honest part first

These accuracy figures ship as results.sample.json, a documented ablation rather than a recorded GPU run. I label them as sample data in the repo and here. The package architecture, the retrieval, and the tools are real and run; the benchmark numbers are documented, not measured. I would rather show you that line than bury it.

§ theoremqa · n=800 · documented sample

01the problem

A model that talks fluently about math cannot be trusted to do it.

Ask a general LLM for a derivation and it hands you something that reads like a proof: confident prose, plausible algebra, the right vocabulary. Then a sign flips, a term drops, and the conclusion is wrong. Fluent proofs are not correct proofs, and a fine-tuned model alone has no grounding in known theorems and no way to check its own algebra.

So the question I cared about was not whether a bigger model writes nicer proofs. It was: how much accuracy and token efficiency do you actually gain by handing the model a retrieval index of real theorems, verified symbolic tools, and the freedom to decide when to use them?

§ approach · src/llmath · 9 subpackages

02the approach

Three sources of ground truth behind one model.

Each layer removes a way the model can be wrong, and the agent chooses which one it needs. Retrieval grounds claims in real theorems. Symbolic tools ground the algebra in something that cannot hand-wave. The fine-tuned model supplies the reasoning that ties them together.

Retrieval
NaturalProofs embedded and indexed in a FAISS IndexFlatIP. Normalized vectors make inner product cosine, and a metadata guard refuses to load an index built for a different dataset or split.
retrieval/faiss_retriever.py
Symbolic tools
A SymPy registry of simplify, solve, diff, integrate, plus retrieve, dispatched from tool_name: args command strings. A bare expression defaults to simplify. The algebra is computed, not hand-waved.
tools/registry.py
Fine-tuned model
DeepSeek-Math 7B with a LoRA adapter over the attention and MLP modules, 4-bit capable. It supplies the proof-style reasoning that decides which tool to reach for.
inference/model_loader.py · config.py

§ architecture · agent / retrieval / tools / prompts

03the architecture

The loop is the whole idea.

The agent runs a bounded think, tool, observe, answer cycle, parsed off a strict XML protocol. I built it to degrade gracefully instead of crashing when the model breaks the format. The manual pipeline follows a fixed path; the autonomous agent emits the same tool calls itself.

manual pipeline · the system routes

input
Question, e.g. "derivative of x²·sin(x)"
retrievalretrieval/
TheoremKB → FAISS IndexFlatIP, top-k cosine
toolstools/
SymPy registry: simplify · solve · diff · integrate
promptsprompts/
Prompt builder: retrieval + tool outputs → context
modelinference/
DeepSeek-Math 7B + LoRA · generate() · 4-bit capable
output
Proof-style answer

autonomous ReAct · the model decides

for iteration in 1..max_iterations

<think>
Free-form reasoning. The model plans its next move.
<tool>
Exactly one call as tool_name: args. Generation stops on </tool>.
<observe>
System-injected tool output. Appended to context, never generated.
<answer>
Ends the loop. Generation stops on </answer>; tokens accrue per step.

fig 1 · ReActAgent.run() · left is the system-routed manual path, right is the model-driven loop, bounded by max_iterations · src/llmath/agent/ react_agent.py · parser.py · tools/registry.py

react trace · diff: x²·sin(x) · abridged from docs/react-protocol.md

<think>Retrieve the relevant theorem.</think>
<tool>retrieve: derivative of product</tool>
<observe>[T1] Product Rule: d/dx(f·g) = f'·g + f·g' ...</observe>
<think>Apply the product rule and compute the derivative.</think>
<tool>diff: x**2*sin(x)</tool>
<observe>2*x*sin(x) + x**2*cos(x)</observe>
<answer>The derivative follows from the product rule and the computation above.</answer>

§ tradeoffs · roads not taken

04tradeoffs

What I gave up, and the one I am not hiding.

won · simplicity
A strict contract over model freedom
Exactly one tag per step. Generation stops on </tool> and </answer>, the parser keeps only the last match per tag, and max_iterations bounds the run. I traded model freedom for a loop that is easy to parse and cheap to bound. A missing tag injects a corrective observation instead of crashing.
won · zero deps
Hand-built SVG charts, no matplotlib
The benchmark plots are emitted as theme-responsive SVG with embedded prefers-color-scheme CSS, so they render in GitHub light and dark with no plotting runtime. One fewer dependency, and the charts version-control as text.
won · safety
Retrieval that refuses to mismatch
The FAISS retriever persists its index and metadata to disk and guards on load: if the metadata describes a different dataset or split than the index was built for, it refuses to load rather than returning silently wrong neighbors.
▮ honest loss
The headline numbers are sample data
The accuracy ladder ships as results.sample.json with a DEMO_RESULTS fallback, so charts and docs render without a GPU. I never recorded a real TheoremQA run, and I label the numbers as sample data everywhere they appear. The architecture is real; the benchmark is documented, not measured.

§ evidence · theoremqa test split · 800 q

05the benchmark

The ablation ladder, read honestly.

Each rung adds one grounding layer against the prompted baseline, and the deltas carry the argument. Amber marks every number here as a measured fact, in this case documented sample data sourced to benchmarks/results.sample.json.

TheoremQA accuracy · higher is better

n=800 · documented sample

baselinemodel only · no tools
41.2%anchor
+ RAGretrieval only
58.1%+16.9 pp
+ manual toolsretrieval + SymPy
69.8%+11.7 pp
+ autonomousfull ReAct loop
77.6%+7.8 pp

0.912

tool-call precision

0.689

retrieval recall@k

3.4

avg iterations

~36%

fewer tokens/correct

anchored against the prompted-LLM baseline (the named baseline) · documented sample data from benchmarks/results.sample.json at sha c0f450e, not a recorded GPU run

§ source · MIT · 64 tests · 38 unit verified passing

inline demo

Step the ReAct trace yourself

A browser-only trace player runs the SymPy tools client-side, so the symbolic outputs are genuinely correct without the 7B model or a GPU.

open the playground

repository · MIT

Read the package

9 subpackages, ~4k LOC of Python, 64 test functions, a dependency-free SVG benchmark harness, and a Gradio demo. Pinned at sha c0f450e.

github.com/ShreeChaturvedi/LLMath

python · DeepSeek-Math 7B · FAISS · SymPy · LoRA · Gradio

A model that talks fluently about math cannot be trusted to do it.

<think>Retrieve the relevant theorem.</think>

<tool>retrieve: derivative of product</tool>

<observe>[T1] Product Rule: d/dx(f·g) = f'·g + f·g' ...</observe>

<think>Apply the product rule and compute the derivative.</think>

<answer>The derivative follows from the product rule and the computation above.</answer>

A model that talks fluently about math cannot be trusted to do it.

Three sources of ground truth behind one model.

Retrieval

Symbolic tools

Fine-tuned model

The loop is the whole idea.

What I gave up, and the one I am not hiding.

A strict contract over model freedom

Hand-built SVG charts, no matplotlib

Retrieval that refuses to mismatch

The headline numbers are sample data

The ablation ladder, read honestly.

Step the ReAct trace yourself

Read the package

Onemoment

A model that talks fluently about math cannot be trusted to do it.

Three sources of ground truth behind one model.

Retrieval

Symbolic tools

Fine-tuned model

The loop is the whole idea.

What I gave up, and the one I am not hiding.

A strict contract over model freedom

Hand-built SVG charts, no matplotlib

Retrieval that refuses to mismatch

The headline numbers are sample data

The ablation ladder, read honestly.

Step the ReAct trace yourself

Read the package