Episode 22

Let's do the Math!

Published on: Nov 21, 2025

Sam Nadler (00:00:05): Hey everyone and welcome to Built This Week, the podcast where we share what we’re building, how we’re building it, and what it means for the world of AI and startups. I’m Sam Nadler, co-founder here at Ryz Labs.


Jordan Metzner (00:00:10): And this week, I built an LLM Math Roaster. You put in a math problem, it’ll score it, roast it, and we’ll go through that in just a second. You can actually submit a custom problem too—if you have your own problem, you can just drop it in. It’s awesome.

Sam Nadler (00:00:15): Each and every week I’m joined by my friend and business partner, co-host, the math roaster himself—Jordan Metzner.

Sam Nadler (00:00:20): And this week we have a special guest from Axiom Math AI: Carina. Hello Carina, and welcome to the show. If you don’t mind, give us a one-minute intro about who you are and what Axiom does.

Carina (00:00:27): Hi Sam and Jordan. I’m Carina, founder and CEO of Axiom. At Axiom, we are building self-improving, super-intelligent reasoners—starting with an AI mathematician that can do really good math. It can prove theorems, formulate new problems, and discover new mathematical objects, like graphs and fancy functions that show up in important partial differential equations.

Carina (00:00:42): Great to be here. Super exciting topic.

Sam Nadler (00:00:47): Amazing. We’re going to jump in—just to cover the high-level docket for today: first, we’re going into a cool tool that Jordan built to really put the current LLMs up against each other for math proofs. Then we’ll dive into Axiom Math and your mission and hear directly from you about what you’re doing and how it’s all coming together. And lastly, we’ll cover some of the latest and greatest in AI news—some big fundraises, as always, new opportunities, and what’s happened in the last couple days.

Sam Nadler (00:01:07): So with that, Jordan, I’d love for you to walk us through what you built with Axiom in mind.

Jordan Metzner (00:01:12): Yeah. So—hey Carina, welcome to the show.

Carina (00:01:14): Hey!

Jordan Metzner (00:01:16): Hey Sam. Another great episode, another crazy week in AI. With the news of Gemini 3 dropping this week and a bunch of other things, it’s almost never a dull moment in the AI world.

Jordan Metzner (00:01:24): When we first spoke with Carina, we talked about some ideas for building vibe-coding applications that could be cool for math, and even cool for Axiom in general. So I built an LLM Math Roaster. Carina, I hope you like it. I’ll walk through how it works—it’s “productionized,” so it actually works pretty well.

Jordan Metzner (00:01:38): I’ve got it in light mode and dark mode—pretty cool, right? I have some style.

Sam Nadler (00:01:42): I like that.

Jordan Metzner (00:01:44): Dark mode is better for math, right? That feels a little “math-y” to me.

Carina (00:01:47): I’m a light-mode-for-math person.

Jordan Metzner (00:01:49): Okay, fine—light mode for math.

Jordan Metzner (00:01:52): Here’s how it works: I collected a bunch of different problems of different difficulty levels. Then I wired up different models: Gemini 2.5 Pro, GPT-5, Claude Sonnet 4.5, and Grok 4 Fast. I wanted to add Grok 4.1 and Gemini 3, but those models just came out like yesterday and I don’t have API access yet. Even GPT-5.1 is out, but I didn’t get it in time.

Jordan Metzner (00:02:12): Before I run it, let me show you what the app does. You put in a math problem, it’ll score it, roast it, and we’ll go through that in a second. You can submit a custom problem—if you have your own, you can just drop it in. It’s awesome.

Jordan Metzner (00:02:23): We also have a history view—so for all the math problems that have run in the past, we can see how well each model performed. And then, I thought this would be fun for your team, Carina: I did a little extra and built an API.

Jordan Metzner (00:02:35): Your team can generate their own API key and, using my API link, pass in a question. In the background it runs all the models, gets their results, and then you can compare them as we go.
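For anyone who wants to call it, a hypothetical client request might look like this in Python; the endpoint URL, auth header, and response shape are assumptions for illustration, not the app's documented interface:

```python
# Hedged sketch: hypothetical call to the Math Roaster API Jordan describes.
# The URL, header, and JSON fields are placeholders, not the real contract.
import requests

resp = requests.post(
    "https://math-roaster.example.com/api/roast",      # placeholder endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # key generated in the app
    json={"problem": "Prove that 2 + 2 = 4 in Lean 4 syntax."},
    timeout=300,  # all four models run server-side before the call returns
)
resp.raise_for_status()
print(resp.json())  # per-model Lean proofs plus the judge's scores
```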

Jordan Metzner (00:02:44): Let’s jump into it. First—I’m okay at math, but I’m certainly nowhere near your level, Carina. So I wanted to start with something really easy just to prove the app works.

Jordan Metzner (00:02:52): So I started with: “2 + 2 = 4.” That’s about my level of math. I think I can solve that one.

Jordan Metzner (00:02:58): What I have all these models doing is: I’ve got Gemini, GPT, Claude, Grok—each writing a Lean proof. You can see the response times from each model; we can compare the results.

Jordan Metzner (00:03:09): And of course we have the problem that no model knows which model is actually the smartest. So I chose ChatGPT as the judge. ChatGPT evaluates the responses from all of the models.
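Under the hood, the pattern Jordan describes is a simple fan-out-and-judge loop. Here is a minimal sketch, assuming a hypothetical call_model() wrapper around each provider's chat API (the model IDs and judging rubric are illustrative, not the app's actual code):

```python
# Minimal sketch of the fan-out-and-judge flow, not the app's actual code.
import json
from concurrent.futures import ThreadPoolExecutor

MODELS = ["gemini-2.5-pro", "gpt-5", "claude-sonnet-4.5", "grok-4-fast"]

def call_model(model: str, prompt: str) -> str:
    """Hypothetical helper: route to the right provider SDK, return the reply text."""
    raise NotImplementedError("wire up each provider's SDK here")

def roast(problem: str) -> dict:
    prompt = f"Provide a formal proof in Lean 4 syntax.\n\nProblem: {problem}"
    # Fan the same prompt out to every model concurrently and collect replies.
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        answers = dict(zip(MODELS, pool.map(lambda m: call_model(m, prompt), MODELS)))
    # No model knows which model is smartest, so one of them plays judge.
    rubric = (
        "Score each Lean proof from 0 to 100 and roast the weak ones. "
        "Reply as JSON mapping model name to {score, roast}.\n\n" + json.dumps(answers)
    )
    return {"answers": answers, "judge": call_model("gpt-5", rubric)}
```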

Jordan Metzner (00:03:19): Let’s just get this finished with 2 + 2 and then we’ll move to a more complicated problem.

Sam Nadler (00:03:23): Great baseline test.

Jordan Metzner (00:03:25): It’s great for evaluation. It has to process each one of these against the “correct” source.

Jordan Metzner (00:03:30): Okay—ChatGPT’s feedback: all of them are correct. Great job. Fantastic. Two plus two equals four. All models know that. I think we’re in good shape.

Sam Nadler (00:03:39): So the universe is still intact.

Jordan Metzner (00:03:41): Exactly. So far, we have a basis of truth. Right, Carina?

Carina (00:03:44): Yes, excellent baseline.

Jordan Metzner (00:03:47): I’m not a perfect scientist but I know a little bit. I have a few different problems lined up—maybe you want to choose which one we do next?

Carina (00:03:53): Let’s do Fermat’s Little Theorem.

Jordan Metzner (00:03:55): Okay, fantastic. That’s considered an intermediate problem here.

Jordan Metzner (00:03:59): Here’s the problem: For a prime number p and any integer a not divisible by p, we have a^(p−1) ≡ 1 (mod p). Provide a formal proof in Lean 4 syntax. I’ll read it out loud just for the audience.

Jordan Metzner (00:04:11): First: “Let’s roast the LLMs.”

Jordan Metzner (00:04:13): So now we’ll have the LLMs working. This one is going to take a little longer, so we can chat in the meantime.

Carina (00:04:18): It’s fascinating—there are so many famous theorems. I think when I selected it, I actually had something else in mind, but yes, I think this one is Fermat’s Little Theorem.

Jordan Metzner (00:04:27): Yeah—ChatGPT told me that when I built the app. From vibe-coding this thing, I’m pretty sure it’s Fermat’s Little Theorem.

Jordan Metzner (00:04:33): The good part is you can submit whatever formula you want. I tried to get a collection of about ten problems with different difficulty levels so we could test the app and see how it does. The easiest one was the Pythagorean theorem—and even that felt a little too difficult just to check “basis of truth.” That’s why I did the very basic arithmetic one, just to check if everything’s working.

Sam Nadler (00:04:52): Makes sense.

Jordan Metzner (00:04:54): And all of this is for Lean—so I’m generating Lean responses.

Jordan Metzner (00:04:57): Okay, I’m starting to see results—here we go.

Jordan Metzner (00:05:00): Here’s Gemini Pro’s response. It gives the theorem and some explanation, and then it says it will provide a Lean version.

Jordan Metzner (00:05:06): Here’s GPT-5—that one is clearly Lean.

Jordan Metzner (00:05:09): If you scroll down in Gemini’s answer, it’s below—the Lean part.

Sam Nadler (00:05:13): Okay, so we’ve got multiple Lean proofs.

Jordan Metzner (00:05:15): Right. We’ve got things like import Mathlib and the rest of the Lean code.
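For readers who haven't seen Lean, the kind of answer the models return for Fermat's Little Theorem looks roughly like the sketch below, assuming the current Mathlib lemma name (Mathlib already carries the theorem as ZMod.pow_card_sub_one_eq_one, so the proof is a one-line appeal to it):

```lean
import Mathlib

-- Fermat's Little Theorem stated over ZMod p: for prime p and nonzero a,
-- a ^ (p - 1) = 1. The proof defers to the existing Mathlib lemma.
theorem fermat_little {p : ℕ} [Fact p.Prime] (a : ZMod p) (ha : a ≠ 0) :
    a ^ (p - 1) = 1 :=
  ZMod.pow_card_sub_one_eq_one ha
```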

Jordan Metzner (00:05:18): First, we can compare the results—review them all side by side. Then we can judge the proofs. I think that’s the way to do it.

Jordan Metzner (00:05:24): We actually have two ways to judge: the app will score them automatically, or we can take one of the Lean proofs and put it directly into a Lean environment and see if it compiles. That might be more painful, but more exact.

Carina (00:05:35): Yeah, compiling in Lean would be the definitive test.

Jordan Metzner (00:05:38): Exactly. We could pick whichever answer we want, grab the Lean code, and test it.
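One way to run that test programmatically is sketched below, assuming a local Lean project with Mathlib on the toolchain; the function and file names are placeholders:

```python
# Sketch: check a model's Lean proof by compiling it inside a Lean project.
import pathlib
import subprocess

def compiles(lean_code: str, project_dir: str) -> bool:
    src = pathlib.Path(project_dir) / "Scratch.lean"  # throwaway file name
    src.write_text(lean_code)
    # `lake env lean <file>` elaborates the file with the project's
    # dependencies (e.g. Mathlib); exit code 0 means every proof checked.
    result = subprocess.run(
        ["lake", "env", "lean", str(src)],
        cwd=project_dir, capture_output=True, text=True,
    )
    return result.returncode == 0
```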

Jordan Metzner (00:05:42): But for now, let’s just let the judging finish and see how well it does.

Jordan Metzner (00:05:47): It’s a really cool way to compare any type of multi-model output against any dataset. Here, I chose Lean and math, but this could be a lawyer writing a contract, sending it out to four models, and asking them all to bring back a contract—then you judge which is best. Lots of use cases beyond math.

Jordan Metzner (00:06:01): Okay, cool. I have a leaderboard. It gave Gemini the best score. ChatGPT got 70, and so on and so forth.

Jordan Metzner (00:06:08): The best proof, apparently, is Gemini’s. But interestingly, for one of the others it says it proved the theorem for natural numbers instead of integers—it didn’t deal with negative numbers. That’s fascinating.

Carina (00:06:18): Yeah, that happens a lot. The subtle differences in domain really matter.

Jordan Metzner (00:06:22): So Gemini scored a 98—pretty wild.

Jordan Metzner (00:06:26): I think it would be really interesting, Carina, to get thousands of problems like this and just iterate over and over, to find where these models truly fail. My guess is, for each major type of problem set, you’d identify weaknesses pretty quickly.

Carina (00:06:37): That’s actually the first thing we did. Back when the office was still just a few chairs and a plastic folding table—that was basically our initial workflow.

Jordan Metzner (00:06:45): That’s awesome.

Jordan Metzner (00:06:47): Anyway, that’s my Math Roaster. I hope you like it.

Carina (00:06:49): It’s great—super fun and useful.

Sam Nadler (00:06:52): That was super cool. Okay, let’s jump into Axiom.

Sam Nadler (00:06:55): Carina, obviously you’re a super young entrepreneur, super excited about your new round of investment—but let’s start from scratch. Tell us: how did this whole thing come about? What is Axiom, and what are you doing every day?

Carina (00:07:05): Axiom is, in my opinion, as cool as the name sounds.

Carina (00:07:10): We’re building an AI mathematician—a model that can not only do really well on benchmark problems, like Olympiad math problems, but also develop capabilities like a research mathematician. That includes the ability to conjecture—to propose new, unseen problems and hypotheses—and then try to prove and verify them.

Carina (00:07:25): So you have a “prover” and a “conjecturer” talking to each other, forming a self-play loop, like we talked about in our vision blog.
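In rough pseudocode, that loop might look like the sketch below. It is purely conceptual: propose_conjectures() and attempt_proof() are hypothetical LLM-backed helpers, and compiles() is a Lean checker like the one sketched earlier in this transcript, not Axiom's actual system:

```python
# Conceptual sketch of the prover/conjecturer self-play loop; every helper
# here is hypothetical.

def self_play_round(library: list[str]) -> list[str]:
    for stmt in propose_conjectures(library):  # conjecturer: new candidate theorems
        proof = attempt_proof(stmt)            # prover: tries to prove each one
        # Lean is the referee: only machine-checked proofs enter the library,
        # where they become training signal for the next round.
        if proof and compiles(stmt + "\n" + proof, "./lean-project"):
            library.append(stmt)
    return library
```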

Carina (00:07:32): Additionally, the system can formalize existing math—converting massive amounts of math corpus data into Lean, turning math into a programming language. That has a lot of advantages, because we’ve already seen reinforcement learning on verifiable domains like coding achieve incredible breakthroughs.

Carina (00:07:46): By combining AI, math, and programming languages—those three pillars together—we think we’ll see a step-function jump in math capabilities, especially in “real mathematics” that goes beyond the benchmarks most models are tested on.

Sam Nadler (00:08:00): That’s incredibly interesting.

Sam Nadler (00:08:02): Historically, math was mostly “run” on CPUs, right? So is the movement of running math equations on GPUs this big jump—the thing that unlocks new opportunities? Is that a fair way to say it?

Carina (00:08:12): It’s close, but I wouldn’t say compute is the main bottleneck. We use a lot of both GPU and CPU in our engineering. The large language model components run on GPUs, but the Lean part compiles on CPU. There’s an interesting engineering problem in how to maximize utilization, because CPU can drag things down a bit.

Carina (00:08:27): But the real bottleneck is data scarcity. There isn’t that much Lean. There are more than a trillion tokens of Python code—and only about 10 million tokens of Lean. That’s a 100,000× data gap.
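The arithmetic behind that gap checks out:

$$\frac{10^{12}\ \text{Python tokens}}{10^{7}\ \text{Lean tokens}} = 10^{5} = 100{,}000\times$$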

Carina (00:08:39): So data is what’s really hot and difficult here.

Sam Nadler (00:08:43): So that’s where artificial data kicks in, right?

Carina (00:08:46): Exactly. Artificial or synthetic data becomes much more important: generating artificial Lean problems, converting informal math into formal Lean, augmenting existing data in a way that makes sense.

Carina (00:08:56): A lot of companies—especially some Chinese teams—are hiring massive numbers of Lean data labelers and programmers to generate human data. It’s very expensive and doesn’t scale well.

Carina (00:09:05): We put a big emphasis on synthetic data generation—converting the informal data that already exists, scraping it, and augmenting it so it can be formalized into Lean.
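A compiler-in-the-loop sketch of that conversion step, reusing the hypothetical call_model() and compiles() helpers from the earlier sketches (the prompt wording and retry count are assumptions):

```python
# Sketch of an autoformalization loop: translate informal math into a Lean
# statement and keep it only if it elaborates. Not Axiom's actual pipeline.

def autoformalize(informal: str, attempts: int = 4) -> str | None:
    for _ in range(attempts):
        lean = call_model(
            "gpt-5",
            "Translate this statement into a Lean 4 theorem using Mathlib. "
            "Output the statement only; the proof may be `sorry`.\n\n" + informal,
        )
        if compiles(lean, project_dir="./lean-project"):
            return lean  # well-formed statement, ready for the training corpus
    return None
```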

Carina (00:09:13): We also explore ways to take existing formal math data in Lean—the “key data”—and make it 100× or 1,000× larger.

Sam Nadler (00:09:22): That’s super cool.

Sam Nadler (00:09:24): Speaking of Lean, now that everyone has AI and math is evolving—it feels like the jump from “no calculator” to “calculator” days. Is Lean being taught broadly in math education programs now? Are students actually writing their proofs in Lean? Or is it still early?

Carina (00:09:36): Great question. A lot of universities are teaching Lean now. I know a couple in the U.S.—like the University of Washington, where Professor Jarod Alper in the math department teaches Lean. At CMU there’s a strong Lean community—people like Professor Jeremy Avigad, Professor Sean Welleck, and others.

Carina (00:09:49): Stanford has an automated reasoning lab with Professor Clark Barrett and others, and grad students are teaching Lean there. Many universities in Europe and the UK teach Lean as well.

Carina (00:09:58): In fact, mathlib—the big Lean math library—started as a teaching effort at Imperial College London with Professor Kevin Buzzard. He wanted to teach students in his introductory proofs class to write better, more rigorous proofs, so he gave them the exercise of formalizing everything in Lean.

Carina (00:10:11): Interestingly, it wasn’t the struggling students who got most engaged—it was the advanced students, the ones already taking grad-level classes. They were bored and had nothing better to do, so they joined the effort of formalizing the whole undergraduate math corpus in mathlib.

Carina (00:10:24): Two or three of those early students are actually working with Axiom now. Some of them are my closest friends from the last five years—we were part of an MIT–Imperial College London exchange program. So it’s really great to have this generation of “Lean-native” students, who are now PhD students or have already graduated.

Sam Nadler (00:10:41): That’s awesome.

Sam Nadler (00:10:43): Okay, how do these LLMs compare to what you’re doing at Axiom? If someone is a layperson into AI and tech, they might say: “What’s the difference between this and ChatGPT? What’s the difference between a big language model and what you’re building?”

Carina (00:10:53): Large language models do math in the informal space. They produce proofs in natural language—English—and they can provide multi-step reasoning traces and numerical answers. The numerical answer is verifiable—you can check that 971 is 971.

Carina (00:11:06): But without detailed, step-by-step deduction that’s programmatically checked, it’s hard to verify all the intermediate reasoning. For a 3,000-step proof, even an expert can’t easily see if there’s a hidden bug.

Carina (00:11:17): On the other hand, the informal-formal hybrid system we’re building at Axiom outputs formal Lean programs underneath the natural language. The Lean program is executable. We don’t need an LLM judge or a human judge—we just run it like you’d run a Python program. That’s the big difference.

Carina (00:11:31): In the formal model space, Google DeepMind’s AlphaProof system is a formal system. This year they have the Gemini-based system. OpenAI has the Strawberry/IMO team—those are informal systems that did really well on IMO and Olympiad benchmarks. Those benchmarks have a kind of curriculum or syllabus that constrains the search space.

Carina (00:11:49): But as proofs get very complex and multi-step, they require really hard math capabilities. We believe formal systems will be extremely helpful to keep scaling on those target problems.

Sam Nadler (00:12:01): That’s so, so cool.

Sam Nadler (00:12:03): Obviously math is the backbone of technology, so it feels like there are mathematical breakthroughs on the edge here—that once these systems really get their power, they’ll unlock entirely new opportunities.

Carina (00:12:12): Exactly.

Sam Nadler (00:12:15): All right, Carina—thank you for telling us about Axiom. That was amazing.

Sam Nadler (00:12:18): Let’s talk a bit about the news this week—it’s been another crazy week.

Jordan Metzner (00:12:22): Yeah, Gemini 3 just dropped. I don’t know if you’ve tried it yet. I downloaded Antigravity, the new Google code editor. Lots to tell you, but I’m being throttled everywhere.

Jordan Metzner (00:12:30): I can’t get Gemini 3 to work in Cursor, I can’t get it to work in Antigravity. I burned all my tokens. I’m munching tokens all night long pushing Gemini 3.

Jordan Metzner (00:12:39): But—I’ve been incredibly impressed with its performance for coding. And its design taste is much nicer. I really like Gemini 3. It’s also much cheaper to run compared to GPT.

Carina (00:12:48): One interesting thing is: on a certain benchmark—I can’t share which yet—our model actually beats Gemini 3 in math and proofs.

Sam Nadler (00:12:55): Oh wow.

Carina (00:12:57): Gemini 3 has been outperforming almost every LLM on the market in terms of test scores. I tried using it this morning to build a personal meditation app and ran into some hiccups—maybe user error, maybe the model.

Carina (00:13:08): It got 100% on AIME 2025, which is the high-school level exam—three levels below the IMO. If you win AIME you go to USAMO and then to IMO. So 100% is pretty impressive.

Carina (00:13:19): Without tools it got 95%, so you see how important code execution and tools are. That’s a significant improvement.

Jordan Metzner (00:13:27): Yeah, super interesting.

Carina (00:13:29): There are 15 problems on the AIME exam—95% means it got about one wrong. Last time, it got an 88, so it probably got two wrong without tools.

Carina (00:13:37): It’s fascinating.

Jordan Metzner (00:13:40): Gemini is super cool.

Jordan Metzner (00:13:42): I think the Gemini model is interesting for a lot of reasons. This is the first big model coming out of one of the top three hyperscalers—Amazon, Google, Microsoft—where the hyperscaler’s own model is genuinely competitive. Historically, their models were behind Anthropic’s Claude and OpenAI’s. This is a big leap.

Jordan Metzner (00:13:57): Google has a lot of money, a lot of servers, a lot of warehouses, and a lot of products to plug Gemini 3 Pro into—from Google Workspace to Search to YouTube to Ads. Huge opportunity.

Sam Nadler (00:14:07): Yeah.

Sam Nadler (00:14:09): And they already have 650 million monthly active users across Search and the Gemini app. That’s an insane amount of users.

Jordan Metzner (00:14:15): Insane.

Sam Nadler (00:14:17): We also saw this week that Warren Buffett made a massive investment in Google. And if you saw the Antigravity launch videos, Sergey was front and center again.

Jordan Metzner (00:14:25): Yep. CEOs getting back in the driver’s seat.

Sam Nadler (00:14:28): Speaking of CEOs back in the driver’s seat—story number two: Jeff Bezos is starting a new AI startup called Project Prometheus. He’s going to be co-CEO.

Jordan Metzner (00:14:37): To me, it feels like Amazon doesn’t really have access to a foundational model lab like Microsoft does with OpenAI, so they’re trying to build their own. Google has DeepMind, Microsoft has OpenAI, and Amazon hasn’t really had a strong stake in that part of the ecosystem.

Sam Nadler (00:14:51): Yeah, totally.

Carina (00:14:53): I think this one is focused more on physical science and experimentation—kind of like reward testing for science. I’m quite excited about that.

Carina (00:15:01): It doesn’t seem exactly like an “OpenAI-style foundation model lab.” It feels more like applying theoretical findings and figuring out how to run experiments and gather data in the physical world. Hard to say exactly what they’ll do, but that’s my guess.

Jordan Metzner (00:15:14): They also brought in people from Verily—the biotech startup inside Alphabet—so it’s definitely got that science-heavy DNA.

Jordan Metzner (00:15:21): Overall, I think the trend is billionaire CEOs who could be on a beach somewhere are choosing to come back and build, because we’re in a new build era. And that’s exciting.

Sam Nadler (00:15:30): 100%.

Sam Nadler (00:15:32): Okay, last story of the day: major funding news from Suno.

Jordan Metzner (00:15:36): Yeah—Suno raised a $250 million Series C from Menlo at something like a $2.5 billion valuation.

Jordan Metzner (00:15:42): We’ve talked about them a few times on here, Sam, but I love Suno. I find it so fun. It lets you quickly generate highly-produced music from a single prompt.

Carina (00:15:51): Yeah, it’s great.

Jordan Metzner (00:15:53): Maybe to end the episode we’ll have an Axiom theme song we can layer in—Suno-generated, obviously. With a quick prompt about Carina and Axiom, we could have a pop song, a reggae song, an EDM song—you name it.

Jordan Metzner (00:16:03): We could have Axiom proving math problems in song—theoretically. I’m half-joking, but it would be fun.

Jordan Metzner (00:16:10): It’s just a tool we really like talking about. In fact, the song for Built This Week—our podcast intro—we created that with AI as well.

Carina (00:16:17): Yeah, I remember playing with Suno. It’s amazing. I don’t know anything about composition, but sometimes there’s a melody in my head I’d love to turn into reality.

Carina (00:16:25): I feel like when AI gets really, really good, people who are “hidden Ramanujans”—who have mathematical creativity but were never trained formally in proofs—might use tools like Axiom to generate actual theorems. Just like artists and musicians use AI tools to create music without classical training.

Sam Nadler (00:16:39): Totally.

Jordan Metzner (00:16:41): We covered in a previous episode that there’s a new programming language where you can program music. And music, at the end of the day, is incredibly mathematical.

Jordan Metzner (00:16:47): It’ll be interesting to see how new math inventions might lead to new styles of music.

Carina (00:16:51): Most of the humanities or social science classes that math majors at MIT take end up being music—actual music theory. It’s incredibly mathematical.

Carina (00:16:58): It doesn’t matter what instrument you play—once you dive in, you start learning about triads, triplets, chords, intervals… all the structure, all the symmetry.

Sam Nadler (00:17:07): Love that.

Sam Nadler (00:17:09): Carina, this was a great episode. Thank you so much for joining us.

Jordan Metzner (00:17:12): Yeah, this was awesome. Loved learning about Axiom, talking math, and building new tools this week.

Jordan Metzner (00:17:17): Gemini 3 has kept me up all night—I don’t know about you guys—but I’m going to keep working with it and playing with new stuff.

Sam Nadler (00:17:23): This was also a pretty big pivot from our normal format, so it’s been extremely informative. Really appreciate you coming on.

Carina (00:17:29): Thank you so much. I had a lot of fun.

Sam Nadler (00:17:32): All right, that’s it for this week. See you all next time on Built This Week.

Hosted by Sam Nadler and Jordan Metzner
