    Benjamin Franklin Institute

    AI Math Benchmarks: AI’s Growing Capabilities

    By Team_Benjamin Franklin Institute | February 25, 2026

    Mathematics is often regarded as the ideal domain for measuring AI progress. Math’s step-by-step logic is easy to track, and its definitive, automatically verifiable answers remove human subjectivity from the grading. But AI systems are improving at such a pace that math benchmarks are struggling to keep up.

    Way back in November 2024, non-profit research organization Epoch AI quietly released Frontier Math. A standardized, rigorous benchmark, Frontier Math was designed to measure the mathematical reasoning capabilities of the latest AI tools.

    “It’s a bunch of really hard math problems,” explains Greg Burnham, Epoch AI Senior Researcher. “Originally, it was 300 problems that we now call tiers 1–3, but having seen AI capabilities really speed up, there was a feeling that we had to run to stay ahead, so now there’s a special challenge set of extra carefully constructed problems that we call tier 4.”

    To a rough approximation, tiers 1–4 range from advanced undergraduate through to early postdoc level mathematics. When the benchmark was introduced, state-of-the-art AI models were unable to solve more than 2% of its problems. Fast forward to today and the best publicly available AI models, such as ChatGPT 5.2 Pro and Claude Opus 4.6, are solving over 40% of Frontier Math’s 300 tiers 1–3 problems, and over 30% of the 50 tier 4 problems.
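    To make those percentages concrete, the short sketch below converts them into problem counts. The helper name `problems_at` is hypothetical, and the counts are thresholds implied by the quoted percentages and set sizes, not scores reported by Epoch AI.

```python
def problems_at(percent, total):
    """Number of problems that `percent` of `total` corresponds to (rounded down)."""
    return percent * total // 100

# Figures quoted in the article: set sizes and percentage thresholds.
launch_ceiling = problems_at(2, 300)    # "no more than 2%" of the original 300 problems
tiers_1_3_floor = problems_at(40, 300)  # "over 40%" of the 300 tiers 1-3 problems
tier_4_floor = problems_at(30, 50)      # "over 30%" of the 50 tier 4 problems

print(f"At launch: at most {launch_ceiling} of 300 problems solved")
print(f"Today, tiers 1-3: more than {tiers_1_3_floor} of 300 problems solved")
print(f"Today, tier 4: more than {tier_4_floor} of 50 problems solved")
```

    In other words, the best models have gone from solving at most 6 problems to solving more than 120 of the tiers 1–3 problems and more than 15 of the tier 4 problems.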

    AI takes on PhD level mathematics

    And this dizzying pace of advancement shows no sign of abating. For example, Google DeepMind recently announced that Aletheia, an experimental AI system derived from Gemini Deep Think, achieved publishable PhD level research results. Though mathematically obscure (it involved calculating certain structure constants in arithmetic geometry called eigenweights), the result is significant in terms of AI development.

    “They’re claiming it was essentially autonomous, meaning a human wasn’t guiding the work, and it’s publishable,” Burnham says. “It’s definitely at the lower end of the spectrum of work that would get a mathematician excited, but it’s new—it’s something we truly haven’t really seen before.”

    To place this achievement in context, every Frontier Math problem has a known answer that a human has already derived. Aletheia’s result did not. Though a human could probably have achieved it “if they sat down and steeled themselves for a week,” says Burnham, no human had ever done so.

    Aletheia’s results and other recent achievements by AI mathematicians suggest that new, tougher benchmarks are needed to understand AI capabilities, and needed fast, because existing ones will soon become irrelevant. “There are easier math benchmarks that are already obsolete, several generations of them,” says Burnham. “Frontier Math will probably saturate [meaning state-of-the-art AI models score 100%] within the next two years; could be faster.”

    The First Proof challenge

    To begin to address this problem, on February 6, a group of 11 highly distinguished mathematicians proposed the First Proof challenge: a set of 10 extremely difficult math questions that arose naturally in the authors’ research, each with a proof of roughly five pages or less that had not been shared with anyone. The challenge was a preliminary effort to assess how well AI systems can solve research-level math questions on their own.

    The challenge generated serious buzz in the math community. Professional and amateur mathematicians, along with teams including OpenAI, stepped up to it. But by the time the authors posted the proofs on February 14, no one had submitted correct solutions to all 10 problems.

    In fact, far from it. The authors themselves solved only two of the 10 problems using Gemini 3.0 Deep Think and ChatGPT 5.2 Pro. Most outside submissions fared little better, apart from OpenAI’s. With “limited human supervision,” OpenAI’s most advanced internal AI system solved five of the 10 problems, a result that members of the mathematics community met with emotions ranging from awe to disappointment. The team behind First Proof plans an even tougher second round on March 14.

    A new frontier for AI

    “I think First Proof is terrific: it’s as close as you could realistically get to putting an AI system in the shoes of a mathematician,” says Burnham. Though he admires how First Proof tests AI’s mathematical utility for a wide range of mathematics and mathematicians, Epoch AI has its own new approach to testing—Frontier Math: Open Problems. Uniquely, the pilot benchmark consists of 14 open problems (with more to follow) from research mathematics that professional mathematicians have tried and failed to solve. Since Open Problems’ release on January 27, none have been solved by an AI.

    “With Open Problems, we’ve tried to make it more challenging,” says Burnham. “The baseline on its own would be publishable, at least in a specialty journal.” What’s more, each question is designed so that it can be automatically graded. “This is a bit counterintuitive,” Burnham adds. “No one knows the answers, but we have a computer program that will be able to judge whether the answer is right or not.”
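    The idea that a program can grade an answer nobody knows is easier to see with a toy example. The sketch below is not Epoch AI’s grader and the problem is not one of the benchmark’s; it assumes a hypothetical task of the form “find a set of a given size, drawn from 1 to some upper bound, that contains no three-term arithmetic progression.” Finding such a set may be hard, but checking a proposed one is mechanical. The function names `has_no_three_term_ap` and `grade` are illustrative.

```python
from itertools import combinations

def has_no_three_term_ap(values):
    """Return True if no three distinct values form an arithmetic progression."""
    present = set(values)
    for a, b in combinations(sorted(present), 2):
        # a < b are the first two terms; the third term would be 2*b - a.
        if 2 * b - a in present:
            return False
    return True

def grade(candidate, size, upper_bound):
    """Accept a candidate only if it meets every stated requirement.

    The grader never needs to know a correct answer in advance; it only
    verifies the properties that any correct answer must have.
    """
    items = list(candidate)
    distinct = set(items)
    return (
        len(items) == size
        and len(distinct) == size
        and all(isinstance(x, int) and 1 <= x <= upper_bound for x in items)
        and has_no_three_term_ap(distinct)
    )

print(grade([1, 2, 4, 5], size=4, upper_bound=5))  # valid construction
print(grade([1, 2, 3, 5], size=4, upper_bound=5))  # contains 1, 2, 3
```

    The first call prints True and the second prints False, because 1, 2, 3 is an arithmetic progression. The same asymmetry between finding and checking is what lets a benchmark of unsolved problems be scored automatically.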

    Burnham sees First Proof and Open Problems as being complementary. “I would say understanding AI capabilities is a more-the-merrier situation,” he adds. “AI has gotten to the point where it’s, in some ways, better than most PhD students, so we need to pose problems where the answer would be at least moderately interesting to some human mathematicians, not because AI was doing it, but because it’s mathematics that human mathematicians care about.”
