Close Menu
    Trending
    • Pentagon Requests $54 Billion For AI War
    • Clavicular Hit With New YouTube Crackdown
    • Beijing’s new supply chain rules deepen concerns for US firms in China
    • India denounces ‘hellhole’ remark shared by Trump | Donald Trump News
    • New photos of Mike Vrabel and Dianna Russini emerge
    • AI search demands a new audience playbook
    • How do earthquakes end? A seismic ‘stop sign’ could help predict earthquake risk
    • Trump Announces Cease-Fire Between Israel and Lebanon
    Benjamin Franklin Institute
    Friday, April 24
    • Home
    • Politics
    • Business
    • Science
    • Technology
    • Arts & Entertainment
    • International
    Benjamin Franklin Institute
    Home»Technology»AI Coding Degrades: Silent Failures Emerge
    Technology

    AI Coding Degrades: Silent Failures Emerge

    Team_Benjamin Franklin InstituteBy Team_Benjamin Franklin InstituteJanuary 8, 2026No Comments8 Mins Read
    Share Facebook Twitter Pinterest Copy Link LinkedIn Tumblr Email VKontakte Telegram
    Share
    Facebook Twitter Pinterest Email Copy Link


    In recent months, I’ve noticed a troubling trend with AI coding assistants. After two years of steady improvements, over the course of 2025, most of the core models reached a quality plateau, and more recently, seem to be in decline. A task that might have taken five hours assisted by AI, and perhaps ten hours without it, is now more commonly taking seven or eight hours, or even longer. It’s reached the point where I am sometimes going back and using older versions of large language models (LLMs).

    I use LLM-generated code extensively in my role as CEO of Carrington Labs, a provider of predictive-analytics risk models for lenders. My team has a sandbox where we create, deploy, and run AI-generated code without a human in the loop. We use them to extract useful features for model construction, a natural-selection approach to feature development. This gives me a unique vantage point from which to evaluate coding assistants’ performance.

    Newer models fail in insidious ways

    Until recently, the most common problem with AI coding assistants was poor syntax, followed closely by flawed logic. AI-created code would often fail with a syntax error or snarl itself up in faulty structure. This could be frustrating: the solution usually involved manually reviewing the code in detail and finding the mistake. But it was ultimately tractable.

    However, recently released LLMs, such as GPT-5, have a much more insidious method of failure. They often generate code that fails to perform as intended, but which on the surface seems to run successfully, avoiding syntax errors or obvious crashes. It does this by removing safety checks, or by creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution.

    As any developer will tell you, this kind of silent failure is far, far worse than a crash. Flawed outputs will often lurk undetected in code until they surface much later. This creates confusion and is far more difficult to catch and fix. This sort of behavior is so unhelpful that modern programming languages are deliberately designed to fail quickly and noisily.

    A simple test case

    I’ve noticed this problem anecdotally over the past several months, but recently, I ran a simple yet systematic test to determine whether it was truly getting worse. I wrote some Python code which loaded a dataframe and then looked for a nonexistent column.

    df = pd.read_csv(‘data.csv’)
    df[‘new_column’] = df[‘index_value’] + 1 #there is no column ‘index_value’

    Obviously, this code would never run successfully. Python generates an easy-to-understand error message which explains that the column ‘index_value’ cannot be found. Any human seeing this message would inspect the dataframe and notice that the column was missing.

    I sent this error message to nine different versions of ChatGPT, primarily variations on GPT-4 and the more recent GPT-5. I asked each of them to fix the error, specifying that I wanted completed code only, without commentary.

    This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem. I ran ten trials for each model, and classified the output as helpful (when it suggested the column is probably missing from the dataframe), useless (something like just restating my question), or counterproductive (for example, creating fake data to avoid an error).

    GPT-4 gave a useful answer every one of the 10 times that I ran it. In three cases, it ignored my instructions to return only code, and explained that the column was likely missing from my dataset, and that I would have to address it there. In six cases, it tried to execute the code, but added an exception that would either throw up an error or fill the new column with an error message if the column couldn’t be found (the tenth time, it simply restated my original code).

    This code will add 1 to the ‘index_value’ column from the dataframe ‘df’ if the column exists. If the column ‘index_value’ does not exist, it will print a message. Please make sure the ‘index_value’ column exists and its name is spelled correctly.”,

    GPT-4.1 had an arguably even better solution. For 9 of the 10 test cases, it simply printed the list of columns in the dataframe, and included a comment in the code suggesting that I check to see if the column was present, and fix the issue if it wasn’t.

    GPT-5, by contrast, found a solution that worked every time: it simply took the actual index of each row (not the fictitious ‘index_value’) and added 1 to it in order to create new_column. This is the worst possible outcome: the code executes successfully, and at first glance seems to be doing the right thing, but the resulting value is essentially a random number. In a real-world example, this would create a much larger headache downstream in the code.

    df = pd.read_csv(‘data.csv’)
    df[‘new_column’] = df.index + 1

    I wondered if this issue was particular to the gpt family of models. I didn’t test every model in existence, but as a check I repeated my experiment on Anthropic’s Claude models. I found the same trend: the older Claude models, confronted with this unsolvable problem, essentially shrug their shoulders, while the newer models sometimes solve the problem and sometimes just sweep it under the rug.

    Newer versions of large language models were more likely to produce counterproductive output when presented with a simple coding error. Jamie Twiss

    Garbage in, garbage out

    I don’t have inside knowledge on why the newer models fail in such a pernicious way. But I have an educated guess. I believe it’s the result of how the LLMs are being trained to code. The older models were trained on code much the same way as they were trained on other text. Large volumes of presumably functional code were ingested as training data, which was used to set model weights. This wasn’t always perfect, as anyone using AI for coding in early 2023 will remember, with frequent syntax errors and faulty logic. But it certainly didn’t rip out safety checks or find ways to create plausible but fake data, like GPT-5 in my example above.

    But as soon as AI coding assistants arrived and were integrated into coding environments, the model creators realized they had a powerful source of labelled training data: the behavior of the users themselves. If an assistant offered up suggested code, the code ran successfully, and the user accepted the code, that was a positive signal, a sign that the assistant had gotten it right. If the user rejected the code, or if the code failed to run, that was a negative signal, and when the model was retrained, the assistant would be steered in a different direction.

    This is a powerful idea, and no doubt contributed to the rapid improvement of AI coding assistants for a period of time. But as inexperienced coders started turning up in greater numbers, it also started to poison the training data. AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.

    The most recent generation of AI coding assistants have taken this thinking even further, automating more and more of the coding process with autopilot-like features. These only accelerate the smoothing-out process, as there are fewer points where a human is likely to see code and realize that something isn’t correct. Instead, the assistant is likely to keep iterating to try to get to a successful execution. In doing so, it is likely learning the wrong lessons.

    I am a huge believer in artificial intelligence, and I believe that AI coding assistants have a valuable role to play in accelerating development and democratizing the process of software creation. But chasing short-term gains, and relying on cheap, abundant, but ultimately poor-quality training data is going to continue resulting in model outcomes that are worse than useless. To start making models better again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label AI-generated code. Otherwise, the models will continue to produce garbage, be trained on that garbage, and thereby produce even more garbage, eating their own tails.

    From Your Site Articles

    Related Articles Around the Web



    Source link

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email Telegram Copy Link

    Related Posts

    Technology

    How This Former Roboticist’s Students Rebuilt ENIAC

    April 23, 2026
    Technology

    How AI Is Changing Cybersecurity

    April 23, 2026
    Technology

    Ham Radio Brings Teletext Back to Life

    April 22, 2026
    Technology

    Energy in Motion: Unlocking the Interconnected Grid of Tomorrow

    April 22, 2026
    Technology

    Tech Life – A hologram to remember: Pam and Bill’s love story

    April 21, 2026
    Technology

    Engineering Manager Vs IC: How to Choose With Clarity

    April 21, 2026
    Editors Picks

    How to spot the lunar X and V

    January 25, 2026

    Flu virus tracker: Latest update shows high influenza activity across most states. Watch out for these symptoms

    January 5, 2026

    Kelce expected to return, but Chiefs need another answer for passing attack

    March 9, 2026

    Triston Casas injury may solve major Red Sox problem

    May 3, 2025

    US ‘not concerned’ by reports Russia aiding Iran’s targeting

    March 7, 2026
    About Us
    About Us

    Welcome to Benjamin Franklin Institute, your premier destination for insightful, engaging, and diverse Political News and Opinions.

    The Benjamin Franklin Institute supports free speech, the U.S. Constitution and political candidates and organizations that promote and protect both of these important features of the American Experiment.

    We are passionate about delivering high-quality, accurate, and engaging content that resonates with our readers. Sign up for our text alerts and email newsletter to stay informed.

    Latest Posts

    Pentagon Requests $54 Billion For AI War

    April 24, 2026

    Clavicular Hit With New YouTube Crackdown

    April 24, 2026

    Beijing’s new supply chain rules deepen concerns for US firms in China

    April 24, 2026

    Subscribe for Updates

    Stay informed by signing up for our free news alerts.

    Paid for by the Benjamin Franklin Institute. Not authorized by any candidate or candidate’s committee.
    • Privacy Policy
    • About us
    • Contact us

    Type above and press Enter to search. Press Esc to cancel.