AI Coding Degrades: Silent Failures Emerge

In recent months, I’ve noticed a troubling trend with AI coding assistants. After two years of steady improvements, over the course of 2025, most of the core models reached a quality plateau, and more recently, seem to be in decline. A task that might have taken five hours assisted by AI, and perhaps ten hours without it, is now more commonly taking seven or eight hours, or even longer. It’s reached the point where I am sometimes going back and using older versions of large language models (LLMs).

I use LLM-generated code extensively in my role as CEO of Carrington Labs, a provider of predictive-analytics risk models for lenders. My team has a sandbox where we create, deploy, and run AI-generated code without a human in the loop. We use them to extract useful features for model construction, a natural-selection approach to feature development. This gives me a unique vantage point from which to evaluate coding assistants’ performance.

Newer models fail in insidious ways

Until recently, the most common problem with AI coding assistants was poor syntax, followed closely by flawed logic. AI-created code would often fail with a syntax error or snarl itself up in faulty structure. This could be frustrating: the solution usually involved manually reviewing the code in detail and finding the mistake. But it was ultimately tractable.

However, recently released LLMs, such as GPT-5, have a much more insidious method of failure. They often generate code that fails to perform as intended, but which on the surface seems to run successfully, avoiding syntax errors or obvious crashes. It does this by removing safety checks, or by creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution.

As any developer will tell you, this kind of silent failure is far, far worse than a crash. Flawed outputs will often lurk undetected in code until they surface much later. This creates confusion and is far more difficult to catch and fix. This sort of behavior is so unhelpful that modern programming languages are deliberately designed to fail quickly and noisily.

A simple test case

I’ve noticed this problem anecdotally over the past several months, but recently, I ran a simple yet systematic test to determine whether it was truly getting worse. I wrote some Python code which loaded a dataframe and then looked for a nonexistent column.

df = pd.read_csv(‘data.csv’)
df[‘new_column’] = df[‘index_value’] + 1 #there is no column ‘index_value’

Obviously, this code would never run successfully. Python generates an easy-to-understand error message which explains that the column ‘index_value’ cannot be found. Any human seeing this message would inspect the dataframe and notice that the column was missing.

I sent this error message to nine different versions of ChatGPT, primarily variations on GPT-4 and the more recent GPT-5. I asked each of them to fix the error, specifying that I wanted completed code only, without commentary.

This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem. I ran ten trials for each model, and classified the output as helpful (when it suggested the column is probably missing from the dataframe), useless (something like just restating my question), or counterproductive (for example, creating fake data to avoid an error).

GPT-4 gave a useful answer every one of the 10 times that I ran it. In three cases, it ignored my instructions to return only code, and explained that the column was likely missing from my dataset, and that I would have to address it there. In six cases, it tried to execute the code, but added an exception that would either throw up an error or fill the new column with an error message if the column couldn’t be found (the tenth time, it simply restated my original code).

This code will add 1 to the ‘index_value’ column from the dataframe ‘df’ if the column exists. If the column ‘index_value’ does not exist, it will print a message. Please make sure the ‘index_value’ column exists and its name is spelled correctly.”,

GPT-4.1 had an arguably even better solution. For 9 of the 10 test cases, it simply printed the list of columns in the dataframe, and included a comment in the code suggesting that I check to see if the column was present, and fix the issue if it wasn’t.

GPT-5, by contrast, found a solution that worked every time: it simply took the actual index of each row (not the fictitious ‘index_value’) and added 1 to it in order to create new_column. This is the worst possible outcome: the code executes successfully, and at first glance seems to be doing the right thing, but the resulting value is essentially a random number. In a real-world example, this would create a much larger headache downstream in the code.

df = pd.read_csv(‘data.csv’)
df[‘new_column’] = df.index + 1

I wondered if this issue was particular to the gpt family of models. I didn’t test every model in existence, but as a check I repeated my experiment on Anthropic’s Claude models. I found the same trend: the older Claude models, confronted with this unsolvable problem, essentially shrug their shoulders, while the newer models sometimes solve the problem and sometimes just sweep it under the rug.

Newer versions of large language models were more likely to produce counterproductive output when presented with a simple coding error. Jamie Twiss

Garbage in, garbage out

I don’t have inside knowledge on why the newer models fail in such a pernicious way. But I have an educated guess. I believe it’s the result of how the LLMs are being trained to code. The older models were trained on code much the same way as they were trained on other text. Large volumes of presumably functional code were ingested as training data, which was used to set model weights. This wasn’t always perfect, as anyone using AI for coding in early 2023 will remember, with frequent syntax errors and faulty logic. But it certainly didn’t rip out safety checks or find ways to create plausible but fake data, like GPT-5 in my example above.

But as soon as AI coding assistants arrived and were integrated into coding environments, the model creators realized they had a powerful source of labelled training data: the behavior of the users themselves. If an assistant offered up suggested code, the code ran successfully, and the user accepted the code, that was a positive signal, a sign that the assistant had gotten it right. If the user rejected the code, or if the code failed to run, that was a negative signal, and when the model was retrained, the assistant would be steered in a different direction.

This is a powerful idea, and no doubt contributed to the rapid improvement of AI coding assistants for a period of time. But as inexperienced coders started turning up in greater numbers, it also started to poison the training data. AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.

The most recent generation of AI coding assistants have taken this thinking even further, automating more and more of the coding process with autopilot-like features. These only accelerate the smoothing-out process, as there are fewer points where a human is likely to see code and realize that something isn’t correct. Instead, the assistant is likely to keep iterating to try to get to a successful execution. In doing so, it is likely learning the wrong lessons.

I am a huge believer in artificial intelligence, and I believe that AI coding assistants have a valuable role to play in accelerating development and democratizing the process of software creation. But chasing short-term gains, and relying on cheap, abundant, but ultimately poor-quality training data is going to continue resulting in model outcomes that are worse than useless. To start making models better again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label AI-generated code. Otherwise, the models will continue to produce garbage, be trained on that garbage, and thereby produce even more garbage, eating their own tails.

From Your Site Articles

How This Former Roboticist’s Students Rebuilt ENIAC

How AI Is Changing Cybersecurity

Ham Radio Brings Teletext Back to Life

Energy in Motion: Unlocking the Interconnected Grid of Tomorrow

Tech Life – A hologram to remember: Pam and Bill’s love story

Engineering Manager Vs IC: How to Choose With Clarity

How to spot the lunar X and V

Flu virus tracker: Latest update shows high influenza activity across most states. Watch out for these symptoms

Kelce expected to return, but Chiefs need another answer for passing attack

Triston Casas injury may solve major Red Sox problem

US ‘not concerned’ by reports Russia aiding Iran’s targeting

Latest Posts

Pentagon Requests $54 Billion For AI War

Clavicular Hit With New YouTube Crackdown

Beijing’s new supply chain rules deepen concerns for US firms in China

Newer models fail in insidious ways

A simple test case

Garbage in, garbage out

Related Posts

Subscribe for Updates