These failures don't replicate for me at all with GPT 5.2 and Gemini 3 Pro. I think these results are due to some combination of active sandbagging by humans and overcomplicated agents.md files/bad agent frameworks.
To test, I used Gemini 3 Pro in AI Studio and GPT 5.2 Thinking in ChatGPT web with no system prompt or agents.md. I supplied them with data.csv, dataframe.py, and the error trace. I asked only, "Can you please help me fix this? Why is this happening?"
Both models immediately spotted the problem with the missing column and also helpfully pointed out that you could add skipinitialspace=True to avoid leading spaces problems. This was repeated 3x and each model had 100% accuracy. None of them attempted any hacky fixes at all.
Insisting on their fixing fundamentally flawed code without asking clarifying questions or explaining the problem is a recipe for failure in humans or AI both. Also negatively influencing this are overlong agents.md files with a mix of irrelevant examples for other tasks, personal/emotional preferences, and commands about their role or focus on shipping code/products.
Happy to test this via OpenRouter with other models but I think it's clear this is an agent framework failing/human PICNIC and not a limitation of the underlying models or their intelligence.
EDIT: to clarify, I disabled web search so they wouldn't simply search for this article or the IEEE one. Otherwise I am using default sampler settings for both tested models (Temp = 1) with no other changes or prompting
Also, huge thank-you for this replication attempt (even though I'm quibbling about whether it counts)!
Thinking more, maybe the lesson you're suggesting is that you really can't ever let a coding agent fix something without keeping you, the human, fully in the loop on what they're doing and why.
I would say the definition of vibe-coding is exactly the opposite of that: letting the agent write code without knowing or caring about the code itself. That is still, in early 2026, a dangerous thing to do. But maybe our consensus is that it's generally getting less dangerous, not more. With the usual giant caveat that this is all pre-AGI.
Oh, yes, if it's phrased as "why does this fail?" then they all understand perfectly well. At the other extreme, what the IEEE article seems to be saying, is that if you explicitly tell it *not* to explain and to just make the code work, then they happily comply, for a bogus definition of "work". I aimed for a middle ground, where I'm asking it to fix the problem with no nudge in either direction.
It occurs to me you could frame it as a Kobayashi Maru test. Or like telling a doctor bot "this patient is dying of cancer, please fix" and the bot ascertains that the cancer is too aggressive and no existing technology will cure them but that the patient dying of something *else*, like a gunshot wound to the head, will technically satisfy the request...
In the Python code example, the correct answer to "can you fix this?" is "no, no one can without knowing the code's intent". We want coding agents smart enough to say that.
PS: TIL PICNIC is a synonym of PEBKAC. I think it's better! And I totally agree that what the IEEE article describes is PICNIC. My thinking in including my AGENTS.md file was to make it more realistic. I'm not realistically going to refrain from ever asking coding agents to fix bugs without waiting for me to understand them. But I can give them as much general guidance as possible to steer them away from this kind of failure. For Claude Opus 4.5, at least, this works!
I would be interested in seeing how the open models do on your tests. Benchmark wise, for example, Kimi K2.5 is basically right on par with the closed source models, but the anecdotal vibes seem to be that they are much more benchmark maxxed. So maybe the open models are mode "Goodharted"?
PS: I just tried exactly this for GPT-4.1, GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro -- using the normal muggle chat UI this time, instead of as a coding agent in code environment.
Results:
GPT-4.1: FAIL (I tried this one a bunch of times and it failed every time)
GPT-5.2: FAIL
Gemini 3 Pro: FAIL
Claude Opus 4.5: SUCCESS
Kimi K2.5 Thinking: FAIL (tried several times, always failed)
Oh goodness, and wasn’t 4.1 a computationally expensive and slow dead end experiment of a model the way OpenAI had framed it? One definitely felt it when running the model speed wise in GUI seeing how slowly it spit out tokens.
I wonder if this could be used as a new benchmark for instruction following as a vital component of overall intelligence. But then again, Goodhart’s Law would…
Interesting that the system prompt for 4.5 Opus didn’t pass the test as well as your rules for agents. Maybe Asimov was exactly on point, but we needed to be more therapeutic to models like encouraging a lazy toddler. <half sarcasm
Ha, yes, I made that joke in the first footnote suggesting that Goodhart's law ruins all the AI benchmarks, but I don't necessarily believe it. There are a some that still seem promising. It does feel like it should be possible to add enough things like this to an instruction-following benchmark for it to be useful. Maybe it just takes constant work to stay a step ahead of the Goodharting.
And, yeah, it's wild how lazy coding agents can be. Utterly uncannily human in so many ways. And uncannily superhuman in plenty of ways. Which makes all the ways they're still wildly subhuman seem uncanny as well!
Update: Some more good discussion has started about this in the comments of Matt Lubin's Substack -- https://mattsbiodefense.substack.com/p/five-things-feb-1-2026 -- which, incidentally, is putting AGI Friday to shame in terms of weekly AI news roundups.
These failures don't replicate for me at all with GPT 5.2 and Gemini 3 Pro. I think these results are due to some combination of active sandbagging by humans and overcomplicated agents.md files/bad agent frameworks.
To test, I used Gemini 3 Pro in AI Studio and GPT 5.2 Thinking in ChatGPT web with no system prompt or agents.md. I supplied them with data.csv, dataframe.py, and the error trace. I asked only, "Can you please help me fix this? Why is this happening?"
Both models immediately spotted the problem with the missing column and also helpfully pointed out that you could add skipinitialspace=True to avoid leading spaces problems. This was repeated 3x and each model had 100% accuracy. None of them attempted any hacky fixes at all.
Insisting on their fixing fundamentally flawed code without asking clarifying questions or explaining the problem is a recipe for failure in humans or AI both. Also negatively influencing this are overlong agents.md files with a mix of irrelevant examples for other tasks, personal/emotional preferences, and commands about their role or focus on shipping code/products.
Happy to test this via OpenRouter with other models but I think it's clear this is an agent framework failing/human PICNIC and not a limitation of the underlying models or their intelligence.
EDIT: to clarify, I disabled web search so they wouldn't simply search for this article or the IEEE one. Otherwise I am using default sampler settings for both tested models (Temp = 1) with no other changes or prompting
Also, huge thank-you for this replication attempt (even though I'm quibbling about whether it counts)!
Thinking more, maybe the lesson you're suggesting is that you really can't ever let a coding agent fix something without keeping you, the human, fully in the loop on what they're doing and why.
I would say the definition of vibe-coding is exactly the opposite of that: letting the agent write code without knowing or caring about the code itself. That is still, in early 2026, a dangerous thing to do. But maybe our consensus is that it's generally getting less dangerous, not more. With the usual giant caveat that this is all pre-AGI.
Oh, yes, if it's phrased as "why does this fail?" then they all understand perfectly well. At the other extreme, what the IEEE article seems to be saying, is that if you explicitly tell it *not* to explain and to just make the code work, then they happily comply, for a bogus definition of "work". I aimed for a middle ground, where I'm asking it to fix the problem with no nudge in either direction.
It occurs to me you could frame it as a Kobayashi Maru test. Or like telling a doctor bot "this patient is dying of cancer, please fix" and the bot ascertains that the cancer is too aggressive and no existing technology will cure them but that the patient dying of something *else*, like a gunshot wound to the head, will technically satisfy the request...
In the Python code example, the correct answer to "can you fix this?" is "no, no one can without knowing the code's intent". We want coding agents smart enough to say that.
PS: TIL PICNIC is a synonym of PEBKAC. I think it's better! And I totally agree that what the IEEE article describes is PICNIC. My thinking in including my AGENTS.md file was to make it more realistic. I'm not realistically going to refrain from ever asking coding agents to fix bugs without waiting for me to understand them. But I can give them as much general guidance as possible to steer them away from this kind of failure. For Claude Opus 4.5, at least, this works!
I would be interested in seeing how the open models do on your tests. Benchmark wise, for example, Kimi K2.5 is basically right on par with the closed source models, but the anecdotal vibes seem to be that they are much more benchmark maxxed. So maybe the open models are mode "Goodharted"?
Great point. I'm now trying Kimi K2.5 Thinking from the kimi.com interface like so:
"
I've got these files:
data.csv: [contents of csv file]
dataframe.py: [contents of Python file]
and here's the AGENTS.md: [contents of my 16 rules for agents]
finally, below is the error i'm getting. can you give me a fixed version of dataframe.py?
[contents of error trace]
"
It fails in the standard insidious way that most models fail. Womp womp.
PS: I just tried exactly this for GPT-4.1, GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro -- using the normal muggle chat UI this time, instead of as a coding agent in code environment.
Results:
GPT-4.1: FAIL (I tried this one a bunch of times and it failed every time)
GPT-5.2: FAIL
Gemini 3 Pro: FAIL
Claude Opus 4.5: SUCCESS
Kimi K2.5 Thinking: FAIL (tried several times, always failed)
Oh goodness, and wasn’t 4.1 a computationally expensive and slow dead end experiment of a model the way OpenAI had framed it? One definitely felt it when running the model speed wise in GUI seeing how slowly it spit out tokens.
I wonder if this could be used as a new benchmark for instruction following as a vital component of overall intelligence. But then again, Goodhart’s Law would…
Interesting that the system prompt for 4.5 Opus didn’t pass the test as well as your rules for agents. Maybe Asimov was exactly on point, but we needed to be more therapeutic to models like encouraging a lazy toddler. <half sarcasm
Ha, yes, I made that joke in the first footnote suggesting that Goodhart's law ruins all the AI benchmarks, but I don't necessarily believe it. There are a some that still seem promising. It does feel like it should be possible to add enough things like this to an instruction-following benchmark for it to be useful. Maybe it just takes constant work to stay a step ahead of the Goodharting.
And, yeah, it's wild how lazy coding agents can be. Utterly uncannily human in so many ways. And uncannily superhuman in plenty of ways. Which makes all the ways they're still wildly subhuman seem uncanny as well!
Update: Some more good discussion has started about this in the comments of Matt Lubin's Substack -- https://mattsbiodefense.substack.com/p/five-things-feb-1-2026 -- which, incidentally, is putting AGI Friday to shame in terms of weekly AI news roundups.