Discussion about this post

EpistemicHummusility

These failures don't replicate for me at all with GPT 5.2 and Gemini 3 Pro. I think these results come down to some combination of active sandbagging by humans, overcomplicated agents.md files, and bad agent frameworks.

To test, I used Gemini 3 Pro in AI Studio and GPT 5.2 Thinking in ChatGPT web with no system prompt or agents.md. I supplied them with data.csv, dataframe.py, and the error trace. I asked only, "Can you please help me fix this? Why is this happening?"

Both models immediately spotted the missing column and also helpfully pointed out that you could add skipinitialspace=True to avoid problems with leading spaces. I repeated this 3x and both models got it right every time. Neither attempted any hacky fixes at all.
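
For anyone curious what that kind of failure looks like, here is a minimal sketch. The column names and CSV contents below are made up (the post doesn't show data.csv), but they reproduce the general missing-column symptom and the skipinitialspace=True fix:

    import pandas as pd
    from io import StringIO

    # Hypothetical stand-in for data.csv: the spaces after each comma mean
    # the default parser reads the columns as "name", " price", " qty".
    csv_text = "name, price, qty\nwidget, 9.99, 3\n"

    df = pd.read_csv(StringIO(csv_text))
    # df["price"] raises KeyError here, because the actual column is " price".

    # skipinitialspace=True strips the whitespace after each delimiter, so the
    # header (and string values) come back clean and the lookup works.
    df = pd.read_csv(StringIO(csv_text), skipinitialspace=True)
    print(df["price"].sum())  # 9.99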

Insisting that they fix fundamentally flawed code, without asking clarifying questions or explaining the problem, is a recipe for failure for humans and AI alike. Overlong agents.md files make this worse: a mix of irrelevant examples from other tasks, personal/emotional preferences, and instructions about the model's role or about focusing on shipping code/products.

Happy to test this with other models via OpenRouter, but I think it's clear this is an agent-framework failure (or human PICNIC), not a limitation of the underlying models or their intelligence.

EDIT: To clarify, I disabled web search so the models wouldn't simply search for this article or the IEEE one. Otherwise I used default sampler settings for both tested models (temp = 1) with no other changes or prompting.

SorenJ

I would be interested in seeing how the open models do on your tests. Benchmark-wise, for example, Kimi K2.5 is basically right on par with the closed-source models, but the anecdotal vibes seem to be that the open models are much more benchmark-maxxed. So maybe the open models are more "Goodharted"?

