Three Months of Slop
During the development of Hummingbird 0.2, I made a personal goal to attempt to use AI more.
I know what you're thinking; that's a weird goal, why?
This is a practical and measured utilitarian view on agentic AI. Obviously, I have ethical objections to the current way that many AI companies operate, especially OpenAI. Positive experiences with a given product do not indicate that I support the business practices of the company responsible for the product or think that you should give them your money.
I regularly assess the state of LLMs. As someone whose livelihood is heavily and directly affected by the capabilities, performance, and reliability of these models, it would be irresponsible for me not to do so.
That being said, I agree with many of the concerns shared by my peers – this just isn't a blog post about those concerns.
In November I knew no one who was using “coding agents” (like Claude Code and the Gemini CLI) regularly. In December, it seemed, genuinely, like the complete opposite. Companies that had previously been cautiously optimistic about LLM usage were now all in on it, and people that I had worked with for years and have a great deal of respect for were suddenly talking about how they “hate to come across as an AI glazer, but it's genuinely useful now”.
I also hate to say it, but it's hard to disagree.
The turning point was really, according to everyone I've talked to, the release of Anthropic's Opus 4.5 model in late November, followed closely by OpenAI's GPT-5.2-Codex. These two models both far surpass their prior iterations at programming tasks, despite not performing all that much better on the benchmarks that are supposed to indicate that progress. They've been lauded for their ability to spit out clean, “well designed” landing pages for applications that don't exist, generate poorly-designed but structurally reasonable Minecraft houses, and, most importantly, their ability to allow you to “never write a line of code again”.
The reasons for this are 'mysterious' only if you simply do not read any of the output of the model. The absolutely miraculous innovation allowing for unparalleled productivity and never-before-seen task completion prowess? Reading the damn code it's supposed to work with. That's the trillion dollar innovation.
That's right! Despite the fact that these models (and their harnesses) are designed by people with doctorates in computer science and applied mathematics, somehow no one thought that the model should read the damn documentation of the code it's supposed to be working with. This makes sense, because everyone knows that no one should ever read the documentation, and we all just write it because it's a fun and enjoyable process that is just so much better than writing code.
Regardless, what's done is done. The more interesting question is: now that it's here, is it really all it's cracked up to be?
No, But (alternatively, Yes, But)
Look, it is useful. There are times where I've asked it to do something, and it very nearly does it the way that I want it to be done. Of course, getting it to do the rest is an exercise in futility bordering on catastrophically infuriating, and so I find that usually I wind up rewriting about half the code it spits out. This is the case about 60% of the time.
Ironically, the other 40% of the time is the simple questions.
“Can you fix the menu re-opening when the button to open/close it is clicked? This happens because the handler for clicking out of the menu fires before the event for clicking the menu button runs, so the menu is closed and then re-opened.”
This is a simple problem that I would expect anyone to be able to solve easily. The real solution was to simply change the button handler from “click” to “mouse down” (the same as the click out handler) and then just stop the event from propagating downwards.
This is a fairly standard fix, the bug was clearly described, and the cause of the bug is also described in plain English. I would expect anyone with even a limited amount of web-style application development experience to be able to fix the bug. The actual fix itself was less than 60 characters before being auto-formatted.
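To make the ordering problem concrete, here is a minimal sketch of it in TypeScript. It is a toy model and not Hummingbird's actual code: every name in it (MenuModel, onButton, onOuter, pressButton, buggyMenu, fixedMenu) is made up for illustration, and it assumes only that a button press produces a mouse down event followed by a click event, and that the click-out listener sits further along the propagation path than the button.

```typescript
// Hypothetical model of the menu bug; none of these names come from Hummingbird.
type EventName = "mousedown" | "click";
type Target = "button" | "outside";

interface UiEvent {
  name: EventName;
  target: Target;
  stopped: boolean; // a handler sets this to stop propagation
}

type Handler = (e: UiEvent) => void;

class MenuModel {
  open = false;
  // Handlers attached to the menu button itself.
  private buttonHandlers: Partial<Record<EventName, Handler>> = {};
  // Handlers further along the propagation path (the "click out" listener).
  private outerHandlers: Partial<Record<EventName, Handler>> = {};

  onButton(name: EventName, h: Handler): void {
    this.buttonHandlers[name] = h;
  }

  onOuter(name: EventName, h: Handler): void {
    this.outerHandlers[name] = h;
  }

  // The button's handler runs first; the outer handler runs unless stopped.
  dispatch(name: EventName, target: Target): void {
    const e: UiEvent = { name, target, stopped: false };
    if (target === "button") this.buttonHandlers[name]?.(e);
    if (!e.stopped) this.outerHandlers[name]?.(e);
  }

  // One physical press on the button: mouse down first, then click.
  pressButton(): void {
    this.dispatch("mousedown", "button");
    this.dispatch("click", "button");
  }
}

// Buggy wiring: click-out closes on mouse down, the button toggles on click.
function buggyMenu(): MenuModel {
  const m = new MenuModel();
  m.onOuter("mousedown", () => { m.open = false; });
  m.onButton("click", () => { m.open = !m.open; });
  return m;
}

// Fixed wiring: the button toggles on mouse down and stops propagation,
// so the click-out handler never sees the same press.
function fixedMenu(): MenuModel {
  const m = new MenuModel();
  m.onOuter("mousedown", () => { m.open = false; });
  m.onButton("mousedown", (e) => {
    m.open = !m.open;
    e.stopped = true;
  });
  return m;
}
```

In this model, pressing the button on an open buggyMenu() leaves it open (closed on mouse down, re-opened on click), while the same press on an open fixedMenu() closes it, and a mouse down outside still closes it as before.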
I gave this prompt (with file links) to Opus 4.6, Sonnet 4.6, GPT-5.3-Codex, and Gemini 3.1 Pro, and 0/4 of them were able to fix it. In fact, 4/4 of them caused the same new bug, repeatedly, which prevented the menu from being closed at all. 4/4 of them understood that, yes, the issue was event propagation, and they were all able to explain why it didn't work, but not a single one of them could fix it.
On the flip side, I asked Opus 4.5 to implement queue drag and drop in Hummingbird. It was able to do this in 2 prompts (a plan prompt and an implementation prompt), with the exact UX I had in mind, and the code worked perfectly first try. Of course, the code quality wasn't perfect – again, I wound up rewriting much of it to better follow GPUI best practices and integrate better with Hummingbird's theme system, but it worked and, technically speaking, I could have left it as is without much issue.
This dichotomy is extremely exhausting in practice. It's made me question my sanity at least 10 times in the last 2 months, and it makes the whole “you should unquestionably adopt AI because it can do X” narrative much harder to believe in when it cannot move a single HTML element up one pixel (a one line change).
Additionally, and anecdotally, I've been reviewing a lot of code recently. Someone's agent made the mistake of making a 1500 line change to tailwind.css, adding hundreds of unused and nonsensical styles. Still wondering how that even happens.
The Gas Guzzling Giant In Your Backyard
The elephant in the room every time that any AI thing is brought up is the resource consumption. It's a fair point, if a little overblown.
Most of the data center companies have been working towards improving the sustainability of their data centers. Not for nothing, I'm sure it wasn't cheap to use 14 septillion gallons of water an hour (or whatever it was). In my conversations with the people who work on these projects, the prevailing sentiment is that they're painfully aware of the resource consumption of all of this. High on the priority list of any data center is minimizing power and cooling costs and resource consumption, because not only are they bad for the environment, they're also pretty much 90% of the cost of operating a data center.
A lot of new construction is being built in caves, or otherwise underground, designed with thermal systems that don't use water, and with plans to build power grid improvements to handle the new load. The first two have seemingly panned out well, from what I've heard, though the last point has been somewhat problematic.
All that being said, I'm optimistic about where the industry's ongoing resource consumption is headed. I'm much less optimistic about the constant construction and manufacturing demands of companies trying to rapidly scale hardware capacity – but I don't know much more about it than anyone else would, so I don't have any comments on that front.
A Conclusion
Most of the AI-generated code that has landed in Hummingbird has been refactored, redesigned, or just straight up tossed by this point. Hummingbird 0.2 is still completely my vision for the update, and almost none of the UX/UI design output by any model was kept for the final release. Some of the code, probably about half, will make it into the final release in some form or another, but it has all been carefully examined repeatedly to ensure its correctness.
I'm sure some of you are saddened by or mad at my use of LLMs in the first place. I understand this, and I appreciate your concerns. If you feel that, given the circumstances, this is unacceptable, please feel free to make your concerns known.
I am still not bullish on LLMs in the way that Anthropic & Co would like everyone to be. I do think they are useful, and I do plan to continue to use them in some capacity (though less than I have over the last 3 months), but to say that we are trending towards software developers being unnecessary in their entirety is patently absurd. Modern LLMs may have better quality output, but they still share many limitations with the LLMs of 3 years ago – and on those fronts, they've not gotten any better.
Thanks
Thanks to all of my patrons on Patreon.
In no particular order:
– Ro
– Claire Sorrel
– Mikayla Maki
– Naomi Hikaru
– aloraxic
Thank you for supporting me and my work!