Vibe coding is all the rage right now — give a short blurb of specs and requirements, and the AI just builds the application.

Unfortunately, LLMs writing code are really bad at certain things. And since LLMs are an exponential development accelerator, issues that are minor nuisances in human coding become absolutely infuriating. What I mean is this: we’ve been pumping out packages, projects, and applications at blazing speeds. Things that used to take a team of 5-10 developers multiple months can now often be accomplished by a solo developer in a week. If the painful jab of a failure comes up in 10% of those tasks, that 10% used to be spread out over multiple months and multiple developers. But when a solo developer is handling all of those errors in a single week, those jabs turn into a jackhammer.

Here are some examples that have come up frequently, regardless of which model we’ve tried; we’ve been using GPT-4, GPT-5, Claude Sonnet, Gemini 2.5, and Grok.

Code Reuse is Terrible

This happens all over the place, but CSS and React components are the most prominent examples. Even when instructed to re-use code maximally, and even when told specifically to use semantic styling and components (e.g. titles, cards, buttons), LLMs will consistently generate a new class or ID for every single element, usually with significant inconsistency between fonts, sizes, padding, colors, etc. Same deal with React components: LLMs regularly redefine duplicate components that already exist rather than re-using the existing ones.
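
As a concrete illustration of the failure mode, here is a minimal sketch with made-up component names (not output from a real codebase): the LLM hand-rolls a new, slightly different component for each card instead of re-using the one that already exists.

```tsx
import React from "react";

// What the LLM tends to produce: a fresh component with its own styling for
// every card-like element, even though an equivalent Card already exists.
function PricingCard({ title }: { title: string }) {
  return <div style={{ padding: 16, borderRadius: 8, fontSize: 15 }}>{title}</div>;
}

function FeatureCard({ heading }: { heading: string }) {
  // Nearly identical, but the padding, radius, and font subtly differ,
  // so the two drift apart over time.
  return <div style={{ padding: 14, borderRadius: 6, fontSize: 16 }}>{heading}</div>;
}

// What we actually want: one semantic, reusable component.
export function Card({ title }: { title: string }) {
  return <div className="card">{title}</div>;
}
```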

Prompt engineering consistently fails to address this even when adding background instructions to the agent. Telling it to re-use classes and components typically doesn’t fix the problem in any major model we’ve tested.

Why it matters:

  • This creates larger and larger codebases of unnecessary fluff code. Many of these have so much bloat that even a small application ends up huge. People get infuriated when an application takes even a tiny bit of extra time to load, and they just leave.
  • Across-the-board changes become impossible. For example, changing the styling of “all titles” means every single title has to be changed individually. The AI couldn’t figure out the first time around that they were all titles… it doesn’t magically figure it out now, when there are hundreds or thousands of title elements, each with a near-same-but-definitely-different CSS definition.

Common suggestions:

  • Tighter specs. Sometimes you can get cleaner output by explicitly defining what to use in the prompt, e.g. “This should be a card component with the full-width card class. Do not create additional styles for this without explicit permission.”
  • Explicit CSS or Tailwind. I suspect this problem is caused by the limited context the LLM sees, as using explicitly defined CSS in the HTML, or a non-semantic shorthand framework like Tailwind, tends to reduce the failure rate. Tailwind is non-semantic in that you’re not defining a title class and the styling for that class… you’re using shorthand class names that effectively hard-code the CSS. A sketch of the difference follows this list.
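
To make that second suggestion concrete, here is a minimal sketch (the class names are illustrative): a semantic title class is defined once in a stylesheet and referenced by name, while Tailwind-style utilities spell the styling out directly on the element.

```tsx
import React from "react";

// Semantic styling: the class name carries meaning, and the CSS for
// ".card-title" lives in a stylesheet the LLM may never see in context.
const SemanticTitle = ({ text }: { text: string }) => (
  <h2 className="card-title">{text}</h2>
);

// Tailwind-style shorthand: non-semantic utility classes that effectively
// hard-code the styling on the element itself, so there is no separate
// class definition to forget, duplicate, or contradict.
const UtilityTitle = ({ text }: { text: string }) => (
  <h2 className="text-xl font-semibold text-gray-900 mb-2">{text}</h2>
);
```

If the limited-context hypothesis is right, this helps because everything the LLM needs to keep titles consistent is visible right in the markup, rather than in a stylesheet it may not have loaded into context.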

Ultimately, these suggestions, plus going into the code and fixing it by hand, are really the only options right now. Yes, manual fixes are time-consuming, and everything would have been better if the LLM just hadn’t generated the junk code in the first place.

Domain Knowledge is Missing

Most domains have pockets of knowledge that never made it into the public documentation these LLMs were trained on. The result is that the LLM completely misses that context, creating sub-optimal or outright harmful code.

Why it matters:

  • We’ve seen this in email delivery, social automation, and ad tech. If a human who knew how these systems actually work hadn’t been there to handle it, the outcomes would have been disastrous: account bans, IP blocks, ad traffic flagged as fraud.

Least terrible solutions:

  • Unfortunately, the only solution we’ve had here is to make sure this domain knowledge gets into the spec.

Stochastic Understanding

In computer science, the set of “all possible outcomes” of something is either deterministic or stochastic. Deterministic, in this context, means that all possible outcomes are knowable before the code ever runs. Stochastic, in this context, means there are effectively infinite clusters of possible outcomes. I say “in this context” because what matters for an LLM is whether that state space is already something it learned during training; if it isn’t, the state space must be defined by the user AND fit into the LLM’s context window. Otherwise, LLMs tend to be dumpster fires, unable to handle the situation and just making up garbage answers.

Why it matters:

  • Canary testing, and devops more broadly, is one of the most common examples of a stochastic state space. These techniques are the gold standard in large-scale development and are used extensively by every major tech company. In canary testing, you roll out a change to a very small percentage of users, log practically everything, and then analyze (a) whether any errors came up, and (b) whether any expected outcomes became more or less common. But without context about what to look for and what has meaning, the LLM doesn’t magically know. For example, the rate of login failures is of critical importance, but to an LLM, that’s no more or less meaningful than the arithmetic mean count of the letter ‘z’ in usernames. This overlaps with the prior problem, in that some of these gaps are caused by a lack of written domain knowledge. A sketch of the kind of check you have to spell out appears below this list.
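
For instance, here is a minimal sketch of such a check (the metric names and the 20% threshold are hypothetical, not from a real pipeline). The whole point is that someone has to tell the LLM that login failure rate is the signal that matters and how much movement is worth flagging.

```ts
// Compare the login failure rate of the canary cohort against the baseline
// cohort and flag the rollout if it regresses by more than the allowed amount.
interface CohortStats {
  logins: number;
  loginFailures: number;
}

function loginFailureRate(stats: CohortStats): number {
  return stats.logins === 0 ? 0 : stats.loginFailures / stats.logins;
}

function canaryLooksBad(
  baseline: CohortStats,
  canary: CohortStats,
  maxRelativeIncrease = 0.2 // flag a >20% relative jump in failures
): boolean {
  const base = loginFailureRate(baseline);
  const cand = loginFailureRate(canary);
  if (base === 0) return cand > 0; // any failures at all are new
  return (cand - base) / base > maxRelativeIncrease;
}

// Example: a 1% baseline failure rate vs 1.5% in the canary gets flagged.
console.log(
  canaryLooksBad(
    { logins: 100_000, loginFailures: 1_000 },
    { logins: 5_000, loginFailures: 75 }
  )
); // true
```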

Overall

This isn’t to say that LLM-generated code is all terrible. We’ve been pumping out packages and applications at blazing speeds: things that used to take a team of 5-10 developers multiple months can now be done by a solo developer in a week. Most of the code-monkey work is just gone. But there are definitely some major pitfalls we’ve found that can be quite infuriating, and it’s certainly not where it needs to be for non-developers to make production-ready packages and applications.