Why LLMs Will Always Be Terrible at Software Architecture

AI can write code, but it still fails where architecture begins: trade-offs, boundaries, failure modes, and long-term responsibility. This is why software architects are not going away.

Ivan Borshchov
CEO & System Architect @ Devforth
May 25, 2026

Introduction

The fear is not irrational anymore. By May 26, 2026, the TrueUp tech layoffs tracker counted 343 layoff events affecting 144,205 people in tech in 2026 alone. Reuters also reported that AI-linked layoffs had already exceeded 61,000 globally since November 2025, that AI accounted for 7% of U.S. planned layoffs announced in January 2026, and that Goldman Sachs economists estimated AI caused 5,000–10,000 monthly net job losses last year in the most exposed U.S. industries. This is not “future disruption.” It is payroll. It is restructuring. It is people getting cut while executives talk about “efficiency.” 

The headline examples are blunt. Coinbase cut about 700 jobs, roughly 14% of its workforce, while explicitly repositioning for “the AI era”; Reuters reported that CEO Brian Armstrong said new AI tools were enabling smaller, more focused teams and letting even non-technical teams ship code. Cloudflare cut more than 1,100 employees, about 20% of staff, while its founders said internal AI usage had increased by more than 600% in three months and published a company memo about redesigning the business for the “agentic AI era.” Block cut more than 4,000 jobs, nearly half the company, with Jack Dorsey saying a significantly smaller team using intelligence tools could do more and do it better. 

The mood around this is already visible in engineering communities. Reddit is full of engineers arguing that AI is often used where simpler engineering would do, and The Pragmatic Engineer Pulse from May 7 highlighted “small AI-forward teams,” Meta assigning 20–40% of some engineers to data-labeling work ahead of layoffs, and the broader anxiety that “excess” engineers may simply become expendable. The pain is real. The panic is understandable. And if your value proposition to the market is still “I convert tickets into code,” you are standing in the blast radius. 

That is exactly why the answer is not to out-prompt the machine. The answer is to move up the stack—into the layer the machine consistently fails to own: architecture.

Were you replaced by AI? Time to become a good architect

Here is the harsh truth: writing code is often a local task. Architecture is a global one. Code can be evaluated function by function, file by file, issue by issue. Architecture cannot. Architecture is the act of choosing system boundaries, coupling directions, failure domains, data ownership, migration paths, security posture, observability strategy, operational cost shape, and which trade-offs the company will live with for years. The recent architecture literature is explicit on this point: software architecture sits between requirements and implementation, and architecture views exist to address stakeholder concerns and abstractions that code alone does not capture. 

This difference matters because today’s LLMs are strongest where the target is well-specified, short-horizon, and algorithmically scorable. Even the benchmarks labs themselves trust are heavily tilted toward that world. METR’s current framing is brutally important here: the “time horizon” of frontier models measures performance on well-specified, low-context tasks, closer to what a new hire or freelance contractor could do without prior project context, not what a high-context professional does inside a living system. METR explicitly warns that most real jobs are messier, involve people, tacit knowledge, and non-algorithmic success criteria, and that AI performance drops when work is scored more holistically rather than by clean automatic checks. 

And architecture is almost the purest example of that messy territory. It is not just a matter of “thinking harder” or spending the whole budget on more reasoning tokens. Research on long-horizon decision making is now drawing a clear line between reasoning and planning. The paper Why Reasoning Fails to Plan argues that standard LLM reasoning behaves like a greedy local policy and explicitly concludes that reasoning is not planning. UltraHorizon shows state-of-the-art agents still underperform humans on long-horizon, partially observable tasks, with failures tied to context-locking and fundamental capability gaps. YCBench, a long-horizon benchmark where agents run a simulated startup under uncertainty, found that only 3 of 12 models consistently beat the starting capital and that 47% of bankruptcies came from failing to detect adversarial clients. That is spectacularly relevant to architecture, because architecture is exactly long-horizon planning under ambiguity, delayed feedback, incomplete state, and compounding cost of early mistakes.

So yes: if AI is crushing your CRUD tickets, your safest move is not denial. It is specialization. Become the person who can decide what should exist, how it should be partitioned, how it should fail, what it will cost to change, and which quality attributes the business is willing to pay for. That is the layer latest-generation models will never own.

Why LLMs lose the moment architecture starts

The first uncomfortable fact is that even the newest public frontier models are still far from reliable on coding tasks that are much narrower than architecture. OpenAI’s own launch materials say GPT-5.4 reached 57.7% on SWE-Bench Pro (Public) and 75.1% on Terminal-Bench 2.0. GPT-5.5 nudged that to 58.6% on SWE-Bench Pro and 82.7% on Terminal-Bench 2.0. Those are not replacement numbers; they still leave failure rates of 41.4% and 17.3% on constrained evaluation setups that are much cleaner than architecture work. Worse, OpenAI now openly says SWE-bench Verified no longer measures frontier coding capability well because of growing contamination and benchmark distortion. In other words, even the scoreboard used to market coding intelligence is wobbling under the labs’ own feet.

Once you move from code generation to architecture generation, the numbers get uglier. The April 2026 benchmark paper Benchmarking Requirement-to-Architecture Generation with Hybrid Evaluation tested GPT-5, Claude Sonnet 4.6, Gemini 2.5 Pro, and agentic workflows on real-world PRD-to-architecture generation. The paper’s summary is devastating: LLMs were good at syntactic validity and entity extraction, but they fundamentally struggled with relational reasoning, producing structurally fragmented architectures. Quantitatively, GPT-5’s direct architecture generation got Node F1 = 0.6699 but only Edge F1 = 0.1491; Claude Sonnet 4.6 was worse on relationships at Edge F1 = 0.0855. The authors’ conclusion is exactly the one architects should care about: the models can list boxes, but they cannot reliably infer the right structure between them. And architecture without reliable relationships is not architecture. It is decoration.
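To make the gap between Node F1 and Edge F1 concrete, here is a minimal sketch. It assumes both scores are plain set-based F1 over components and over directed connections; the paper's exact matching rules may differ. The service names and both graphs are invented for illustration and do not come from the benchmark.

```python
def f1(predicted: set, reference: set) -> float:
    """Set-based F1: harmonic mean of precision and recall."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference architecture (illustrative names only).
reference_nodes = {"api_gateway", "orders", "payments", "inventory", "notifications"}
reference_edges = {
    ("api_gateway", "orders"),
    ("orders", "payments"),
    ("orders", "inventory"),
    ("payments", "notifications"),
}

# Hypothetical generated architecture: most boxes are right, most wiring is wrong.
generated_nodes = {"api_gateway", "orders", "payments", "inventory", "audit_log"}
generated_edges = {
    ("api_gateway", "orders"),
    ("api_gateway", "payments"),
    ("api_gateway", "inventory"),
    ("inventory", "audit_log"),
}

node_f1 = f1(generated_nodes, reference_nodes)  # 4 of 5 components match
edge_f1 = f1(generated_edges, reference_edges)  # only 1 of 4 connections matches

print(f"Node F1 = {node_f1:.2f}, Edge F1 = {edge_f1:.2f}")
```

In this toy case the generated design names four of the five right components yet gets only one connection right, which is the same shape of failure the paper reports: plausible boxes, wrong structure.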

The same paper gets even more useful when the input gets harsher. When the researchers removed architectural detail from PRDs, entity extraction stayed relatively resilient, but the topology fell apart: they write that information sparsity has negligible negative impact on entity extraction yet causes severe fragmentation of the architectural topology, and that even agent frameworks do not resolve the core structural deficits. This is exactly what many practitioners already feel intuitively: LLMs are good at naming things that look plausible and terrible at building the durable web of constraints that actually makes a system survivable. 

A second 2026 paper, LLM-based Automated Architecture View Generation: Where Are We Now?, is even more brutal because it evaluates generation from source code across 340 open-source repositories, 13 configurations, and 4,137 generated views. The authors conclude that LLMs and agents can generate syntactically valid views, but they consistently operate at code-level granularity instead of architectural abstraction, leaving a real need for human experts. The worst result in the entire study came from a general-purpose coding agent, Claude Code, which produced 71.8% clarity failures, 82.8% completeness failures, 90.2% consistency failures, and, in human evaluation, 0.0% accuracy and 0.0% level-of-detail success. That is not “not perfect.” That is catastrophic. And even the custom architecture-specific agent that beat everything else still achieved only 50% level-of-detail success in human evaluation. In other words: the best specialized workaround is still half-wrong on one of the core things architects are paid for.

If someone says, “Fine, maybe diagrams are hard, but surely the model can at least reason through architecture decisions,” the evidence still bites back. The paper Using LLMs in Generating Design Rationale for Software Architecture Decisions built a dataset of 100 architecture-related problems and evaluated five LLMs with zero-shot, CoT, and agent-style prompting. The resulting F1 scores ranged only from 0.351 to 0.389. The models also produced arguments that experts did not mention, some of which were useful, but 4.12%–4.87% had uncertain correctness and 1.59%–3.24% were potentially misleading. Architecture rationale is the part that is supposed to preserve judgment over time. If the machine hallucinates the rationale, it does not merely make a bad suggestion—it creates future maintenance debt with counterfeit confidence. 

There is one place where LLMs look better: checking whether a project violates explicit, code-visible decisions. The paper Evaluating Large Language Models for Detecting Architectural Decision Violations analyzed 980 ADRs across 109 repositories and found that the strongest models exceeded 90% accuracy in a manually validated subset. But read the fine print: the same paper says performance falls short for implicit, deployment-oriented, infrastructure-dependent, or organizationally grounded decisions, and concludes that LLMs are not replacing human expertise when decisions are not focused on code. That is the boundary line. Once the problem can be reduced to explicit code evidence, the model becomes useful. Once the problem becomes actual architecture (cross-module trade-offs, infrastructure, implicit intent, team constraints, operational knowledge), the model degrades. That is not replacement. That is glorified compliance assistance.
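It helps to see what an explicit, code-visible decision looks like, because this kind of check barely needs a model at all. The sketch below assumes a hypothetical ADR saying that domain code must not import from a package named `infrastructure`; the rule, the package name, and the sample module are all invented for illustration. It uses only Python's standard `ast` module.

```python
import ast

# Hypothetical ADR: domain modules must not import the "infrastructure" package.
FORBIDDEN_PREFIX = "infrastructure"


def forbidden_imports(source: str, forbidden_prefix: str = FORBIDDEN_PREFIX) -> list[str]:
    """Return the imported module names that violate the import rule."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports such as "from . import x".
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name == forbidden_prefix or name.startswith(forbidden_prefix + "."):
                violations.append(name)
    return violations


# Illustrative domain module that breaks the hypothetical rule.
domain_module = """
import decimal
from infrastructure.postgres import OrderTable
"""

print(forbidden_imports(domain_module))
```

A rule like this is checkable because the evidence sits in the source file. Nothing comparable exists for "this service must stay deployable by a three-person team," and that is the kind of decision where the paper finds the models falling short.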

The broader software-engineering evidence points the same way. An older (summer 2025) IEEE benchmark paper, Evaluating Large Language Models on Non-Code Software Engineering Tasks, assembled 17 non-code SE tasks and found that smaller decoder-only models often outperformed proprietary frontier models under zero-shot prompting. In that benchmark, FastText, a classical baseline, achieved a mean score of 0.562, while Claude 3.5 Sonnet scored 0.505 and GPT-4o scored 0.489. The authors also noted there were no software-design tasks in the benchmark suite at all, because design and architecting are generative and harder to evaluate. That omission matters: it means we have better benchmarking infrastructure for classifying tickets than for evaluating whether a proposed architecture will age well under load, regulation, cost pressure, and weird customer behavior. Which is to say: the industry is overselling what it can easily count.

The best real-world capability framing still comes from METR (Model Evaluation & Threat Research). The original 2025 paper Measuring AI Ability to Complete Long Tasks found that the frontier models of the time had a 50%-task-completion horizon of around 50 minutes. By METR’s updated public measurements on May 8, 2026, a GPT-5 agent had a 50% time horizon of around 2 hours 17 minutes on their task suite. But METR also gives the critical interpretation: on tasks that take a human 90 minutes to 3 hours, such an agent succeeds 100% of the time for around one-third of tasks, fails 100% of the time for around one-third, and is inconsistent on the rest. And METR repeatedly warns that the suite is composed mainly of low-context, self-contained software, ML, and cybersecurity tasks, all much cleaner than the socio-technical mess that architecture lives inside. That should end the fantasy immediately. If the best public agents are still this jagged on low-context tasks, there is no serious case that they are ready to autonomously own architecture in live systems.

Finally, there is the operational reality check. Architecture failures are expensive because they happen at the boundary between reasoning and action. In April 2026, a publicly reported incident involving PocketOS described a Cursor agent powered by Claude Opus 4.6 deleting a production database and backups in nine seconds; the model itself reportedly admitted it had guessed rather than verified. Reuters also reported that Amazon’s cloud unit suffered a 13-hour outage in December after its own AI tool, Kiro, autonomously decided to delete and recreate an environment. These are not abstract hallucinations in a chat window. These are architecture-and-operations failures with blast radius. The people who can prevent those failures are not the people who can most quickly autocomplete code. They are the people who design privilege boundaries, environment separation, disaster recovery, approval workflows, and irreversible-action controls. In other words: architects. 
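An irreversible-action control can be very small. The following is a minimal sketch, not a description of how any of the tools above work: the action names, the `ActionRequest` type, and the approval rule are all hypothetical. The idea is that a destructive action against production is refused unless a named human has approved it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical list of actions that cannot be undone.
DESTRUCTIVE_ACTIONS = {"delete_database", "delete_backups", "recreate_environment"}


class ApprovalRequired(Exception):
    """Raised when a destructive production action has no human approver."""


@dataclass(frozen=True)
class ActionRequest:
    action: str
    environment: str
    approved_by: Optional[str] = None  # name of the human approver, if any


def authorize(request: ActionRequest) -> bool:
    """Allow the action, or raise if it is destructive, in production, and unapproved."""
    is_destructive = request.action in DESTRUCTIVE_ACTIONS
    is_production = request.environment == "production"
    if is_destructive and is_production and request.approved_by is None:
        raise ApprovalRequired(
            f"{request.action} on {request.environment} needs explicit human approval"
        )
    return True


# An agent asking to delete the production database without approval is stopped.
try:
    authorize(ActionRequest("delete_database", "production"))
except ApprovalRequired as error:
    print(f"blocked: {error}")
```

The code is trivial on purpose. The architectural work is deciding which actions belong on the list, who may approve them, and making sure the agent's credentials cannot bypass the check.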

Vibe coders will take over the world and break it if you do not stop them

The problem with “vibe coding” is not that it produces code fast. The problem is that it optimizes for the wrong variable. It optimizes for local output velocity at the moment of generation. Architecture optimizes for whole-life system cost: maintainability, recoverability, migration friction, team autonomy, observability, compliance, security, performance envelopes, and the ability to change direction without exploding the company. Those objectives are not aligned. Very often they are opposites. One of the strongest Reddit critiques describes AI assistants as accelerating “unverified complexity,” not engineering; another thread is full of practitioners pointing out that the real problem is not token price, but lazy defaulting to LLM calls where actual architecture should have been doing the work. 

That is why vibe coding feels magical in demos and corrosive in systems. METR’s randomized trial on experienced open-source developers found that using early-2025 AI tools made them 19% slower, even though the developers expected a 24% speedup and still believed afterward that they had gone faster. METR later said late-2025 tools may have produced some speedup in other settings, but judged that new estimate unreliable because the developers most enthusiastic about AI often refused to participate without it, which biased the measurement. The important point is not whether the true uplift today is slightly negative or slightly positive. The important point is that the felt speed of AI is a terrible proxy for long-run engineering quality. Architects should care about what survives review, survives incidents, survives six months of product drift, and survives a migration. AI hype mostly measures how quickly something can be materialized, not whether it was wise to materialize it that way. 

So what should a serious engineer do? Stop selling yourself as “someone who writes code.” Sell yourself as someone who reduces irreversible mistakes. Become the person who can write crisp requirement boundaries, turn business ambiguity into architectural decisions, document trade-offs in ADRs, define non-functional requirements before implementation starts, constrain blast radius, model failure modes, price scalability honestly, and protect the company from false acceleration. The market can commoditize syntax faster than it can commoditize judgment.

In practice, that means building strength in a few areas that LLMs still handle badly: domain modeling, system decomposition, event and data contracts, resilience design, observability design, migration strategy, dependency discipline, privacy and security architecture, and ruthless prioritization of quality attributes. If AI writes 40% of the implementation but you decide the boundaries wrong, the company loses. If AI writes 80% of the implementation and you decide the boundaries right, the company probably survives. Architecture still determines the outcome.

The dangerous part is that bad architecture is much harder to detect than bad code. A broken function fails a test. A broken deployment fails visibly. But a broken architecture can look productive for months. Vibe coders and weak architects can ship impressive-looking features while quietly multiplying coupling, hidden state, migration traps, security gaps, and operational debt. By the time the damage becomes obvious, the company is already paying compound interest.

That is why the next serious responsibility of the programming community is not just to review code, but to review architecture as a first-class artifact. We need to learn how to measure architectural decisions, compare trade-offs, challenge diagrams, inspect boundaries, price future change, and hold people accountable for system-level damage. In the past, companies learned to remove developers who consistently shipped buggy code. In the AI era, they will also need to remove people who consistently ship bad architecture.

And that is where you should position yourself. Do not become another person who can prompt an LLM into generating files. Become the person who can protect the system from people who do that without understanding the consequences. Become the architect who can say no, define the right boundaries, expose fake velocity, and save the company from the beautiful disaster of vibe-coded systems. If vibe coders are going to take over the world, someone still has to stop them from breaking it.

Limits of the Evidence

There is one caveat worth stating clearly. Public architecture-specific benchmarks are still immature. The newest architecture papers I found in 2026 benchmarked models such as GPT-5, Claude Sonnet 4.6, and Gemini 2.5 Pro; they do not yet provide broad, public architecture evaluations for GPT-5.4, GPT-5.5, or Claude Opus 4.7. That means nobody should pretend to have a definitive architecture leaderboard for the absolute latest models. But that lack of evidence is not a reason for optimism. It is a reason for caution. The best fresh architecture studies we do have still show persistent failures in relational reasoning, abstraction, and long-horizon coherence, while the newest vendor-reported coding numbers remain far from perfect even on easier, benchmark-shaped tasks.

There is also a separate caveat on productivity. The strongest 2025 real-world study found a slowdown for experienced developers, and METR’s 2026 update suggests later tools may help in some settings—but METR explicitly says the newer estimate is unreliable because of selection effects and changing developer behavior. So the honest conclusion is not “AI never helps.” The honest conclusion is that evidence for reliable replacement remains much weaker than benchmark marketing suggests, especially in high-context work. 

Conclusion

LLMs will replace a lot of implementation work. They will absolutely erase portions of routine development. They will continue to shrink teams whose value is mostly mechanical ticket throughput. But that is not the same as replacing software architects.

Architecture is not next-token prediction. It is not autocomplete with better marketing. It is long-horizon planning under uncertainty, across code, infrastructure, people, regulation, cost, and failure. The best 2026 architecture studies still show fragmented structures, broken relationships, poor abstraction, unstable agent behavior, and weak design rationale. The best 2026 productivity evidence still says real-world impact is messy, heavily context-dependent, and nowhere near the clean confidence implied by benchmark slides. And the live incident record already shows what happens when probabilistic systems are allowed to act like deterministic architects. 

So the blunt version of the thesis is this: LLMs can imitate architecture vocabulary; they cannot reliably own architectural responsibility. They can produce diagrams, but not trustworthy system structure. They can suggest decisions, but not absorb the cost of being wrong. They can help write code inside an architecture, but they still do not deserve the keys to define one.

That is why software architects are not going away. If anything, the AI era makes them more important—because the faster code gets generated, the more expensive bad architecture becomes.