5 Mistakes with AI Agents — And Why Manufacturing Needs Different Answers Than the Startup World

An Industrial Translation of Jamin Mahmood-Wiebe’s t3n Article


Jamin Mahmood-Wiebe recently published a piece on t3n outlining five mistakes that stall AI agents in the enterprise. His conclusion: “Architecture beats technology.” It’s a solid article. But it’s written from the perspective of web development and SaaS startups.

I’ve spent 30 years in manufacturing. And I can confirm: every single one of these five mistakes exists on the shop floor — just with very different consequences.

When a chatbot on a website hallucinates, a customer gets a wrong answer. Annoying. When an AI agent in the supply chain hallucinates, it triggers a purchase order for 80,000 parts that nobody needs. Or it classifies a critical supplier as low-risk — three weeks before the line goes down.

The stakes are fundamentally different. And that’s why industrial midmarket companies need different answers than those presented at tech conferences.

Here’s my translation.


Mistake 1: Demo vs. Production — The “Potemkin Factory”

What t3n describes

In a pilot with 500 requests, everything looks great: 95 percent accuracy, two-second response times. The board is impressed, the budget gets approved. Then the system goes live — 10,000 requests per day — and accuracy drops to 80 percent while latency explodes by a factor of 20.

Mahmood-Wiebe recommends load testing with real data and a phased rollout: start with 5 percent of volume, then 20, then 50.

The industrial reality

Fair recommendation. But in manufacturing, it doesn’t go far enough. Because the problem starts in the pilot itself.

95 percent accuracy sounds like a reasonable starting point in web development. In supply chain operations, it’s a disaster. I’ve done the math in my article on FMEA 2.0 for AI agents: at 1,000 automated purchase transactions per week, a 5 percent error rate means 50 faulty transactions. Each one can trigger an emergency shipment, a contractual penalty, or a line stoppage. The financial impact per error in an automotive environment sits comfortably between 200 and 500 euros — conservatively. That’s 10,000 to 25,000 euros per week. Roughly half a million to well over a million euros per year.
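The arithmetic is easy to check in a few lines. The volumes and per-error costs below are the illustrative figures from the text, not measured data:

```python
# Back-of-envelope cost of a 5% agent error rate in automated purchasing.
# All figures are illustrative assumptions, not measured data.

transactions_per_week = 1_000
error_rate = 0.05                      # 95% accuracy
cost_per_error_eur = (200, 500)        # conservative range per faulty transaction

errors_per_week = transactions_per_week * error_rate
weekly_cost = tuple(c * errors_per_week for c in cost_per_error_eur)
annual_cost = tuple(w * 52 for w in weekly_cost)

print(f"Faulty transactions per week: {errors_per_week:.0f}")
print(f"Weekly cost: {weekly_cost[0]:,.0f}-{weekly_cost[1]:,.0f} EUR")
print(f"Annual cost: {annual_cost[0]:,.0f}-{annual_cost[1]:,.0f} EUR")
```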

And the phased rollout? In manufacturing, we’ve known this concept for 40 years. We call it the ramp-up curve. Every production manager knows: when a new machine is installed, you don’t run it at full capacity on day one. You ramp up shift by shift, measure scrap rates, adjust parameters.

That’s exactly how we need to treat AI agents. Not as a software deployment with a release date, but as the commissioning of a machine — with a ramp-up curve, quality gates, and a supervisor standing right next to it.

This isn’t a rollout plan. It’s a commissioning protocol.
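What such a commissioning protocol could look like in code, as a minimal sketch: route only a phase-dependent share of transactions to the agent, and advance the phase only when a quality gate holds. The phase shares and the gate threshold here are assumptions, not recommendations:

```python
import random

# Illustrative commissioning protocol for an AI agent: ramp up the share of
# transactions routed to the agent phase by phase, and advance only when the
# measured error rate clears a quality gate. Thresholds are assumptions.

PHASES = [0.05, 0.20, 0.50, 1.00]      # share of volume handled by the agent
MAX_ERROR_RATE = 0.01                  # quality gate: at most 1% errors per phase

def next_phase(current: int, errors: int, handled: int) -> int:
    """Advance to the next ramp-up phase only if the quality gate holds."""
    if handled == 0 or errors / handled > MAX_ERROR_RATE:
        return current                  # hold the phase instead of advancing
    return min(current + 1, len(PHASES) - 1)

def route(phase: int) -> str:
    """Route one transaction to the agent or the human, per the current phase."""
    return "agent" if random.random() < PHASES[phase] else "human"
```

The supervisor from the text is the `errors` input: somebody has to measure scrap, exactly as with a new machine.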


Mistake 2: Vibe Coding — “When the Intern Programs the Factory”

What t3n describes

The AI network Moltbook was exposed: 4.75 million records were publicly accessible because the founder had the entire platform generated by an AI assistant — with no security review. Backslash Security found that GPT-4o produces vulnerable code in 90 percent of cases. Mahmood-Wiebe calls for automated security scans and mandatory code reviews.

The industrial reality

In the midmarket, the code problem is real. But it’s not the primary risk. The primary risk is the process problem.

Consider this: an internal “AI champion” — motivated, technically capable, but without deep process knowledge — uses Copilot to build an agent that generates purchase order proposals in SAP. The agent works in testing. It gets deployed. What nobody validated:

  • Does the agent know the minimum order quantities from the framework agreements?
  • Does it account for blocked stock logic?
  • Does it know that Supplier X only delivers against prepayment?
  • Does it understand that hazardous materials require different freight routes?

The answer, in most cases: No. Not because the code is bad. But because the business rules were never made explicit. They live in the head of the dispatcher who’s been doing the job for 15 years. They’re not in any specification document and they’re certainly not in any prompt.

In web development, vibe coding needs a security review. In manufacturing, it needs a process review: Does the agent understand the business rules that no one ever documented?
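What a process review forces is precisely this: the undocumented rules become explicit, testable gates in front of the ERP posting. A minimal sketch covering three of the checklist items above, with entirely hypothetical rule data and field names (this is not a real SAP schema):

```python
# Undocumented business rules made explicit as hard validation gates.
# All rule data and field names are hypothetical examples.

MIN_ORDER_QTY = {"SUP-X": 500}          # from framework agreements
PREPAYMENT_ONLY = {"SUP-X"}             # suppliers that ship only against prepayment
HAZMAT_MATERIALS = {"MAT-4711"}         # materials needing special freight routes

def validate_po_proposal(po: dict) -> list[str]:
    """Return every business-rule violation; an empty list means 'postable'."""
    violations = []
    min_qty = MIN_ORDER_QTY.get(po["supplier"], 0)
    if po["quantity"] < min_qty:
        violations.append(f"below minimum order quantity ({min_qty})")
    if po["supplier"] in PREPAYMENT_ONLY and po["payment"] != "prepayment":
        violations.append("supplier delivers only against prepayment")
    if po["material"] in HAZMAT_MATERIALS and po["freight"] != "hazmat":
        violations.append("hazardous material needs a hazmat freight route")
    return violations
```

The point is not the code. The point is that writing it forces the dispatcher's 15 years of knowledge into a form the agent can be tested against.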

This is the moment where “Adult Supervision” stops being a metaphor and becomes an operational necessity. Someone needs to stand between the AI system and the ERP system who understands both worlds. Someone who speaks the language of the engineer — and the language of the algorithm.


Mistake 3: Hidden Costs — The Token Trap in Procurement

What t3n describes

A striking example: one company started with agent costs of $500 in week 1. By week 4, they were at $18,400. No one had defined a cost dashboard or a loop budget. The Google DeepMind/MIT study confirms: multi-agent systems cost multiples more per solved task than single agents — while delivering worse results.

The industrial reality

The cost explosion Mahmood-Wiebe describes is an infrastructure problem in the tech world. In manufacturing, it’s a business case problem — and we need to frame it in categories that the CFO understands.

The critical question isn’t: “What do the tokens cost?” The critical question is: What does the agent cost per business transaction?

  • What does it cost per automated purchase order?
  • What does it cost per supplier evaluation?
  • What does it cost per RFQ analysis?

If the answer is “four times what the buyer costs who does it manually” — that’s not progress. That’s innovation theater with a negative business case.

In manufacturing, we have the governance instrument for this. It’s called a cost center. Every agent needs a virtual cost center with a budget cap. When the agent has consumed its token budget for the month, it doesn’t go into overdraft — it escalates to a human.
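A virtual cost center with a hard cap might look like the following sketch. The budget figure and the escalation mechanism are illustrative assumptions:

```python
# Sketch of a per-agent "virtual cost center": a hard monthly budget.
# When the budget is spent, the agent does not overdraw; it escalates
# to a human. Figures and names are illustrative assumptions.

class AgentCostCenter:
    def __init__(self, agent_id: str, monthly_budget_eur: float):
        self.agent_id = agent_id
        self.budget = monthly_budget_eur
        self.spent = 0.0

    def charge(self, transaction_cost_eur: float) -> str:
        """Book one transaction; escalate instead of overdrawing the budget."""
        if self.spent + transaction_cost_eur > self.budget:
            return "escalate_to_human"   # hard stop, no overdraft
        self.spent += transaction_cost_eur
        return "approved"

cc = AgentCostCenter("po-agent", monthly_budget_eur=300.0)
```

Dividing `spent` by the number of approved transactions then answers the CFO's question directly: cost per business transaction, not cost per token.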

And the finding that multi-agent systems are more expensive than single agents? That’s the “motor vs. body” trap in its purest form: three engines in one car don’t make it faster. They make it heavier, more expensive, and impossible to maintain. One well-built engine, deeply integrated into the right chassis, beats three loosely wired engines on a test bench — every single time.


Mistake 4: Multi-Agent Without Physics — The 45 Percent Rule on the Shop Floor

What t3n describes

Google DeepMind and MIT showed across 180 controlled experiments: once a single agent solves more than 45 percent of a task correctly, adding more agents provides negligible improvement. For sequential tasks, additional agents actually degrade results by 39 to 70 percent.

The industrial reality

This finding is particularly brutal for supply chain operations. Because supply chain processes are sequential.

Requirements planning → Purchase requisition → RFQ comparison → Purchase order → Order confirmation → Goods receipt → Invoice verification.

Every step depends on the one before. Every error propagates downstream. This isn’t a software architecture problem — it’s physics. Or more precisely: it’s the logic of production scheduling that every manufacturing planner knows by heart.

Now imagine: one agent performs supplier evaluation at 60 percent accuracy, and a second agent “reviews” the result. What happens is not what the architects hope for. The second agent doesn’t make it better. It makes it different. And suddenly you have two agents contradicting each other on the risk assessment of Supplier X. Who decides?

Exactly: a human. The same human you were trying to relieve.

The solution isn’t less ambitious — it’s more focused: one agent per clearly defined process step. With specified inputs and outputs. With a human gate between steps, until trust has been established.
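The one-agent-per-step pattern with human gates can be sketched in a few lines. The step names follow the chain above; the agents here are trivial stand-ins, not real implementations:

```python
# Minimal sketch of "one agent per process step, with a human gate between
# steps". The agents are stand-ins (trivial functions), not real systems.

STEPS = ["requisition", "rfq_comparison", "purchase_order"]

def run_step(step: str, payload: dict) -> dict:
    # Stand-in for the single agent responsible for exactly this step.
    return dict(payload, **{step: "done"})

def run_pipeline(payload: dict, approve) -> dict:
    """Run the sequential chain; a person approves each step's output."""
    for step in STEPS:
        payload = run_step(step, payload)
        if not approve(step, payload):
            raise RuntimeError(f"stopped at gate after step: {step}")
    return payload
```

When a gate stops the line, everyone immediately knows which station failed, which is the whole point.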

In manufacturing, we call this principle takt time. Every station does one thing. Correctly. Reliably. And when one station fails, everyone immediately knows where the problem is.

AI agents need takt. Not teamwork.


Mistake 5: Context Engineering — Without Master Data, Everything Falls Apart

What t3n describes

Anthropic demonstrated that Claude Opus 4.5 initially scored only 42 percent on a benchmark — not because of the model, but because of rigid evaluation criteria that marked “96.12” as incorrect when “96.124991…” was expected. After repairing the evaluation (not the model), performance jumped to 95 percent. Mahmood-Wiebe’s conclusion: it’s not the model that matters, but the context the agent sees at every step. “Context engineering” is the real architectural discipline.

The industrial reality

This is where my heart beats loudest as an “Industrial Translator.” Because what Mahmood-Wiebe calls “context engineering” is, in manufacturing, a well-known — and chronically neglected — discipline: master data quality.

I experienced this firsthand at an automotive supplier near Stuttgart, years ago. We were implementing a Transport Management System — technically flawless, professionally configured. Then came the reality check: the system couldn’t optimize routes. Not because the algorithm was bad. But because nobody knew how much the parts weighed. Weights and dimensions — the most fundamental master data in logistics — were missing. Or wrong. Or buried in an Excel file that hadn’t been updated in years.

That was 2015. And it hasn’t gotten better by 2026. When you tell an AI agent today: “Evaluate the delivery performance of Supplier X,” that agent doesn’t just need access to goods receipt data. It needs:

  • Correct supplier classification (A/B/C)
  • Current framework agreement data
  • Historical deviation rates
  • Information on alternative suppliers
  • Knowledge of tolerance thresholds per material group

If any one of these data points is missing, incorrect, or outdated, the agent doesn’t return an obviously wrong answer. It returns a plausible-sounding wrong answer. And that’s more dangerous than an obvious error, because nobody questions it.
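One practical consequence: check master data completeness and freshness before the agent is allowed to answer at all, and escalate on a gap rather than let it produce a plausible-sounding wrong answer. A minimal sketch, with assumed field names and an assumed freshness threshold:

```python
from datetime import date, timedelta

# Gate in front of the agent: every required master data point must be
# present and the record must be current, otherwise escalate to a human.
# Field names and the freshness threshold are illustrative assumptions.

REQUIRED_FIELDS = ["abc_class", "framework_agreement", "deviation_history",
                   "alternative_suppliers", "tolerances"]
MAX_AGE = timedelta(days=180)          # assumed freshness threshold

def master_data_ok(record: dict, today: date) -> bool:
    """True only if all required fields exist and the record is current."""
    if any(record.get(f) in (None, "", []) for f in REQUIRED_FIELDS):
        return False
    return today - record["last_updated"] <= MAX_AGE
```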

Context engineering in manufacturing is not prompt optimization. It’s the combination of master data management and process expertise. And that combination can’t be delivered by a prompt engineer who has never touched a bill of materials. It requires the Business Quotient — the ability to translate technological capability into operational reality.


Conclusion: Process Knowledge Beats Architecture

Jamin Mahmood-Wiebe writes: “Architecture beats technology.” I agree. But I’d add:

Process knowledge beats architecture.

The best token budgets, load tests, and evaluation frameworks in the world won’t help if the person designing the agent has never touched a bill of materials. If they don’t know what a blocked stock posting is. If they can’t distinguish between a framework agreement call-off and a one-time purchase order.

The five mistakes from the t3n article are real. But their industrial translation reveals something deeper: the solution doesn’t lie in better technology, nor in better architecture, and certainly not in better prompts.

The solution lies in Adult Supervision — in people who understand technology and operational reality simultaneously. Who don’t just admire the engine, but know how to integrate it into the right chassis. Who understand that a ramp-up curve isn’t a rollout plan — it’s a commissioning protocol.

That’s what I call “Industrial Translation.”

And it’s what the midmarket needs right now — not more agents, but the right people to govern them.


The original article by Jamin Mahmood-Wiebe was published on t3n.

E-Mail: sven.vollmer@business-quotient.com

Sven Vollmer is “The Industrial Translator.” He bridges the gap between industrial operational reality (SAP, supply chain) and the possibilities of generative AI. His focus is on value-creating applications beyond the hype.

Transparency Note: This article was created with editorial support from AI (Gemini/Claude). The ideas, technical validation, use case selection, and adult supervision were 100% authored by Sven Vollmer.

LinkedIn: www.linkedin.com/in/sven-vollmer-bq
