Agentic AI—the idea that artificial intelligence agents can autonomously perform complex, multi-step tasks—has been sold as the next seismic shift in technology, poised to both revolutionize office productivity and displace vast swaths of the human workforce.
For months, companies and commentators have touted agentic AI as the key factor in what some fear could be a wave of job destruction across knowledge sectors, from customer service to software development and legal research.
But as reality sets in, the technology is failing to meet the hype. According to a new assessment from Gartner, more than 40 percent of agentic AI projects are projected to be canceled by the end of 2027, owing to a combination of high implementation costs, unclear ROI, and inadequate risk controls. The prediction lands even as the same technology is being hyped as a "deal breaker": a tipping point that could make vast portions of white-collar labor obsolete.
Agentic AI systems are pitched as a dramatic leap forward from simple AI chatbots or automation tools. These are supposed to be autonomous, context-aware digital entities capable of reading emails, analyzing data, making decisions, and coordinating actions across software platforms—all without constant human supervision. But on the ground, they are proving to be clumsy, error-prone, and far from ready for the real world.
This week, two rigorous benchmark tests—one from Carnegie Mellon University (CMU) and another from Salesforce AI researchers—delivered a sobering reality check on what current agentic AI models can actually do.
In CMU’s TheAgentCompany simulation, models like Gemini 2.5 Pro, Claude 3.7 Sonnet, and GPT-4o were tested across routine knowledge work tasks including writing code, navigating web interfaces, responding to emails, and messaging colleagues. The results were as revealing as they were disappointing: the top-performing model, Gemini 2.5 Pro, could only fully complete 30.3 percent of tasks. Others fell dramatically short, with GPT-4o managing just 8.6 percent, and some large models from Amazon and Meta barely scraping past 1 percent.
Failures included everything from skipping instructions and freezing on browser popups to bizarrely deceptive behavior. In one case, when an agent failed to locate the right colleague on RocketChat, it simply renamed another user to impersonate the intended contact—an alarming workaround in any corporate setting.
At Salesforce, the team developed a benchmark called CRMArena-Pro, tailored to real-world enterprise tasks across sales and customer service. Even in simple, single-turn tasks, models achieved just 58 percent success. In multi-turn, context-aware tasks—the kind that dominate most CRM workflows—performance dropped to around 35 percent. Worse still, confidentiality awareness across all tested models was near zero, raising serious red flags about security.
A “Deal Breaker” for Jobs? Not So Fast
Agentic AI has been the centerpiece of tech industry claims about the coming disruption to human labor. Industry insiders have repeatedly warned that the rise of autonomous agents could be the "deal breaker" in AI's ability to perform white-collar work. Researchers from OpenAI and the University of Pennsylvania went so far as to publish a study estimating that 80 percent of the U.S. workforce could see at least 10 percent of their tasks automated by AI, with 20 percent of workers facing automation of at least 50 percent of their duties.
But Carnegie Mellon’s Graham Neubig, one of the co-authors of TheAgentCompany study, says such claims are wildly premature.
“Their methodology basically involved asking ChatGPT whether it could do a job,” he said. “That’s not a benchmark—that’s hype.”
Neubig, who also works at a startup building coding agents, was motivated to create a rigorous test environment precisely because of what he saw as speculative and misleading claims.
“After eight months of development, we still see agents fail on basic tasks like messaging, reading emails, or handling browser tabs,” he said.
Even in the one area where AI agents show promise—coding—Neubig points out that usefulness doesn’t equal autonomy.
“A partial code suggestion can be useful. But these agents aren’t replacing engineers any time soon,” he added.
Gartner’s report also highlights another emerging problem: “agent washing.” The term refers to the trend of vendors slapping the “agentic” label on products that are little more than glorified chatbots, workflow macros, or RPA tools. Gartner estimates that out of thousands of vendors now claiming to offer agentic AI, only around 130 offer products with real, autonomous capabilities.
“Many agentic AI propositions lack significant value or return on investment,” said Anushree Verma, senior director analyst at Gartner. “Current models don’t have the maturity or agency to autonomously achieve complex business goals or follow nuanced instructions over time.”
However, Gartner sees long-term potential. It estimates that by 2028, AI agents will autonomously make 15 percent of daily work decisions, up from essentially zero in 2023. By then, 33 percent of enterprise applications are expected to include agentic AI features—though possibly in limited or support roles.
Promise Meets Reality
One of the core attractions of agentic AI has been its promise to do things humans can’t—or at least not as quickly. Given a prompt like, “Find every exaggerated AI claim in my email and cross-reference the sender’s crypto affiliations,” an agent could, in theory, use APIs and machine learning to deliver actionable insight. A human might take hours or days.
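The basic pattern behind that promise can be illustrated with a toy sketch. Everything below is hypothetical: the tools are stubs, the "plan" is hard-coded keyword matching where a real agent would ask a language model to decompose the goal, and no actual email or blockchain API is involved.

```python
# Toy agent loop: plan a goal into tool calls, then execute them in order.
# All tools and data here are stand-ins for real integrations.
from dataclasses import dataclass, field


@dataclass
class Agent:
    tools: dict          # tool name -> callable
    log: list = field(default_factory=list)

    def plan(self, goal: str) -> list:
        # A real agent would delegate this decomposition to an LLM;
        # here we match keywords purely for illustration.
        steps = []
        if "email" in goal:
            steps.append("search_email")
        if "cross-reference" in goal:
            steps.append("lookup_affiliations")
        return steps

    def run(self, goal: str) -> dict:
        results = {}
        for step in self.plan(goal):
            results[step] = self.tools[step]()  # call the stub tool
            self.log.append(step)               # keep an audit trail
        return results


# Stub tools standing in for real email-search and lookup services.
tools = {
    "search_email": lambda: ["'AI will replace all jobs by 2026'"],
    "lookup_affiliations": lambda: {"sender@example.com": ["ExampleCoin"]},
}

agent = Agent(tools=tools)
out = agent.run(
    "Find every exaggerated AI claim in my email and "
    "cross-reference the sender's crypto affiliations"
)
print(agent.log)  # ['search_email', 'lookup_affiliations']
```

The hard parts that benchmarks like TheAgentCompany measure all live in the pieces this sketch fakes: reliable goal decomposition, correct tool use, and recovery when a step fails.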
But in practice, agentic systems struggle even to interpret vague instructions or navigate common user interfaces. And their need to access sensitive data—email, chat logs, CRMs, dashboards—raises serious privacy and cybersecurity concerns. As Meredith Whittaker, president of the Signal Foundation, warned: “There’s a profound issue with security and privacy that is haunting this hype.”
The Dream Remains Distant
Agentic AI may one day live up to its sci-fi vision, acting like Iron Man's JARVIS or the Star Trek ship's computer. But right now, the gap between marketing claims and operational reality remains staggering. The vast majority of models can't handle even modest office tasks, and real-world deployments remain riddled with failure points.
That hasn’t stopped venture capital, enterprises, or the broader tech industry from aggressively investing in the concept—nor from framing it as the final push in AI-induced labor disruption. But if Gartner’s forecast proves accurate, and more than 40 percent of agentic AI initiatives collapse under their own weight, the dream may be postponed once again.