Because generative AI (genAI) tools and services have become so ubiquitous (and popular), the appetite for tokens has become insatiable and the costs of using them are going through the roof. Tokens represent a common way to measure and price AI use. Much as English text is made up of words and letters, large language models (LLMs) grasp a sentence or query by breaking its words into tokens.

With the AI explosion well under way, tokens are now “the fundamental units of data our models process, many representing a problem being solved,” according to Google CEO Sundar Pichai. (Google, by the way, processes about 3.2 quadrillion tokens a month.)

But as the price of all those tokens adds up, business and IT execs are looking for ways to cut costs while keeping corporate productivity up. Uncontrolled token use has already landed one company with an unexpected $500 million AI bill.

There are a number of ways companies can rein in the price of AI at the model, infrastructure, silicon, and business levels. Here’s a look at how some of those savings might actually be achieved.

Switch to lower-cost models

One way of potentially saving money is by rerouting AI work to a cheaper model, Pichai said. At Google, that would be Gemini 3.5 Flash. It delivers “frontier-level capabilities at less than half the price of comparable frontier models.”

“If companies use a mix of [Gemini 3.5] Flash and other frontier models, they could save a lot of money,” Pichai said.

Those kinds of models provide cheaper tokens, with reasoning that’s good enough for many users to get useful results, even if it is not as strong as that of mainstream Gemini 3.5.

“There is sometimes overkill with the [LLMs],” said Deepak Seth, senior director analyst at Gartner. “I don’t always need a large language model which has been trained on the works of Charles Dickens and Shakespeare and Harry Potter.”

Hyperframe Research principal analyst Steven Dickens can’t stop using Amazon’s Quick, which costs $20 a month, for personal tasks.
“It is great personal ROI as it has not only made tasks faster, but unlocked tasks I would never have even attempted previously,” Dickens said.

Don’t forget the hardware and software part of the equation

The token crisis isn’t new, said Dheeraj Pandey, CEO of DevRev, who likens what’s going on now in the AI market to the disruptions that emerged with the arrival of cloud computing and virtualization years ago.

“We let chaos reign and then we had to rein in the chaos,” Pandey said. “The word that people started using was server consolidation and virtualization.” The answer to the token problem, he said, is the same: “Anything in systems can be solved with caching and indirection.”

DevRev, for example, is building a memory layer between AI agents and primary data sources, such as Salesforce or ERP records; that can cut token load and make data movement more efficient. The layer holds a knowledge graph with answers to common agent questions and runs on cheaper CPUs, avoiding more costly GPU cycles.

Sending agents straight at systems like ServiceNow and Salesforce “will burn a lot more tokens. It’s also not precise. And finally, it’s not safe enough where I can roll it back in case an agent has committed a mistake,” Pandey said.

Network automation firm NetBrains takes a different approach: It uses conventional computing to map a network’s layout, then feeds only key information to models for planning and reasoning, where AI excels.

“So you don’t have to spend all the tokens,” said NetBrains CTO Sang Peng.

Focus on prompt efficiency

Staffing firm ManpowerGroup has found that prompt efficiency can be an effective tool for improving token use, both internally and externally for clients. For example, users accessing its internal labor-market tool initially needed 10 follow-up questions to drill into a query. A year later, more efficient use of prompts has brought that number down to an average of four, said Max Leaming, head of data science and AI solutions at ManpowerGroup.
“They’re using fewer tokens and they’re simply more efficient,” he said. “And that in large part has to do with your ability to prompt efficiently.”

Go local

New AI hardware that generates free tokens at home could ease some of the cost crisis. At GTC Taipei earlier this month, Nvidia and Microsoft unveiled RTX Spark, an agentic AI desktop PC that runs agents and 120-billion-parameter models locally on Windows.

The goal is “to deliver unmetered intelligence to every home and every desk with Windows,” Microsoft CEO Satya Nadella said in a statement.

Some companies are looking to reduce cloud AI costs by putting their own hardware in data centers, with vendors such as HPE and Dell providing servers installed in independent facilities. (On-premises AI is gaining ground amid sovereign AI and geopolitical concerns, including the recent conflict in the Middle East, where large data centers were struck with missiles.)

“There are local, region-specific and multiple vendor AI solutions. All of those things can help mitigate the risk. But they’re not going to eliminate it,” said Max Goss, senior director analyst at Gartner.

Use forward-deployed engineers

Reducing token costs is something that may fall to forward-deployed engineers (FDEs) in customer environments, said Taimur Rashid, managing director of AWS’s Generative AI Innovation Center.

“I expect these teams to be able to architect systems that have those cost requirements in mind, whether it’s use a different model or a different use case that doesn’t increase the per-token cost,” Rashid said.

Companies may spend heavily on token consumption, “but if you’re generating revenue, as long as the economics work out, then you’re at peace,” Rashid said.

The use of FDEs is gaining ground as IT decision-makers look both to roll out successful AI deployments and to keep an eye on costs.
Change the measure of success from tokens to outcomes

Even with the current emphasis on reducing token use to save money, the metrics used to measure AI success are likely to shift, Gartner’s Seth said. At some point, token-based pricing will move more toward an outcome-based model, where the unit of value is outcomes, not fragments of words.

“Some companies are moving towards outcome-based pricing,” Seth said. “When people start realizing the real cost of tokens, then companies will start looking at token efficiency.”
June 18, 2026 at 7:03 AM
How companies are racing to solve the AI token problem
Computerworld