Beyond Tokenmaxxing

May 28, 2026

Over the last year, AI adoption has accelerated. Engineers are using it for all sorts of things – from writing code to managing their digital lives. It feels like an entire generation of dormant builders suddenly came back to life.

And like every technological shift before it, enthusiasm quickly found a metric. Managers and leaders began building dashboards and internal tools to measure how thoroughly their teams had adopted AI. Usage, measured in tokens and dollars spent, became visible, and visibility became competition. Before long, people weren't just building; they were measuring how much AI they consumed while doing it.

“Meta also introduced internal dashboards to track employees’ consumption of ‘tokens,’ a unit of A.I. use that is roughly equivalent to four characters of text, four people said. Some said the dashboards were a pressure tactic to encourage competition with colleagues. That led some employees to make so many A.I. agents that others had to introduce agents to find agents, and agents to rate agents, two people said.” [1]

“At Anthropic, a single user of the company’s A.I. coding system, Claude Code, racked up a bill of more than $150,000 in a month.” [2]

While consumption became measurable, impact did not. A developer running an agent loop and a developer debugging a production incident could look identical on the spend dashboard. They spent the same tokens, but the value they got out was completely different.

Token (or dollar-spend) data was made available to answer the question “how much did we use?” It was built for billing, not for management or leaderboards, and some companies [3] are already starting to notice the disconnect. According to Aishwarya Sankar of Entillegence AI [4], 82% of tokens spent never make it into the product, and over 44% are spent on bug fixes, potentially for bugs created by AI agents on the loose.

The challenge is no longer getting engineers to use AI; it is understanding whether that usage creates meaningful leverage.


Measuring Leverage

Instead of tracking consumption, we should track how much leverage AI creates for developer work. One approach is to weight developer activities by two variables: how frequently they occur and how much time or friction AI removes from them.

The most valuable workflows are often not the most expensive ones, but the ones that repeatedly remove friction from high-frequency tasks.

By weighting activities that represent real developer work (code changes, context gathered, investigations) and dividing by cost, we get a signal that reflects purposeful spend rather than raw consumption. It doesn’t penalize high usage: a developer doing meaningful work at scale should score well. What it penalizes is waste: agent loops, redundant scans, and tokenmaxxing that looks productive on a dashboard but is not.

| Activity | Frequency | Time Saved | Weight | Data Source |
| --- | --- | --- | --- | --- |
| Lines added | Extremely high | Extremely high | 1 | GitHub |
| Lines removed | High | Extremely high | 0.875 | GitHub |
| Refactoring (updates, deletes) | Medium | Extremely high | 0.75 | GitHub |
| Skill invocations | Extremely high | Extremely high | 1 | Vendor |
| MCP invocations | High | Extremely high | 0.875 | Vendor |

Where:
Weight = ( Frequency + Time Saved ) / 2
Using: Low = 0.25, Medium = 0.5, High = 0.75, Extremely high = 1
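
The weighting rule can be sketched in a few lines of Python. This is only an illustration of the formula above, not the tooling behind the post; the names `LEVELS`, `RATINGS`, `WEIGHTS`, and `weight` are hypothetical.

```python
# Numeric value of each rating level, as given in the text.
LEVELS = {"Low": 0.25, "Medium": 0.5, "High": 0.75, "Extremely high": 1.0}


def weight(frequency: str, time_saved: str) -> float:
    """Weight = (Frequency + Time Saved) / 2."""
    return (LEVELS[frequency] + LEVELS[time_saved]) / 2


# (frequency, time saved) ratings copied from the table above.
RATINGS = {
    "Lines added": ("Extremely high", "Extremely high"),
    "Lines removed": ("High", "Extremely high"),
    "Refactoring": ("Medium", "Extremely high"),
    "Skill invocations": ("Extremely high", "Extremely high"),
    "MCP invocations": ("High", "Extremely high"),
}

WEIGHTS = {activity: weight(*levels) for activity, levels in RATINGS.items()}

for activity, w in WEIGHTS.items():
    print(f"{activity}: {w}")
```

Keeping the ratings as labels rather than numbers makes it easy to re-rate an activity, or add a new one, without recomputing weights by hand.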

Merged code changes

This is one of the most common use cases, something developers do almost every day. In my opinion, lines removed are almost as valuable as lines added, but most metrics ignore them entirely. Deletions and renames feel cheap when delegated to Claude because a single instruction can touch dozens of files. But the decision to delete still requires judgment, and the value of that judgment is high.

Skills and MCP invocations

Skills are scoped, reusable, context-aware workflows that make repeated developer tasks faster, more consistent, and less manual. MCPs (or connectors) are especially valuable because they gather and reconstruct context across systems. In many engineering workflows, context retrieval may be harder and more time-consuming than content generation itself. Some vendors [5] already expose parts of this data through analytics APIs, which can serve as a useful starting point for measuring leverage in developer workflows.

A simple heuristic

Using the above signals, we can construct a simple heuristic for estimating leverage relative to cost.

Leverage Score = Σ(activity × weight) / cost ($)

Let’s look at an example:

| Activity | Developer A | Developer B |
| --- | --- | --- |
| Lines added | 2000 | 1500 |
| Lines removed | 50 | 75 |
| Refactoring (updates, deletes) | 20 | 10 |
| Skill invocations | 2 | 10 |
| MCP invocations | 4 | 8 |
| Cost | $500 | $200 |
| Score | 4.13 | 7.95 |

Both developers produced comparable output, but Developer B achieved nearly twice the leverage per dollar spent. The difference was not raw activity, but how effectively AI was used to reduce friction, reuse workflows, and gather context.
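
To make the arithmetic behind the two scores reproducible, here is a small Python sketch that recomputes them from the example table. The names (`leverage_score`, `WEIGHTS`, `dev_a`, `dev_b`) are illustrative assumptions rather than part of any existing tool, and the weights are copied from the earlier weight table.

```python
# Activity weights from the weight table earlier in the post.
WEIGHTS = {
    "lines_added": 1.0,
    "lines_removed": 0.875,
    "refactoring": 0.75,
    "skill_invocations": 1.0,
    "mcp_invocations": 0.875,
}


def weighted_activity(activity: dict) -> float:
    """Sum of activity counts multiplied by their weights."""
    return sum(count * WEIGHTS[name] for name, count in activity.items())


def leverage_score(activity: dict, cost_usd: float) -> float:
    """Leverage Score = sum(activity x weight) / cost ($)."""
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return weighted_activity(activity) / cost_usd


dev_a = {"lines_added": 2000, "lines_removed": 50, "refactoring": 20,
         "skill_invocations": 2, "mcp_invocations": 4}
dev_b = {"lines_added": 1500, "lines_removed": 75, "refactoring": 10,
         "skill_invocations": 10, "mcp_invocations": 8}

score_a = leverage_score(dev_a, 500)
score_b = leverage_score(dev_b, 200)

print(f"Developer A: {score_a:.2f}")  # 4.13
print(f"Developer B: {score_b:.2f}")  # 7.95
```

The weighted sums before dividing by cost are 2064.25 for Developer A and 1590.125 for Developer B, so printing `weighted_activity(...)` alongside the score shows how much of each result comes from activity and how much from spend.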


A step further

Tools like Traces could become an observability layer for human-agent collaboration. Sessions, attachments, tool invocations, and execution traces help reconstruct how work actually happened rather than simply measuring how many tokens were consumed.

Better visibility shifts the optimization target from activity to effectiveness. The goal here is not to create a perfect productivity metric; such a metric likely does not exist. The goal is to avoid mistaking consumption for effectiveness.


References

  1. Meta’s Embrace of A.I. Is Making Its Employees Miserable
  2. More! More! More! Tech Workers Max Out Their A.I. Use
  3. Uber's COO says it's getting harder to justify the money spent on AI tokenmaxxing
  4. Aishwarya Sankar on X
  5. Claude Enterprise Analytics API Reference Guide