
Context engineering for SMBs: how to lower AI costs without losing quality

Practical techniques to reduce inference cost in AI projects: efficient code search, model routing, desktop automation, and reusable context patterns. Data from Semble, agent-desktop, and hybrid approaches.


As AI projects move from pilot to daily operations, the bottleneck shifts from model quality to cost per session. In recent weeks, several tools have documented notable reductions: Semble reports up to 98% fewer tokens in code search versus grep-plus-read, agent-desktop reports 78-96% fewer tokens in desktop automation, and hybrid approaches like DeepClaude promise up to 17x cost reduction for low-risk tasks.

For an SMB, the practical question is not which model is most powerful. It is which combination keeps acceptable quality at a sustainable cost.

Short answer

Inference cost drops through context engineering: give the model only what is needed, in the most compact format possible, and delegate to a cheaper model what does not require the expensive one. Quality holds if routing and reduction decisions are backed by evals and observability.

Where money goes

| Cost source | Example |
| --- | --- |
| Full file reads | Agent doing grep + opening 20 whole files to find a function |
| Full app tree dumps | Desktop automation serializing the whole UI on every step |
| Long conversations without summaries | Sessions dragging irrelevant history turn after turn |
| Premium models for mechanical tasks | Mass refactor run with the most expensive model available |
| Retries without policy | Loops re-calling the model on every trivial failure |

Each one has a different optimization lever.

Techniques with measurable impact

| Technique | Reported saving | When it applies |
| --- | --- | --- |
| Embedding + BM25 retrieval (Semble-style; sketch below) | Up to 98% in code search | Code search in mid-to-large repos |
| Accessibility tree snapshot (agent-desktop-style) | 78-96% in desktop automation | Controlling Slack, Notion, VS Code, and similar apps |
| Incremental context summary | 30-60% in long sessions | Any assistant with turn memory |
| Difficulty-based model routing | Up to 17x depending on mix | Heterogeneous tasks with quality margin |
| Response caching | Variable | Repeated prompts or stable templates |
| Versioned Skills and AGENTS.md | Indirect but high | Teams with reusable context |
These numbers are indicative. Real savings depend on stack, problem nature, and team discipline.
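
To make the first row concrete, here is a minimal sketch of hybrid retrieval fused with Reciprocal Rank Fusion (RRF), the combination Semble's documentation describes. It assumes the open-source rank_bm25 package and a caller-supplied embedding function; the names and structure are illustrative, not Semble's actual API.

```python
# Hybrid code search sketch: BM25 keyword ranking fused with embedding
# ranking via Reciprocal Rank Fusion (RRF). Illustrative, not Semble's API.
from typing import Callable, Sequence

import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25


def rrf(rankings: Sequence[Sequence[int]], k: int = 60) -> list[int]:
    """Fuse several rankings of document ids into one ordering."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


def hybrid_search(query: str, docs: list[str],
                  embed: Callable[[list[str]], np.ndarray],
                  top_k: int = 5) -> list[str]:
    # Keyword ranking (naive whitespace tokenization, enough for a sketch)
    bm25 = BM25Okapi([d.split() for d in docs])
    kw_rank = np.argsort(bm25.get_scores(query.split()))[::-1].tolist()
    # Semantic ranking; in practice, embed documents once at index time
    doc_vecs, q_vec = embed(docs), embed([query])[0]
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    sem_rank = np.argsort(sims)[::-1].tolist()
    # Only the fused top-k fragments reach the model, not whole files
    return [docs[i] for i in rrf([kw_rank, sem_rank])[:top_k]]
```

The saving comes from the last line: the model sees a handful of fragments instead of twenty full files.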

Routing map

```mermaid
flowchart LR
  A[Incoming task] --> B{High risk or quality-critical?}
  B -- Yes --> C[Premium model]
  B -- No --> D{Templated or repetitive?}
  D -- Yes --> E[Cheap model + cache]
  D -- No --> F[Mid-tier model]
  C --> G[Human validation]
  E --> H[Automated validation]
  F --> H
```

The goal is to send each task to the cheapest model that meets the quality bar. This is not technical dogma; it is a cost decision.
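
The same decision tree, sketched as code. The model names and the two task fields are placeholders to map onto your own provider and risk policy:

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    high_risk: bool   # e.g. customer-facing, legal, or financial output
    templated: bool   # repetitive work that fits a known template


def route(task: Task) -> tuple[str, str]:
    """Return (model tier, validation path), mirroring the flowchart."""
    if task.high_risk:
        return "premium-model", "human-validation"
    if task.templated:
        return "cheap-model+cache", "automated-validation"
    return "mid-tier-model", "automated-validation"
```

In production the two booleans would come from task metadata or a lightweight classifier; what matters is that every routing decision is explicit and auditable.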

Reducing context without losing quality

| Pattern | What it does | When to avoid |
| --- | --- | --- |
| Top-k retrieval | Returns only the most relevant fragments | When the answer needs global vision |
| Rolling summary | Compresses old turns (sketch below) | When key references from the start are lost |
| Strict schemas | Bound output to what is needed | Exploratory tasks |
| Versioned Skills | Reuse repo conventions | Without discipline, they go stale |
| Explicit memory | Stores project decisions | Without review, it fills with noise |

Common principle: every token sent to the model must justify its existence. Anything that does not contribute to the answer goes.
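
As an example of the rolling-summary row above, a minimal sketch. The `summarize` parameter stands in for any call to a cheap model; it is an assumption, not a specific API:

```python
def compress_history(turns: list[str], keep_last: int = 10,
                     summarize=None) -> list[str]:
    """Rolling summary: fold everything older than the last `keep_last`
    turns into one synthetic turn, keeping recent turns verbatim."""
    if len(turns) <= keep_last:
        return turns
    old, recent = turns[:-keep_last], turns[-keep_last:]
    if summarize is None:
        # Truncation fallback so the sketch runs standalone; in practice
        # `summarize` calls a cheap model with instructions to keep
        # decisions, names, and open questions.
        summary = " / ".join(t[:80] for t in old)
    else:
        summary = summarize("\n".join(old))
    return [f"[summary of {len(old)} earlier turns] {summary}", *recent]
```

Keeping the last 10 turns verbatim matches the customer-support case below; tune `keep_last` against real conversations, since the failure mode is losing key references from the start.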

Typical SMB case

| Context | Before | After |
| --- | --- | --- |
| Internal RAG assistant | Each query runs a global grep | Embedding + BM25 indexing |
| Slack and Notion automation | Screenshot and OCR | Accessibility tree snapshot |
| Weekly mechanical refactor | Premium model with long prompts | Mid-tier model with a template and diff |
| Report generation | Long chain without cache | Fixed template + caching per type (sketch below) |
| Customer support | Conversation without compression | Rolling summary every 10 turns |

The sum of small optimizations usually pays more than replacing the main model.
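
For the report-generation row, a minimal caching sketch keyed by report type plus a hash of the filled template, with an explicit TTL so invalidation is a policy rather than an accident:

```python
import hashlib
import time


class ResponseCache:
    """Prompt-level cache with explicit TTL-based invalidation."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def _key(self, report_type: str, prompt: str) -> str:
        digest = hashlib.sha256(prompt.encode()).hexdigest()
        return f"{report_type}:{digest}"

    def get(self, report_type: str, prompt: str) -> str | None:
        entry = self._store.get(self._key(report_type, prompt))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None  # miss or expired: call the model, then put()

    def put(self, report_type: str, prompt: str, response: str) -> None:
        self._store[self._key(report_type, prompt)] = (time.time(), response)
```

The one-hour TTL is an arbitrary default; the point is that expiry is written down, not implied.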

Metrics that matter

| Metric | What it indicates | How to measure |
| --- | --- | --- |
| Cost per closed task | Real efficiency | Total cost / completed and accepted tasks |
| Cost per useful turn | Spots wasteful sessions | Cost / messages with delivered value |
| Acceptance rate | Perceived quality | % of responses used without rework |
| Average latency | User experience | Time from request to final answer |
| Fallback rate | Routing stability | % of tasks escalated to a more expensive model |

Without these metrics, cost optimization becomes a feeling. With them, it becomes a decision.
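
A minimal sketch computing these metrics from a session log; the record fields are assumptions about what your observability layer captures per task:

```python
def cost_metrics(records: list[dict]) -> dict[str, float]:
    """Compute the table's metrics from per-task records with assumed
    fields: cost, accepted (bool), escalated (bool), latency_s."""
    n = max(len(records), 1)
    total_cost = sum(r["cost"] for r in records)
    accepted = sum(r["accepted"] for r in records)
    return {
        "cost_per_closed_task": total_cost / max(accepted, 1),
        "acceptance_rate": accepted / n,
        "fallback_rate": sum(r["escalated"] for r in records) / n,
        "avg_latency_s": sum(r["latency_s"] for r in records) / n,
    }
```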

Common mistakes

  1. Switching models before reducing context.
  2. Aggressive caching without clear invalidation.
  3. Automatic routing without evals to catch regressions.
  4. Confusing “fewer tokens” with “less quality” without measuring it.
  5. Reducing context for tasks that need global vision.
  6. Optimizing the main model and ignoring noise in long sessions.

Hard rules to keep quality

  1. Every saving technique passes an eval before rollout (see the sketch after this list).
  2. Routing decisions are auditable.
  3. Context compression is tested against real edge cases.
  4. The cache has an explicit invalidation policy.
  5. Critical tasks may jump to the premium model when risk justifies it.
  6. Observability covers cost, latency, and quality equally.
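
Rule 1 as a minimal gate, assuming you maintain a golden set of real tasks with expected outputs; `candidate` and `judge` are your own functions, not a specific framework:

```python
from typing import Callable


def eval_gate(candidate: Callable[[str], str],
              golden_set: list[tuple[str, str]],
              judge: Callable[[str, str], bool],
              min_pass: float = 0.95) -> bool:
    """Ship a saving technique only if it clears the bar on a golden
    set of (query, expected) pairs. `candidate` runs the optimized
    pipeline; `judge` compares its answer against the expectation."""
    passed = sum(judge(candidate(q), expected) for q, expected in golden_set)
    return passed / len(golden_set) >= min_pass
```

Run the same gate on every change (the evals row in the next table), not just at initial rollout.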

Progress indicators

| Indicator | Good | Bad |
| --- | --- | --- |
| Measured token reduction | Before/after data per flow | "It feels cheaper" |
| Routing | Explicit, reviewed rules | Inherited config without review |
| Evals | Suite running on every change | Sporadic manual checks |
| Latency | Within agreed business SLA | Variable, not measured |
| Cost per task | Sustained downward trend | Only the monthly bill is checked |

Final criterion

Inference cost is an engineering problem, not a model problem. When an SMB measures well and reduces context with discipline, the bill drops without sacrificing quality. The difference between “expensive” AI and sustainable AI lies in who decides what reaches each model and why.

Working sources

  • Semble's public documentation on Model2Vec, BM25, and RRF indexing.
  • agent-desktop's public documentation on accessibility tree snapshots.
  • Best practices for retrieval and prompt engineering in production LLM projects.
  • Technical decisions must be adapted to each company’s stack, criticality, and volume.

Next step

Want to apply AI automation to your company?

We automate repetitive processes with applied AI, agents, RAG, and integrations so your team works with less friction and more control.