Stack
Proof metrics
Evaluation scorecard
Production eval dimensions — how this system is judged before and after changes ship.
Test set of damaged, faded, and angled warehouse labels from real captures
Structured reconciliation cases against ERP master data
Long-horizon walk sessions with intentional interruption and retry
Selective high-effort reasoning only on variance paths; routine OCR stays cost-efficient
Problem
Warehouse audits often run for weeks with multiple field staff and heavy manual reconciliation in spreadsheets.
Damaged, faded, or angled labels fail frequently on traditional handheld scanners, creating repeated variance loops.
Audit workflows require sustained context across long sessions, not isolated one-shot API calls.
Solution
Built a stateful agent workflow for end-to-end warehouse walk sessions with resumable progress.
Used a high-fidelity vision model to extract bin and label data from low-quality real-world images.
Applied selective high-effort reasoning only for variance classification while keeping routine OCR paths cost-efficient.
Added self-verification before report output to improve confidence for enterprise audit handoff.
Technical deep-dive
Stateful session design
Warehouse audits are not single-turn Q&A. A walk spans dozens of bins, intermittent connectivity, and operator pauses. WaybillAgent models the audit as a resumable session with explicit workflow states: capture, extract, lookup, reconcile, tag variance, and report.
Each state has clear entry and exit criteria and persists its progress, so operators can stop mid-aisle and continue without losing context. This is a production requirement that demo-only agents often ignore.
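The resumable session described above could be sketched roughly as follows. This is a minimal illustration, not the WaybillAgent implementation: the `WalkSession` class, its fields, the per-bin loop back to capture, and the JSON file used for persistence are all assumptions made for the example. Only the six workflow state names come from the text.

```python
import json
from dataclasses import dataclass, field, asdict
from enum import Enum
from pathlib import Path


class State(str, Enum):
    CAPTURE = "capture"
    EXTRACT = "extract"
    LOOKUP = "lookup"
    RECONCILE = "reconcile"
    TAG_VARIANCE = "tag_variance"
    REPORT = "report"


# Assumed per-bin order; after TAG_VARIANCE the walk moves to the next bin
# or, when no bins remain, to REPORT.
BIN_STEPS = [State.CAPTURE, State.EXTRACT, State.LOOKUP,
             State.RECONCILE, State.TAG_VARIANCE]


@dataclass
class WalkSession:  # hypothetical name and fields
    session_id: str
    bins: list
    bin_index: int = 0
    state: State = State.CAPTURE
    results: dict = field(default_factory=dict)

    def advance(self, payload=None):
        """Record the current state's output, then move to the next state."""
        if self.state is State.REPORT:
            raise RuntimeError("session already complete")
        bin_id = self.bins[self.bin_index]
        self.results.setdefault(bin_id, {})[self.state.value] = payload
        if self.state is State.TAG_VARIANCE:
            self.bin_index += 1
            more_bins = self.bin_index < len(self.bins)
            self.state = State.CAPTURE if more_bins else State.REPORT
        else:
            self.state = BIN_STEPS[BIN_STEPS.index(self.state) + 1]

    def save(self, path):
        """Persist the whole session so a paused walk can be resumed."""
        data = asdict(self)
        data["state"] = self.state.value
        Path(path).write_text(json.dumps(data))

    @classmethod
    def load(cls, path):
        data = json.loads(Path(path).read_text())
        data["state"] = State(data["state"])
        return cls(**data)
```

Calling `save` after every `advance` would mean an operator who loses connectivity mid-aisle resumes at the exact bin and step where the walk stopped, rather than restarting the aisle.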
Selective reasoning and cost tradeoffs
Not every bin needs Opus-level reasoning. Routine label extraction runs on a cost-efficient vision path; high-effort reasoning activates only when variance classification or ambiguous reconciliation requires it.
This pattern keeps per-walk token spend predictable while preserving accuracy on the cases that actually block audit sign-off.
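One way to picture the routing decision is a small gate that runs after routine extraction and ERP lookup. The sketch below is illustrative only: `BinReading`, `choose_effort`, the quantity comparison, and the 0.85 confidence threshold are hypothetical stand-ins, not the system's actual rules or values.

```python
from dataclasses import dataclass


@dataclass
class BinReading:  # hypothetical record produced by the routine path
    bin_id: str
    extracted_qty: int
    erp_qty: int
    ocr_confidence: float  # assumed 0.0-1.0 score from extraction


def choose_effort(reading, min_confidence=0.85):
    """Return "high" only for variance or ambiguous cases, else "routine"."""
    if reading.extracted_qty != reading.erp_qty:
        return "high"  # variance: needs classification
    if reading.ocr_confidence < min_confidence:
        return "high"  # ambiguous reconciliation
    return "routine"


def plan_walk(readings, min_confidence=0.85):
    """Map each bin to the reasoning tier it should be sent to."""
    return {r.bin_id: choose_effort(r, min_confidence) for r in readings}
```

Because the gate is a cheap deterministic check, the share of bins escalated to high-effort reasoning can be counted per walk, which is what makes token spend predictable.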
Self-verification before handoff
Before generating the variance report, the agent runs a self-verification pass that cross-checks extracted codes against lookup results, flags low-confidence extractions, and surfaces items that need human review.
Enterprise audit handoff cannot tolerate silent failures — verification checkpoints are part of the workflow, not an afterthought.
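The verification pass can be thought of as a pure function over the session's extractions and lookup results. The following is a sketch under stated assumptions: the `Extraction` record, the shape of `lookup` (a mapping from label code to expected bin), and the 0.9 threshold are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Extraction:  # hypothetical record for one extracted label
    bin_id: str
    code: str
    confidence: float  # assumed 0.0-1.0 extraction score


def verify(extractions, lookup, min_confidence=0.9):
    """Return (bin_id, reasons) pairs for items that need human review.

    `lookup` is assumed to map a label code to the bin where the
    master data expects it.
    """
    review = []
    for e in extractions:
        reasons = []
        if e.code not in lookup:
            reasons.append("code not found in lookup")
        elif lookup[e.code] != e.bin_id:
            reasons.append("code mapped to a different bin")
        if e.confidence < min_confidence:
            reasons.append("low-confidence extraction")
        if reasons:
            review.append((e.bin_id, reasons))
    return review
```

An empty result means every extraction agreed with the lookup at acceptable confidence; anything else is listed with explicit reasons, so a failure is never silent in the report.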
Architecture
Capture layer: image capture from a phone or Meta glasses during the aisle walkthrough.
Interpretation layer: Claude Opus vision + extraction pipelines for labels and bin codes.
Agent layer: managed multi-step session coordinating scan, lookup, reconciliation, and variance tagging.
Data layer: ERP/master-data reconciliation plus structured variance report output.
Outcomes
Proved a practical AI-first audit workflow that can run in real warehouse conditions in Nairobi.
Demonstrated operational viability for long-horizon agent sessions and resume/retry behavior.
Established a flagship product proof for forward-deployed AI engineering in East African enterprise environments.
Links & artifacts
Related work
AIDC Barcode Toolkit
Open-source toolkit that packages real-world AIDC workflows so Claude Code can generate, validate, and reason about barcode and labeling tasks with domain-correct defaults.
Outcome-Driven Agent Evaluation (Hive)
Exploration and extension of the Hive framework for outcome-driven agent development, focusing on how teams iterate when success is measured by business results rather than single-turn benchmarks.
Discuss this work
Hiring or building something similar—reach out with context and constraints.
Email Joseph