Labs

What we're trying between engagements.

Internal experiments. Some land in client work. Some get shelved. We share them so you can see how we think about the problems we're paid to solve — and how we keep getting better at them.

Cross-tenant evaluation suite
In progress
A single eval harness that scores AI agents across anonymized client deployments — same test cases, different prompts and contexts. Lets us catch drift across the whole portfolio in one place.
Why: Drift is the boring failure mode that kills agents. Centralized eval lets a 6-person team watch 20+ agents responsibly.
Synthetic conversation generator
In progress
Programmatically generate adversarial conversations for support agents — confused users, multi-turn manipulation, off-policy requests. Used to grow eval suites without waiting for production failures.
Why: 20 hand-picked test cases cover the obvious. Synthetic generation extends coverage into the long tail.
Open-source on-prem reference stack
In progress
End-to-end on-prem AI stack — vLLM for inference, pgvector for retrieval, n8n for orchestration, Langfuse for observability. We deploy variants of this for clients who can't send data to a cloud LLM.
Why: On-prem AI is harder than people admit. Having a reference deployment shortens client engagements meaningfully.
Agent cost telemetry SDK
Shipped
Lightweight TS/Python library that wraps Anthropic + OpenAI calls and emits token cost per conversation, per user, per workflow to your observability tool. We use it in every engagement now.
Why: Clients deserve to know what every conversation cost. Most observability tools don't make that easy.
Multi-modal QC for SMB manufacturing
In progress
YOLO + vision-LLM hybrid for low-volume QC — uses YOLO for the common defects (fast, cheap) and a vision-LLM for the uncertain cases (slower, smarter). Pilot running in two plants.
Why: Pure YOLO needs too much labeled data for SMB volumes. Pure VLM is too slow for production lines. Hybrid is the right answer.
Local-first agent demos
Shelved
Idea: tiny agents that run fully in-browser via WebGPU for prospects who want to test without sending data anywhere. Shelved because the model quality at WebGPU-sized inference isn't good enough yet.
Why: We'll revisit when the local model quality catches up — probably late 2026.
Agent contract testing
In progress
Treating agents like services with contracts. Schema-validated tool calls, OpenAPI-style interface docs, contract tests in CI that fail builds if the agent breaks its interface.
Why: Agents-as-services scales better than agents-as-magic. The boring infrastructure pattern is the right one.