strict-intent-bench demo

Failure mode

Wrong intent inference occurs when an assistant answers a plausible implied request instead of the request the user actually made.

Intervention

Strict / Precision behavior aims to reduce wrong intent inference by not making unsupported assumptions about short, quoted, corrective, or context-dependent replies.

Current weakness

The trade-off is over-clarification: strict behavior can ask too many questions when the user's selection is already clear.

Existing measured summary

Bars are drawn from the existing v0.2 run summaries (80 cases) in this repository. For both metrics, wrong intent inference and unnecessary clarification, lower is better.
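
As a rough illustration of how two lower-is-better rates like these could be computed from per-case labels, here is a minimal Python sketch. The CaseResult fields and the function names are hypothetical and are not taken from this repository's tooling, and the sample labels are made up for illustration, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    # Hypothetical per-case labels; the repository's real schema may differ.
    case_id: str
    wrong_intent: bool  # answered an implied request, not the actual one
    unnecessary_clarification: bool  # asked when the request was already clear


def rate(results, field):
    """Return the fraction of cases whose boolean label `field` is True."""
    if not results:
        raise ValueError("no results to summarize")
    return sum(getattr(r, field) for r in results) / len(results)


def summarize(results):
    """Compute both rates; lower is better for each."""
    return {
        "wrong_intent_rate": rate(results, "wrong_intent"),
        "unnecessary_clarification_rate": rate(
            results, "unnecessary_clarification"
        ),
    }


# Made-up labels for illustration only, not measured results.
example = [
    CaseResult("a", True, False),
    CaseResult("b", False, True),
    CaseResult("c", False, False),
    CaseResult("d", False, False),
]
summary = summarize(example)
```

Reporting the two rates side by side keeps the trade-off visible: a stricter configuration can lower the first rate while raising the second.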

Side-by-side examples

These examples are manually written for illustration; they are not measured API outputs. They show the failure pattern the benchmark targets, ahead of a full evaluation run.

Data sources: docs/demo_cases.json and docs/examples.json. Regenerate benchmark cases with make demo-data or python tools/export_demo_cases.py.