Lobsters · Sunday · July 5, 2026 · FREE

Better Models: Worse Tools

claude · tool-calling · agents · llm-reliability

Armin Ronacher describes a bug where newer Claude models (Opus 4.8 and Sonnet 5) generate malformed tool calls for Pi's edit tool, adding extra keys like "requireUnique" to the edits[] array. The edit content itself is usually correct, but the arguments violate the schema, causing Pi to reject the call and request a retry. This behavior is not observed in older Anthropic models, suggesting a regression in tool-calling reliability with newer SOTA models. Ronacher explains that tool calls are generated as text via in-band signaling, with the model emitting a structured format (resembling XML with JSON for complex parameters). Without grammar-aware constrained decoding, the model merely follows learned conventions and can invent invalid keys. The post highlights that this issue is specific to the edit tool's nested array schema and that the problem is worsening with model updates, not improving.
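To make the failure mode concrete, here is a minimal sketch of the kind of strict argument check that would reject such a call. It is not Pi's actual validator: the field names `oldText` and `newText` and the function `validate_edit_args` are hypothetical, chosen only for illustration. Only the `edits[]` array and the stray `requireUnique` key come from the summary above.

```python
# Hypothetical schema: the real field names of Pi's edit tool are not given
# in the summary, so "oldText" and "newText" are assumptions.
ALLOWED_EDIT_KEYS = {"oldText", "newText"}


def validate_edit_args(args: dict) -> list[str]:
    """Return a list of schema errors; an empty list means the call is valid."""
    edits = args.get("edits")
    if not isinstance(edits, list):
        return ["edits must be an array"]

    errors = []
    for i, edit in enumerate(edits):
        if not isinstance(edit, dict):
            errors.append(f"edits[{i}] must be an object")
            continue
        # Strict mode: any key outside the schema is an error, even when the
        # edit content itself is correct.
        for key in sorted(set(edit) - ALLOWED_EDIT_KEYS):
            errors.append(f"edits[{i}]: unexpected key '{key}'")
        for key in sorted(ALLOWED_EDIT_KEYS - set(edit)):
            errors.append(f"edits[{i}]: missing key '{key}'")
    return errors


# A call whose edit content is fine but which carries an invented key.
malformed_call = {
    "edits": [{"oldText": "foo", "newText": "bar", "requireUnique": True}]
}
call_errors = validate_edit_args(malformed_call)
```

Under these assumptions the check flags only the invented key, which matches the behavior described: the edit is usable, but the arguments fail the schema, so the harness rejects the call and asks for a retry. Constrained decoding would prevent the extra key from being generated in the first place, whereas a check like this can only catch it afterwards.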

// why it matters

Newer SOTA models can be worse at specific tool schemas, breaking agent reliability.

Sources

Primary · Lobsters

▸ Read original at lucumr.pocoo.org
