Using DSPy to evaluate and improve Datasette Agent's SQL system prompts
Simon Willison used the DSPy framework to evaluate and refine the production system prompts behind Datasette Agent's read-only SQL question answerer, the feature that executes read-only SQL queries to answer user questions about data. The project followed an AIE keynote on DSPy and ran as an asynchronous research task in Claude Code for web with Claude Fable 5.

The methodology used a harness in which DSPy agents invoked Datasette Agent's actual tool implementations and prompts against a live in-process Datasette. An auto-generated gold-standard dataset, scored with custom metrics, provided the evaluation. Fable chose to test with GPT 4.1 mini and nano and identified several promising directions for improvement.

A key finding was that the schema listing provided to the agent gave only table names. Combined with the prompt's advice "don't call describe_table if you already have the information", this led the model to guess column names such as `page_count`, `o.order_id`, and `first_name`, which produced error-retry loops in the baseline traces. The research suggested two possible fixes: include column names in the prompt's schema listing, or soften the advice about `describe_table`.
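The first suggested fix, including column names in the schema listing, can be sketched with Python's standard `sqlite3` module. This is a minimal illustration and not Datasette Agent's actual code: the function name `schema_listing`, the `table(col, ...)` output format, and the example tables are all assumptions made for the sketch.

```python
import sqlite3


def schema_listing(conn: sqlite3.Connection) -> str:
    """Return one line per table in the form `table(col1, col2, ...)`.

    Hypothetical helper: the name and output format are assumptions,
    not part of Datasette Agent.
    """
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    lines = []
    for table in names:
        # Skip SQLite's internal bookkeeping tables.
        if table.startswith("sqlite_"):
            continue
        # Double any embedded quotes so the identifier is safe to interpolate.
        quoted = table.replace('"', '""')
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{quoted}")')]
        lines.append(f"{table}({', '.join(columns)})")
    return "\n".join(lines)


# Hypothetical example database, not taken from the research.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, pages INTEGER);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, book_id INTEGER, quantity INTEGER);
    """
)
listing = schema_listing(conn)
print(listing)
```

With a listing like this in the prompt, the model can read the real column name (`pages` in this made-up schema) instead of guessing one, at the cost of a longer prompt for databases with many tables.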
Developers can use DSPy to systematically evaluate and improve AI agent prompts and to surface failure modes such as column-name guessing.