Overview
This workshop by SallyAnn DeLucia and Fuad Ali from Arize demonstrates how to build a prompt learning optimization loop for AI agents. Prompt learning uses textual feedback and explanations to iteratively improve system prompts, going beyond traditional optimization methods that rely on numeric scores alone. The session covers why agents fail, introduces the prompt learning methodology, and walks through a hands-on implementation of the optimization loop.
Key Takeaways
- Most agent failures stem from weak environments and instructions rather than weak models - focus on improving system prompts before reaching for fine-tuning or architecture changes
- Prompt learning leverages rich textual feedback explaining WHY outputs failed, not just binary correct/incorrect labels - use human annotations and LLM-as-judge explanations to guide prompt optimization (see the evaluator sketch after this list)
- Simple rule additions to system prompts can achieve dramatic improvements - adding engineering best practices as rules improved coding agent performance by 15% with no other changes
- Continuous optimization beats static prompts - treat prompt optimization as an ongoing process that adapts to new failure patterns over time
- Evaluator quality determines optimization success - invest equal effort in optimizing your evaluation prompts as your agent prompts since they provide the learning signal
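The explanation, not just the label, is what makes this feedback usable by an optimizer. As a rough illustration (not the workshop's actual code), a minimal LLM-as-judge evaluator might return both a pass/fail label and a short sentence on why the output failed; the prompt wording, model choice, and function names below are assumptions.

```python
# Illustrative only: names, prompt text, and model choice are assumptions,
# not the workshop's actual evaluator.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an agent's output against a task.
Task: {task}
Output: {output}
Return JSON with two fields:
  "label": "correct" or "incorrect"
  "explanation": one or two sentences explaining WHY the output passes or fails."""

def judge(task: str, output: str) -> dict:
    """LLM-as-judge: returns a label plus a textual explanation of the failure."""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(task=task, output=output)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# The explanation is the learning signal the optimizer consumes, e.g.:
# judge("Generate valid JSON for a pricing page", '{"title": "Pricing",}')
# -> {"label": "incorrect", "explanation": "Trailing comma makes the JSON invalid."}
```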
Topics Covered
- 0:00 - Introduction and Speaker Backgrounds: SallyAnn and Fuad introduce themselves and their experience building agents at Arize, setting context for the workshop
- 2:00 - Why Agents Fail Today: Core issues: weak environments/instructions, missing planning, inadequate tools, poor context engineering, and split responsibilities between technical and domain experts
- 5:00 - Prompt Learning Methodology: Comparison of reinforcement learning, meta-prompting, and prompt learning approaches. How prompt learning uses textual feedback and explanations
- 9:00 - Case Study: Coding Agent Optimization: Demonstration of 15% performance improvement on coding tasks by adding rules to system prompts, achieving GPT-4.5 level performance with GPT-4.1
- 12:00 - Addressing Overfitting Concerns: Why ‘overfitting’ to specific domains is actually expertise building, and how train/test splits ensure generalization
- 14:00 - Benchmarking Against Other Methods: Comparison with other GenAI prompt optimization techniques, showing prompt learning's superior performance in fewer iterations
- 16:00 - Importance of Evaluation Quality: Co-evolving loops concept - optimizing both agent prompts and evaluator prompts for reliable feedback signals
- 23:00 - Workshop Setup and Code Walkthrough: Setting up the prompt learning repository, OpenAI API keys, and beginning the hands-on JSON webpage generation example
- 28:00 - Configuration and Data Preparation: Configuring sample sizes, train/test splits, optimization loops, and examining the training dataset structure (see the configuration sketch after this list)
- 37:00 - Building Evaluators and System Prompts: Creating LLM-as-judge evaluators for output assessment and rule checking, setting up initial system prompts
- 43:00 - Optimization Loop Implementation: Core three-part process, sketched below: generate and evaluate, train and optimize, iterate until threshold met or max loops reached
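For the configuration step (28:00), the knobs discussed are sample size, train/test split, and loop limits. A hedged sketch of what such a configuration might look like; the names and values are invented for illustration, not the repository's actual settings.

```python
# Hypothetical configuration for the optimization run; names and values are
# illustrative, not the repository's actual settings.
import random

CONFIG = {
    "sample_size": 100,      # number of examples drawn from the dataset
    "train_fraction": 0.7,   # train/test split so improvements are checked on held-out data
    "max_loops": 5,          # stop after this many optimization iterations
    "score_threshold": 0.9,  # stop early once the evaluator pass rate reaches this
}

def train_test_split(examples, train_fraction, seed=42):
    """Shuffle and split examples so the optimized prompt is scored on unseen cases."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```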
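The loop itself (43:00) follows the three-part process named above: generate outputs and evaluate them, feed the failures and their explanations to an optimizer prompt that rewrites the system prompt, and repeat until the threshold is met or the loop budget runs out. A minimal sketch, assuming the `client`, `judge`, and `CONFIG` from the previous snippets plus a hypothetical `run_agent` helper; none of this is the workshop repository's actual API.

```python
# Minimal sketch of the three-part loop: generate & evaluate, train & optimize,
# iterate until the threshold is met or the loop budget is exhausted.
# run_agent(), judge(), client, and CONFIG are the hypothetical pieces defined
# (or assumed) above, not the workshop repository's actual API.

OPTIMIZER_PROMPT = """You are improving a system prompt.
Current system prompt:
{system_prompt}
Failures (task, output, explanation of why it failed):
{failures}
Rewrite the system prompt, adding rules that would prevent these failures."""

def optimize_prompt(system_prompt: str, failures: list[dict]) -> str:
    """Ask an LLM to rewrite the system prompt using textual failure explanations."""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": OPTIMIZER_PROMPT.format(
            system_prompt=system_prompt,
            failures="\n".join(str(f) for f in failures),
        )}],
    )
    return response.choices[0].message.content

def optimization_loop(system_prompt: str, train_examples: list[dict]) -> str:
    for _ in range(CONFIG["max_loops"]):
        # 1. Generate and evaluate
        results = [
            {"task": ex["task"], "output": run_agent(system_prompt, ex["task"])}
            for ex in train_examples
        ]
        for r in results:
            r.update(judge(r["task"], r["output"]))
        failures = [r for r in results if r["label"] == "incorrect"]
        score = 1 - len(failures) / len(results)
        # 2. Stop once the evaluator pass rate clears the threshold
        if score >= CONFIG["score_threshold"]:
            break
        # 3. Train and optimize: rewrite the prompt from the failure explanations
        system_prompt = optimize_prompt(system_prompt, failures)
    return system_prompt
```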