Overview
This workshop by SallyAnn DeLucia and Fuad Ali from Arize demonstrates how to build a prompt learning optimization loop for AI agents. Prompt learning uses textual feedback and explanations to iteratively improve system prompts, going beyond traditional optimization methods that rely on numeric scores alone. The session covers why agents fail, introduces the prompt learning methodology, and walks through a hands-on implementation of the optimization loop.
Key Takeaways
- Most agent failures stem from weak environments and instructions rather than weak models - focus on improving system prompts before reaching for fine-tuning or architecture changes
- Prompt learning leverages rich textual feedback explaining WHY outputs failed, not just binary correct/incorrect labels - use human annotations and LLM-as-judge explanations to guide prompt optimization (see the evaluator sketch after this list)
- Simple rule additions to system prompts can achieve dramatic improvements - adding engineering best practices as rules improved coding agent performance by 15% with no other changes
- Continuous optimization beats static prompts - treat prompt optimization as an ongoing process that adapts to new failure patterns over time
- Evaluator quality determines optimization success - invest equal effort in optimizing your evaluation prompts as your agent prompts since they provide the learning signal
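The explanation, not just the label, is what makes this feedback usable by an optimizer. As a rough illustration (not the workshop's actual code), a minimal LLM-as-judge evaluator might return both a pass/fail label and a short sentence on why the output failed; the prompt wording, model choice, and function names below are assumptions.

```python
# Illustrative only: names, prompt text, and model choice are assumptions,
# not the workshop's actual evaluator.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an agent's output against a task.
Task: {task}
Output: {output}
Return JSON with two fields:
  "label": "correct" or "incorrect"
  "explanation": one or two sentences explaining WHY the output passes or fails."""

def judge(task: str, output: str) -> dict:
    """LLM-as-judge: returns a label plus a textual explanation of the failure."""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(task=task, output=output)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# The explanation is the learning signal the optimizer consumes, e.g.:
# judge("Generate valid JSON for a pricing page", '{"title": "Pricing",}')
# -> {"label": "incorrect", "explanation": "Trailing comma makes the JSON invalid."}
```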
Topics Covered
- 0:00 - Introduction and Speaker Backgrounds: SallyAnn and Fuad introduce themselves and their experience building agents at Arize, setting context for the workshop
- 2:00 - Why Agents Fail Today: Core issues: weak environments/instructions, missing planning, inadequate tools, poor context engineering, and split responsibilities between technical and domain experts
- 5:00 - Prompt Learning Methodology: Comparison of reinforcement learning, meta-prompting, and prompt learning approaches. How prompt learning uses textual feedback and explanations
- 9:00 - Case Study: Coding Agent Optimization: Demonstration of 15% performance improvement on coding tasks by adding rules to system prompts, achieving GPT-4.5 level performance with GPT-4.1
- 12:00 - Addressing Overfitting Concerns: Why ‘overfitting’ to specific domains is actually expertise building, and how train/test splits ensure generalization
- 14:00 - Benchmarking Against Other Methods: Comparison with other GenAI prompt optimization techniques, showing prompt learning's superior performance in fewer iterations
- 16:00 - Importance of Evaluation Quality: Co-evolving loops concept - optimizing both agent prompts and evaluator prompts for reliable feedback signals
- 23:00 - Workshop Setup and Code Walkthrough: Setting up the prompt learning repository, OpenAI API keys, and beginning the hands-on JSON webpage generation example
- 28:00 - Configuration and Data Preparation: Configuring sample sizes, train/test splits, optimization loops, and examining the training dataset structure (see the configuration sketch after this list)
- 37:00 - Building Evaluators and System Prompts: Creating LLM-as-judge evaluators for output assessment and rule checking, setting up initial system prompts
- 43:00 - Optimization Loop Implementation: Core three-part process, sketched below: generate and evaluate, train and optimize, iterate until threshold met or max loops reached
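For the configuration step (28:00), the knobs discussed are sample size, train/test split, and loop limits. A hedged sketch of what such a configuration might look like; the names and values are invented for illustration, not the repository's actual settings.

```python
# Hypothetical configuration for the optimization run; names and values are
# illustrative, not the repository's actual settings.
import random

CONFIG = {
    "sample_size": 100,      # number of examples drawn from the dataset
    "train_fraction": 0.7,   # train/test split so improvements are checked on held-out data
    "max_loops": 5,          # stop after this many optimization iterations
    "score_threshold": 0.9,  # stop early once the evaluator pass rate reaches this
}

def train_test_split(examples, train_fraction, seed=42):
    """Shuffle and split examples so the optimized prompt is scored on unseen cases."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```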
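The loop itself (43:00) follows the three-part process named above: generate outputs and evaluate them, feed the failures and their explanations to an optimizer prompt that rewrites the system prompt, and repeat until the threshold is met or the loop budget runs out. A minimal sketch, assuming the `client`, `judge`, and `CONFIG` from the previous snippets plus a hypothetical `run_agent` helper; none of this is the workshop repository's actual API.

```python
# Minimal sketch of the three-part loop: generate & evaluate, train & optimize,
# iterate until the threshold is met or the loop budget is exhausted.
# run_agent(), judge(), client, and CONFIG are the hypothetical pieces defined
# (or assumed) above, not the workshop repository's actual API.

OPTIMIZER_PROMPT = """You are improving a system prompt.
Current system prompt:
{system_prompt}
Failures (task, output, explanation of why it failed):
{failures}
Rewrite the system prompt, adding rules that would prevent these failures."""

def optimize_prompt(system_prompt: str, failures: list[dict]) -> str:
    """Ask an LLM to rewrite the system prompt using textual failure explanations."""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": OPTIMIZER_PROMPT.format(
            system_prompt=system_prompt,
            failures="\n".join(str(f) for f in failures),
        )}],
    )
    return response.choices[0].message.content

def optimization_loop(system_prompt: str, train_examples: list[dict]) -> str:
    for _ in range(CONFIG["max_loops"]):
        # 1. Generate and evaluate
        results = [
            {"task": ex["task"], "output": run_agent(system_prompt, ex["task"])}
            for ex in train_examples
        ]
        for r in results:
            r.update(judge(r["task"], r["output"]))
        failures = [r for r in results if r["label"] == "incorrect"]
        score = 1 - len(failures) / len(results)
        # 2. Stop once the evaluator pass rate clears the threshold
        if score >= CONFIG["score_threshold"]:
            break
        # 3. Train and optimize: rewrite the prompt from the failure explanations
        system_prompt = optimize_prompt(system_prompt, failures)
    return system_prompt
```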