I’m aiming to publish a full series by the end of 2025, all focused on how we can evaluate LLM‑powered products effectively. I’ll be adding links and digging deeper as each topic unfolds—and I'd love your input!
What would you like to see covered?
Leave a comment below with your thoughts—or drop me a message on Substack or LinkedIn if you’re also exploring LLM or GenAI evaluation. It’d be awesome to connect and learn from each other!
Proposed Series Outline
Metrics
LLM-Guided Evaluation & DeepEval
Curate Your Own Benchmark Datasets
Can Red Teaming Be Automated?
Human in the Loop
The Ultimate Test: Field Experiment
Pre-Deployment Evaluation: Combining Automated & Human Evaluations
Real-Time Monitoring & Alerting
User Sentiment Analysis
This series from Data Science x AI is free. If you find the posts valuable, becoming a paid subscriber is a lovely way to show support! ❤
One thing I'd like to see more of in general (not just from you, but across the industry) is a move away from reporting only noisy point estimates (e.g. Claude scored x and ChatGPT scored y on task z) and towards including standard errors or other measures of uncertainty, building up to claims like "there is (or isn't) a statistically significant difference between these models."
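To make that concrete, here's a minimal sketch (my own illustration, not part of the series) of how a paired bootstrap over per-prompt scores yields a standard error and confidence interval for the difference between two models evaluated on the same prompts. The scores below are random placeholders, not real benchmark results.

```python
import numpy as np

def paired_bootstrap_diff(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap for the difference in mean score between two models
    evaluated on the same prompts (per-item scores, e.g. 0/1 correctness)."""
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    diffs = scores_a - scores_b          # keep the pairing per prompt
    observed = diffs.mean()

    # Resample prompts with replacement; each row is one bootstrap replicate.
    n = len(diffs)
    idx = rng.integers(0, n, size=(n_resamples, n))
    boot = diffs[idx].mean(axis=1)

    se = boot.std(ddof=1)                         # standard error of the difference
    ci_low, ci_high = np.percentile(boot, [2.5, 97.5])  # 95% percentile interval
    return observed, se, (ci_low, ci_high)

# Hypothetical per-prompt scores for two models on the same 200-prompt task.
model_a = np.random.default_rng(1).integers(0, 2, size=200)
model_b = np.random.default_rng(2).integers(0, 2, size=200)

diff, se, ci = paired_bootstrap_diff(model_a, model_b)
print(f"Model A - Model B: {diff:+.3f} ± {se:.3f} (95% CI {ci[0]:+.3f} to {ci[1]:+.3f})")
```

If the confidence interval comfortably excludes zero, that's evidence the gap isn't just evaluation noise; if it straddles zero, a headline like "Model A beats Model B" is hard to justify from that benchmark alone.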