I’m aiming to publish a full series by the end of 2025, all focused on how we can evaluate LLM‑powered products effectively. I’ll be adding links and digging deeper as each topic unfolds—and I'd love your input!
What would you like to see covered?
Leave a comment below with your thoughts—or drop me a message on Substack or LinkedIn if you’re also exploring LLM or GenAI evaluation. It’d be awesome to connect and learn from each other!
Proposed Series Outline
Metrics
LLM-Guided Evaluation & DeepEval
Curate Your Own Benchmark Datasets
Can Red Teaming Be Automated?
Human in the Loop
The Ultimate Test: Field Experiment
Pre-Deployment Evaluation: Combining Automated & Human Evaluations
Real-Time Monitoring & Alerting
User Sentiment Analysis
This series from Data Science x AI is free. If you find the posts valuable, becoming a paid subscriber is a lovely way to show support! ❤
One thing I'd like to see more of in general (not just from you, but across the industry) is a move away from reporting only noisy point estimates (e.g. Claude scored x and ChatGPT scored y on task z) and towards including standard errors or other measures of uncertainty, building up to claims like "there is (or isn't) a statistically significant difference between these models."
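To make that concrete, here's a minimal sketch (my own illustration, not part of the series) of how a paired bootstrap over per-prompt scores yields a standard error and confidence interval for the difference between two models evaluated on the same prompts. The scores below are random placeholders, not real benchmark results.

```python
import numpy as np

def paired_bootstrap_diff(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap for the difference in mean score between two models
    evaluated on the same prompts (per-item scores, e.g. 0/1 correctness)."""
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    diffs = scores_a - scores_b          # keep the pairing per prompt
    observed = diffs.mean()

    # Resample prompts with replacement; each row is one bootstrap replicate.
    n = len(diffs)
    idx = rng.integers(0, n, size=(n_resamples, n))
    boot = diffs[idx].mean(axis=1)

    se = boot.std(ddof=1)                         # standard error of the difference
    ci_low, ci_high = np.percentile(boot, [2.5, 97.5])  # 95% percentile interval
    return observed, se, (ci_low, ci_high)

# Hypothetical per-prompt scores for two models on the same 200-prompt task.
model_a = np.random.default_rng(1).integers(0, 2, size=200)
model_b = np.random.default_rng(2).integers(0, 2, size=200)

diff, se, ci = paired_bootstrap_diff(model_a, model_b)
print(f"Model A - Model B: {diff:+.3f} ± {se:.3f} (95% CI {ci[0]:+.3f} to {ci[1]:+.3f})")
```

If the confidence interval comfortably excludes zero, that's evidence the gap isn't just evaluation noise; if it straddles zero, a headline like "Model A beats Model B" is hard to justify from that benchmark alone.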