Data Science x AI Series I: Evaluate…

Jun 16

Why, What and How

2 Comments

One thing I'd like to see more of in general (not just from you but across the industry) is a move away from reporting only noisy point estimates (e.g., Claude scored x, ChatGPT scored y on task z) and towards including standard errors or other measures of uncertainty, building towards statements such as "there is (or is not) a 'statistically significant' difference between the models".
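To make the idea concrete, here is a minimal sketch of what reporting uncertainty alongside a score could look like. It assumes each model is scored as correct/incorrect on an independent sample of items, uses the binomial standard error, and compares two models with a pooled two-proportion z-test. The function names and the counts in the usage lines are hypothetical and purely illustrative. If both models are scored on the same items, a paired test (e.g., McNemar's) would likely be the better choice.

```python
import math
from statistics import NormalDist


def accuracy_with_se(correct, total):
    """Return (accuracy, standard error) for a pass/fail benchmark score."""
    p = correct / total
    # Binomial standard error of a proportion.
    se = math.sqrt(p * (1 - p) / total)
    return p, se


def two_proportion_z_test(correct_a, total_a, correct_b, total_b):
    """Two-sided z-test for a difference in accuracy between two models.

    Assumes the two samples are independent; the pooled proportion must be
    strictly between 0 and 1 or the standard error is zero.
    """
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    pooled = (correct_a + correct_b) / (total_a + total_b)
    se_diff = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se_diff
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts, for illustration only.
acc_a, se_a = accuracy_with_se(412, 500)
acc_b, se_b = accuracy_with_se(398, 500)
z, p_value = two_proportion_z_test(412, 500, 398, 500)
print(f"Model A: {acc_a:.3f} +/- {se_a:.3f}")
print(f"Model B: {acc_b:.3f} +/- {se_b:.3f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

Reporting the score as "accuracy +/- standard error" already shows a reader whether a gap between two models is large relative to the noise; the z-test just turns that comparison into a single number.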



Stella Liu

Jun 17

Love it! I asked an LLM evaluation startup the same question two weeks ago. I am also exploring "statistical significance" for sample size estimation. I'd love to share what I've learnt and look forward to hearing from others!
