2 Comments
Neal

One thing I'd like to see more of in general (not just from you but across the industry) is moving away from reporting only noisy point estimates (e.g., Claude scored x, ChatGPT scored y on task z) and towards including standard errors or other measures of uncertainty (building towards, e.g., "there is (not) a 'statistically' 'significant' difference between models", etc.).
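
To make the point above concrete, here is a minimal sketch of reporting a benchmark score with a standard error and then testing whether two models differ. It assumes each score is a simple pass/fail accuracy and that the two models were evaluated on independent samples of items; the function names and the example counts are hypothetical. If both models are scored on the same items, a paired test such as McNemar's would be more appropriate.

```python
import math


def accuracy_with_se(correct: int, total: int) -> tuple[float, float]:
    """Return (accuracy, standard error) for a pass/fail benchmark."""
    p = correct / total
    # Binomial standard error of a proportion.
    se = math.sqrt(p * (1 - p) / total)
    return p, se


def two_proportion_z(correct_a: int, total_a: int,
                     correct_b: int, total_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference between two accuracies.

    Assumes the two samples are independent (an assumption of this sketch).
    """
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (correct_a + correct_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: model A gets 90/100 right, model B gets 70/100.
acc, se = accuracy_with_se(90, 100)
z, p_value = two_proportion_z(90, 100, 70, 100)
print(f"accuracy = {acc:.2f} +/- {se:.3f}, z = {z:.2f}, p = {p_value:.4f}")
```

Reporting the score as "accuracy +/- standard error" already tells a reader how much of a gap between two models could be noise, even before any formal test is run.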

Stella Liu

Love it! I asked the same question of an LLM evaluation startup two weeks ago. I am also exploring "statistical significance" for sample size estimation. I'd love to share what I've learnt and look forward to hearing from others!
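
As one possible illustration of sample size estimation for this kind of comparison, the sketch below uses the standard normal-approximation formula for comparing two proportions. It assumes pass/fail scoring, independent evaluation sets for the two models, and a two-sided test; the function name, the default significance level and power, and the example accuracies are all hypothetical choices, not anything stated in the comment.

```python
import math
from statistics import NormalDist


def samples_per_model(p_a: float, p_b: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate number of items needed per model to detect p_a vs p_b.

    Uses the normal approximation for a two-sided two-proportion test.
    """
    if p_a == p_b:
        raise ValueError("p_a and p_b must differ to estimate a sample size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    p_bar = (p_a + p_b) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    ) ** 2
    return math.ceil(numerator / (p_a - p_b) ** 2)


# Hypothetical scenario: distinguishing 80% accuracy from 75% accuracy.
print(samples_per_model(0.80, 0.75))
```

The required sample size grows roughly with the inverse square of the gap, so halving the difference you want to detect needs about four times as many evaluation items.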
