One thing I'd like to see more of in general (not just from you but across the industry) is moving away from only reporting noisy point estimates (e.g. Claude scored x, ChatGPT scored y on task z) and towards including standard errors or other measures of uncertainty (building towards e.g. "there is (not) a 'statistically' 'significant' difference between models" etc).
Love it! I asked the same question to an LLM evaluation startup two weeks ago. I am also exploring statistical significance for sample-size estimation. Happy to share what I've learnt, and looking forward to hearing from others!
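The suggestion above can be sketched in a few lines. This is a minimal illustration, not a full treatment: it assumes each model is scored pass/fail on independent prompts (a binomial model), ignores any paired structure between the two models' answers, and uses hypothetical scores and function names.

```python
import math

def accuracy_se(p_hat: float, n: int) -> float:
    """Standard error of an accuracy estimate under a binomial model."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

def two_model_z(p_a: float, p_b: float, n_a: int, n_b: int) -> float:
    """Two-proportion z statistic for the difference between two models'
    accuracies on independent samples (ignores paired/question structure)."""
    pooled = (p_a * n_a + p_b * n_b) / (n_a + n_b)
    se_diff = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se_diff

# Hypothetical example: model A scores 0.78, model B scores 0.74,
# each evaluated on n = 500 prompts.
z = two_model_z(0.78, 0.74, 500, 500)
significant = abs(z) > 1.96  # two-sided test at the 5% level
```

With these numbers the gap of 4 points is not significant at n = 500, which is exactly the kind of caveat a point estimate alone hides. If the same prompts are given to both models, a paired test (e.g. McNemar's) would be more appropriate and usually more powerful.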