Discussion about this post

User's avatar
Neal's avatar

One thing I'd like to see more of in general (not just from you but across the industry) is moving away from only reporting noisy point estimates (eg Claude scored x, ChatGPT scored y on task z) and towards including std errors or other measures of uncertainty (building towards eg "there is (not) a 'statistically' 'significant' difference between models" etc).

Expand full comment
1 more comment...

No posts