AI Evals is boring
Or ... is it?
Two years ago, my podcast partner Stella Liu (yes, the Stella who’s been writing this Substack newsletter for you every week) told me she was working on AI evals.
I said, “AI evals sounds boring.”
I was at a startup building AI products with the latest SOTA language models.
Meeting notetakers. Customer service agents. AI coding assistants. AI presentation makers.
I was obsessed with catching up with the latest model release, trying out the hottest AI products, staying on top of every trend.
Honestly, I thought my job was way cooler.
Did My Career Become a Gamble?
Here’s the thing about building products since the end of 2022: it’s never been easier. The OpenAI API, and later open-source pre-trained models, became easily accessible. Anyone with basic engineering skills can vibe code a working product over a weekend.
I was running on the same hamster wheel as other tech workers who transitioned from traditional roles to AI. Ship fast with the latest models. Pray it works.
Every day did feel exciting. We built new AI features with AI. It kind of worked. Then we took the new products and started chasing after customers. My day started and ended with reading the latest AI news and researching other new AI products.
However, the exhaustion crept in slowly. Not burnout exactly. But I started questioning:
Am I actually excited about every single product? Or is this just FOMO?
If anyone can build the same thing I’m building, what’s my unique value? What’s the product’s unique value?
How is chasing trends going to build a career in AI?
Let’s be honest. A lot of us simply “ship and pray” that our AI product works, hoping the AI doesn’t hallucinate something embarrassing. Then we hope customers will trust it anyway.
That’s a gamble.
I do not want my career to be a gamble.
AI Evals is not boring
The signals were there all along. I just chose to ignore them.
I talked to customers constantly at my job. And the message was consistent: they had mixed feelings about AI products. They wanted to leverage AI capabilities without jeopardizing their reputation.
It took me two years to realize what should have been obvious from day one.
Building AI products is still just… building products. I had to come back to the basics: customer trust is still the most important thing, with or without AI. That’s the lesson I learned after struggling through multiple AI products and countless customer calls. Without demonstrating the reliability of your AI features, you can’t earn customer trust.
It’s not just me. This is the trust gap that’s slowing AI adoption across industries. Whether a product has rigorous evals in place is a differentiator for organizations small and large.
At small startups, teams may think they can keep shipping quickly without evaluations and analytics, until their first big customer asks whether they can control their product’s uncertainty. At that point, the only meaningful difference between you and your competitor is who can prove their product’s reliability.
At large corporations, unless you can prove your AI product or feature works reliably, management won’t risk the company’s reputation by letting you release it to the public. To get buy-in at your organization and take AI products beyond the prototype phase, pre-release evals and post-production analytics are key.
Turns out AI evals is not boring. It is now the key differentiator between a working product and a lasting product.
Getting Off the Hamster Wheel
At a personal level, coming from a data science background, AI evals feels quite familiar. Just like data science, AI evals work is fundamentally cross-disciplinary, which I’ve always enjoyed and which I believe is one of the best ways to increase your impact and visibility in an organization.
This list is definitely not comprehensive, but you need to understand:
What users actually need (product)
What could go wrong (domain expertise)
How to measure product reliability and performance before and after release (data/ML)
What the stakes are (business/legal)
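To make the “measure product reliability” bullet concrete, here is a minimal sketch of what a pre-release eval can look like: a set of prompts, a pass/fail check for each, and a pass rate. Everything here is hypothetical and illustrative (the `model_answer` stub, the cases, the checks); a real harness would call your actual model API and use far more cases and graders.

```python
def model_answer(prompt: str) -> str:
    """Stand-in for a real model call; replace with your API client."""
    canned = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "Paris",
    }
    return canned.get(prompt, "I don't know")

# Each eval case pairs a prompt with a simple pass/fail check on the output.
EVAL_CASES = [
    ("What is 2 + 2?", lambda out: "4" in out),
    ("What is the capital of France?", lambda out: "paris" in out.lower()),
]

def run_evals(cases):
    """Run every case through the model and return per-case results + pass rate."""
    results = [(prompt, check(model_answer(prompt))) for prompt, check in cases]
    pass_rate = sum(ok for _, ok in results) / len(results)
    return results, pass_rate

if __name__ == "__main__":
    results, rate = run_evals(EVAL_CASES)
    for prompt, ok in results:
        print(f"{'PASS' if ok else 'FAIL'}  {prompt}")
    print(f"pass rate: {rate:.0%}")
```

Even a toy loop like this changes the conversation: instead of “we shipped it,” you can say “it passes N of M cases we care about,” and track that number across model and prompt changes.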
I just wish someone had told me that two years ago.
But maybe I wouldn’t have listened. Sometimes you need to run the hamster wheel yourself to realize you’re going nowhere.
Now when people tell me they’re building AI products, I don’t ask what models they used. I ask how they know it works. That question changes everything.




