Measuring Safety In AI Products
Protecting Users Before and After Launch
Hey there! This is the third post in my series on evaluating LLM-powered products, part of my ongoing effort to rediscover what data science means in the AI era. If you're interested in this topic, subscribe to get updates!
I’d love to hear your thoughts, experiences, or even disagreements. Drop a comment below, or reach out on LinkedIn. Let’s keep learning and figuring this out together!
When I First Started Working on GenAI Evaluation. Leadership told me:
“It’s not the end of the world if our AI gives bad answers — but if it outputs discriminatory content or leaks user data, we’re doomed.”
That stuck with me.
Although my first post in this series focused on accuracy metrics, I firmly believe that accuracy isn’t the most important metric. Safety is.
But “safety” means different things to different people.
Privacy & High-Stakes Risks
Among different safety metrics, privacy and high-stakes risks top the list.
Privacy Violations: Does it protect sensitive or personally identifiable information (PII)?
High-Stakes Risk Handling (okay, we admit — we need a better name for this!): How does it handle sensitive, consequential situations like mental health, crime, discrimination, or other high-impact topics?
Today’s agentic AI systems can make external function calls, access databases, or retrieve real-time data. As a result, safeguarding private information has become more complex and critical. We run dedicated tests before every release so we’re not losing sleep worrying the AI might accidentally expose sensitive data.
A real-world example from higher education:
Many universities publish aggregate student enrollment data, often through public dashboards. While these datasets are typically safe, they come with a key caveat: if a filtered segment shows fewer than five individuals, it risks becoming personally identifiable. That’s why many institutions, like Arizona State University, automatically suppress reporting on any subgroup smaller than five—a small but essential safeguard.
On the high-stakes risk side, consider this:
Today’s students face enormous pressure, and youth mental health issues have surged over the past decade. Before AI, struggling students might turn directly to faculty or staff for help. Now, many will first interact with a chatbot. That bot must be empathetic, responsible, and able to guide students toward helpful resources—or at least know when to defer.
This is just one type of high-stakes risk; others include crime reporting, discrimination, harassment, or even suicide risk.
How Do We Evaluate Safety?
I can sum up how I do it in one sentence: LLM-guided evaluation.
Here’s the process:
Define your evaluation objectives.
Curate datasets that cover the key scenarios you want to test.
Use an LLM to automatically score responses.
But here’s the key difference from evaluating accuracy:
You’re not checking whether the AI’s output matches a ground-truth answer — you’re checking whether it meets your safety checklist.
For example:
Privacy checks: Are there any PII leaks? Any sensitive data exposures?
High-stakes checks: Does the bot express empathy? Encourage seeking help? Reference the right resources?
Clear, well-documented scoring rubrics are essential to ensure consistency across datasets and evaluators.
Dataset Curation
From my experience, the most challenging part of this process is dataset curation.
Sometimes, we’re lucky: there’s historical data we can repurpose for evaluation. But more often, we need to work closely with subject matter experts and stakeholders to collect meaningful test cases, or, when needed, design synthetic data that realistically simulates high-risk or edge-case scenarios.
Pre-Deployment Alone Isn’t Enough
Everything above falls under pre-deployment evaluation, these are the tests we run before an AI product goes live.
But here’s the truth:
Generative AI is… well, generative. No matter how hard we test it before release, things can still go wrong in production. Evaluation helps us find potential issues; it doesn’t inherently fix them.
And when it comes to safety, we can’t afford to get it wrong.
That’s why pre-deployment evaluation isn’t enough.
We also need real-time monitoring mechanisms to catch and mitigate risks as they happen.
Adding a Moderation Layer
One practical enterprise solution is implementing a moderation layer—an agent (or set of agents) that monitors outputs before they’re sent to users.
This layer can:
Scan for privacy violations or high-stakes risk patterns
Enforce hard policies and safety thresholds
Redirect or escalate when necessary
Feed incidents back into continuous improvement loops
……
In short, moderation acts as the last line of defense—and an essential partner to pre-deployment evaluations.
Okay, that’s it for now on safety evaluation! I hope you found it useful. If you have thoughts, experiences, or critiques, I’d love to hear from you—let’s keep the conversation going and learn from each other. Thanks for reading and for being part of this evolving journey!
This series from Data Science x AI is free. If you find the posts valuable, becoming a paid subscriber is a lovely way to show support! ❤






Dataset curation is often one of the lengthiest parts of any project for me. But, like any output, it can make or break the end product, in this case the safety of an AI product.
Thanks for your post - There's maybe more to AI safety than meets the eye - Having discovered a major technical issue - these are some additional thoughts to those above - about how AI regulation needs mechanism for people to register discovered issues as we do in medical devices or the aeronautical industry where safety is a prime issue.
https://kevinhaylett.substack.com/p/there-is-no-ai-safety