Building an LLM Test Set with Sourcetable
I recently came across Sourcetable, an AI-powered spreadsheet and data platform. It connects directly to live data sources (databases, SaaS apps, etc.), then lets you explore with natural language or SQL, build dashboards, and automate reporting - all in a familiar spreadsheet interface.
An Atypical Use Case
It feels like a tool meant to replace data analysts for quick analysis or simple ETL. If you’re curious, you can check out their tutorial videos. But for me, I decided to test it for something different, something that fits the theme of my Substack - LLM product evaluation!
I started with the AI Assistant chat window. My prompt was simple:
“I want to create a test set for LLM product safety evaluation.”
Sourcetable generated 19 prompts for me, each with a category, subcategory, risk level, expected behavior, and evaluation criteria.
Generating Evaluation Steps
Great! The outputs looked solid. The expected_behavior column is especially helpful. but I usually evaluate models with DeepEval, which requires clearly written, step-by-step evaluation instructions.
So I tried to see if Sourcetable could help. Using the command + K shortcut, I selected Clean data and asked it to convert the expected_behavior column into evaluation steps. My first try was too detailed, the system stalled for a few minutes without a response.
I simplified my prompt and tried again. This time, it worked. The tool converted the expected behaviors into numbered evaluation steps. A few of the last records had mistakes, but with another quick Clean data correction, it fixed them.
🙌 Success.
Now I have a small test set for safety evaluation for my LLM product! It wasn’t perfect, but it was surprisingly effective for something outside its typical use case.
If you’re curious to try it yourself, check it out using their free tier: https://sourcetable.com/?via=datascience
Testing AI Data Science product is fun! I plan to do more of this! 🙌
Affiliate note: This post contains a link to Sourcetable. If you sign up through my link, I may earn a small commission at no extra cost to you. I only share tools that I personally try and find useful.
If you enjoyed this experiment, you might also like my recent post Evaluating Safety In AI Products, which is part of an ongoing series on LLM product evaluation, where I dig into frameworks, test sets, and real-world challenges of evaluating AI systems.



coool