The Future of Testing LLMs: You Can’t Test Chaos with Constants
Software testing methods that have served us for decades are falling short with non-deterministic AI systems — here’s why fighting fire with fire might be our best path forward.
For decades, software engineers have relied on a simple truth: given the same input, a piece of code should produce the same output. This predictability has been the bedrock of software testing, enabling us to build increasingly complex systems with confidence. We write tests, run code, and get clear pass/fail results. It’s a methodology that has served us relatively well and led to common practices like test-driven development (TDD) and continuous integration.
But large language models (LLMs) have more or less shattered this paradigm into a billion pieces. These AI systems, with their inherent non-determinism, respond differently to identical prompts, incorporate context in unexpected ways, and exhibit emergent behaviors that defy our traditional testing approaches. This unpredictability has left many businesses and projects hesitant to adopt these powerful tools, in part because they simply can’t evaluate their implementations against clear success or failure criteria.
I’ve been thinking about this problem a lot lately, sparked by some AI-focused events hosted by AICamp (shout-out to them for hosting great events) in Chicago. Dean Wampler, IBM’s engineering lead for The AI Alliance (founded in 2023), spoke about LLMs, testability, and incremental improvement. As someone who values software resiliency and empirical decision-making, I found that these talks about the challenges of applying determinism and existing testing methodologies to Gen-AI systems really got the gears turning in my head. One particular question he posed stuck with me:
“Without our familiar determinism, what design and test strategies should we use to rebuild the confidence our tests are supposed to provide?”
The software industry now faces an important question: How do we validate systems that are inherently unpredictable? The answer might require us to fundamentally rethink our approach to testing, moving away from the rigid benchmarks and academic evaluations that served us so beautifully in the past. Perhaps the solution lies not in fighting against the non-deterministic nature of these systems, but in embracing it.
Paradigm Shift
Software testing has always been about certainty. When a developer writes a unit test asserting that 2 + 2 equals 4, that test will pass or fail consistently. This predictability has shaped not just how we test software, but how we design it — through practices like dependency injection, interface segregation, and modular architecture.
But LLMs don’t play by these comforting rules. Ask ChatGPT to write a poem about software testing twice, and you’ll get two different results, often vastly different. Both might be perfectly valid, but neither will match your expected output exactly. This variance isn’t a bug — it’s a feature. These models are designed to be “creative”, to understand context, and to generate novel responses. They’re meant to surprise us. And to traditional testing methods, surprises are terrifying.
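To make that contrast concrete, here’s a minimal sketch in plain Python. The toy logit table stands in for a real model, so treat it as an illustration rather than how any particular LLM actually works: a classic assertion behaves identically on every run, while anything sampled at a nonzero temperature does not.

```python
import math
import random

def sample_next_token(logits: dict, temperature: float = 1.0) -> str:
    """Pick the next token by sampling from a temperature-scaled softmax.

    With temperature > 0, identical inputs can yield different outputs;
    as temperature approaches 0, the choice collapses toward the argmax.
    """
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_score = max(scaled.values())
    weights = [math.exp(s - max_score) for s in scaled.values()]  # numerically stable softmax weights
    return random.choices(list(scaled.keys()), weights=weights, k=1)[0]

# A deterministic assertion is trivially repeatable: it passes or fails the same way every run.
assert 2 + 2 == 4

# A sampled generation is not: the same "prompt" (here, a toy logit table)
# can produce a different continuation on every call.
toy_logits = {"tests": 2.0, "chaos": 1.8, "poetry": 1.5}
print([sample_next_token(toy_logits, temperature=1.0) for _ in range(5)])
```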
Traditional benchmark testing falls particularly flat here. Academic evaluations might tell us that a model scores 92% on a specific task — how perfect! — but what does that really mean? These benchmarks often reduce complex, nuanced interactions to simplified metrics that don’t capture the full picture.
Consider testing an LLM’s ability to write code: a benchmark might check if the code compiles and produces expected outputs, but it can’t evaluate readability, adherence to best practices, or maintainability. These qualitative aspects — often the most important ones — slip right through the cracks. Worse still, these benchmarks face rapid obsolescence as new model training inevitably incorporates the benchmark details themselves.
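Here’s a rough sketch of what that kind of check usually reduces to (the harness and the two candidate solutions are invented for illustration, in the spirit of pass/fail suites like HumanEval): both snippets pass, and the benchmark has nothing to say about which one you’d actually want a teammate to have written.

```python
def run_functional_check(generated_src: str, entry_point: str, cases: list) -> bool:
    """Execute model-generated source and verify input/output pairs."""
    namespace: dict = {}
    try:
        exec(generated_src, namespace)  # compile and run the candidate code
        fn = namespace[entry_point]
        return all(fn(*args) == expected for args, expected in cases)
    except Exception:
        return False

# Two candidate "solutions": both pass the functional check, but the harness
# has no opinion on readability, best practices, or maintainability.
clean = "def add(a, b):\n    return a + b\n"
obscure = "add = lambda a, b: sum([a] + [b])\n"
cases = [((2, 3), 5), ((-1, 1), 0)]

print(run_functional_check(clean, "add", cases))    # True
print(run_functional_check(obscure, "add", cases))  # True
```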
Rethinking Evaluation
When you first try out a new LLM — say a new competitor to ChatGPT has just hit the market — what’s your process? If you’re like me, you don’t start by checking benchmark scores or reading academic papers. Instead, you dive in and start actually using it. You ask questions, challenge responses, and naturally develop a feel for its capabilities and limitations. I think that this intuitive evaluation process actually hints at something profound about how we need to approach testing non-deterministic systems.
Humans excel at evaluating outputs that exist in shades of gray rather than binary states. We can instantly recognize when a response is technically correct but misses the point, or when it’s slightly imperfect but deeply insightful. We understand context, detect nuance, and judge appropriateness in ways that automated tests simply cannot match.
The key insight isn’t just that humans are good at evaluating LLM outputs — it’s that human feedback, when properly structured and scaled, could provide a more meaningful measure of improvement than any predetermined benchmark. Instead of asking “Does this model score better on academic test X?” we should be asking “Do users find this model more helpful in achieving their real-world goals?”
Trusting Your Gut (at Scale)
In my opinion, we’re already seeing two distinct patterns emerge for scaling human feedback in LLM testing, and both show promise.
The first approach is what I’ll call the customer-driven feedback loop. OpenAI’s ChatGPT offers a perfect example. Have you noticed how it sometimes generates two responses and asks which one you prefer? That’s not just for show. They’re essentially running a massive, continuous A/B test, collecting real user preferences at scale. It’s brilliant in its simplicity: users naturally engage with the system, make quick comparative judgments, and provide valuable feedback without even feeling like they’re part of a testing process.
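On the backend, that kind of preference collection can be boiled down to something like the sketch below. The model names and votes are made up, and production systems lean on sturdier statistics (Elo or Bradley-Terry style ratings, as leaderboards like Chatbot Arena do), but the core idea is the same: lots of small human judgments, aggregated at scale.

```python
from collections import defaultdict

votes = [  # (winner, loser) pairs collected from user preference clicks
    ("model_b", "model_a"),
    ("model_b", "model_a"),
    ("model_a", "model_b"),
    ("model_b", "model_a"),
]

wins: dict = defaultdict(int)
games: dict = defaultdict(int)
for winner, loser in votes:
    wins[winner] += 1
    games[winner] += 1
    games[loser] += 1

# Roll pairwise judgments up into comparative win rates.
for model in sorted(games, key=lambda m: wins[m] / games[m], reverse=True):
    print(f"{model}: {wins[model] / games[model]:.0%} win rate over {games[model]} comparisons")
```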
The second pattern I see taking hold is community-driven development. Take IBM and Red Hat’s InstructLab project (which I first heard of during one of the aforementioned events) as one example — though it’s certainly not the only one. InstructLab provides a framework where developers can contribute improvements through a familiar GitHub-esque workflow. Contributors create “skill recipes” with example questions and answers, which are then used to generate synthetic training data. The community reviews these contributions, and approved changes are integrated into model updates.
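To give a flavor of what a contribution like that might look like, here’s a purely conceptual sketch in Python. It is not InstructLab’s actual contribution format, and the skill name and seed examples are invented, but it captures the shape of the idea: a few human-written question/answer pairs that a teacher model can expand into synthetic training data for the community to review.

```python
# Illustrative only -- not InstructLab's real file format or API.
skill_recipe = {
    "skill": "explain-regex",  # hypothetical skill name
    "seed_examples": [
        {
            "question": "What does the regex ^\\d{3}-\\d{4}$ match?",
            "answer": "Exactly three digits, a hyphen, then four digits.",
        },
        {
            "question": "How do I make a regex group non-capturing?",
            "answer": "Use (?:...) instead of (...).",
        },
    ],
}

def expansion_prompts(recipe: dict) -> list:
    """Turn seed Q&A pairs into prompts a teacher model could use to generate
    additional synthetic variations in the same style."""
    return [
        "Write a new question and answer in the style of:\n"
        f"Q: {example['question']}\nA: {example['answer']}"
        for example in recipe["seed_examples"]
    ]

for prompt in expansion_prompts(skill_recipe):
    print(prompt, end="\n\n")
```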
What I think is fascinating is that both of these approaches acknowledge the same fundamental truth: you can’t effectively test a non-deterministic system with deterministic methods. Instead of fighting against this reality, these approaches lean into it, using human judgment as a tool rather than an obstacle.
Embracing the Unpredictable
In the end, I really do think the true solution may not be to try harder to make non-deterministic systems fit into our deterministic testing boxes. Instead, it may be to acknowledge that when we’re testing systems that “think” more like humans, we need evaluation methods that match that unpredictable nature.
This isn’t about outright abandoning rigor, forbidding every element of determinism, or accepting complete chaos. It’s about recognizing that the most effective way to test an unpredictable system is likely with something equally unpredictable — human judgment, applied systematically and at scale. I personally think that the companies and projects that understand this first, and that move beyond the conceptual level to harness human feedback effectively, will be the ones that define the next generation of AI development.
Sometimes the best way to test chaos isn’t with constants — it’s with chaos itself. Is it time for us to fight fire with fire?