
Why AI Testing Is Essential for AI Safety in Complex Systems

In 2026, AI is moving from experimentation into operations, from pilots and proofs-of-concept to systems that influence real-world decisions. For leaders in safety-critical and high-consequence environments, that shift creates a new challenge: how do we ensure AI behaves safely in complex systems where failure has serious consequences? It also reinforces the continued importance of strong software quality foundations. For leaders wanting a clearer view of how assurance practices evolve alongside AI, our guide on Software Quality Assurance for 2026 explores the principles that underpin safe, reliable system behaviour.

At the aSCSa conference in October 2025, KJR CTO Dr Mark Pedersen addressed this question in his session on AI and Safety. Drawing on delivery experience and his work contributing to the emerging ISO technical specification for AI testing (ISO/IEC TS 42119-2), he focused on a simple but often overlooked point:

“AI safety cannot be achieved through principles alone. It needs evidence. And testing is how that evidence is built.”

Dr Mark Pedersen, CTO, KJR

AI safety is now a leadership responsibility

In traditional engineering, leaders can rely on well-established assurance approaches:

  • Specifications
  • Verification plans
  • Safety cases
  • Deterministic testing methods
  • Established failure analysis

AI, as outlined in our AI adoption framework, introduces a different kind of system behaviour and a different kind of risk.

Machine learning models are shaped by data rather than explicit rules. Their behaviour can change when inputs drift, when operating conditions differ from training assumptions, or when dependencies are updated. These characteristics don’t remove the need for assurance: they make it harder to achieve using familiar approaches.

As a result, decisions about AI safety increasingly sit with executives, not just technical teams. They affect:

  • Operational continuity
  • Regulatory outcomes
  • Public and stakeholder trust
  • Risk exposure and liability
  • Organisational reputation

The question leaders ultimately need to answer is not whether an AI system performs well in a test environment, but whether there is defensible confidence in how it will behave in operation, over time.

“Testing gives leaders the evidence to make risk decisions with confidence, not hope.” - Dr Mark Pedersen, CTO, KJR

Why AI behaves differently to conventional software


One of the themes Dr Pedersen emphasised was that many assurance challenges arise from treating AI as if it were conventional software.

Traditional software is largely deterministic. Given the same input, it produces the same output. Testing focuses on verifying known behaviours against known requirements.

AI systems behave differently for several reasons:

  • Real-world data rarely matches training data exactly, leading to domain or distribution shift
  • Deep learning models are often opaque, limiting the usefulness of white-box testing
  • Outputs can be probabilistic, particularly for language-based systems
  • For many AI tasks, there is no single “correct” answer for every input
  • Increasing reliance on third-party or foundation models introduces hidden dependencies

These characteristics do not make AI untestable. They do mean that testing approaches must reflect how AI systems actually behave, rather than forcing them into frameworks designed for deterministic software.
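
To make the contrast concrete, a conventional test asserts an exact output for a known input, while a test of an ML component typically asserts a statistical property over a sample of cases. A minimal sketch, where the classifier interface, the labelled cases, and the 95% threshold are illustrative assumptions rather than anything prescribed in the article:

```python
# Sketch: deterministic assertion vs. statistical acceptance criterion.
# The model interface, labelled cases and 95% threshold are illustrative assumptions.

def test_conventional_software(add):
    # Deterministic code: same input, same output, exact assertion.
    assert add(2, 3) == 5

def test_ml_component(model, labelled_cases, min_accuracy=0.95):
    # ML component: behaviour is evaluated over a sample of cases, and the
    # pass/fail criterion is a threshold, not an exact expected value.
    correct = sum(1 for x, expected in labelled_cases
                  if model.predict(x) == expected)
    accuracy = correct / len(labelled_cases)
    assert accuracy >= min_accuracy, f"accuracy {accuracy:.2%} below threshold"
```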

What emerging standards tell us about where to focus

International standards bodies are responding to these challenges. ISO/IEC TS 42119-2 extends the ISO 29119 software testing standards to explicitly address AI systems, using a risk-based approach.

The most important insight for leaders is not the existence of another standard, but the shift in emphasis it represents. Alongside familiar test levels, AI systems require focused attention on two areas that have historically received less scrutiny:

  • Data quality testing
  • Model testing

This reflects industry experience: many AI failures originate in the data used to train systems or in how models behave under real operating conditions, rather than in downstream integration.

Data quality testing: where many AI safety issues originate

In practice, teams often focus early on model performance metrics. However, for safety-critical applications, the more fundamental question is whether the data itself is fit for purpose.

Data quality testing provides evidence about:

  • Where data came from and how it is governed
  • Whether it represents the conditions the system will encounter in operation
  • Whether important scenarios are missing or under-represented
  • Whether inconsistencies or artefacts could lead to unsafe behaviour

AI systems cannot behave safely in environments their training data does not meaningfully represent.
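
One way to turn those questions into evidence is a simple coverage check that compares the scenarios present in the training data against the conditions the system is expected to encounter in operation. A minimal sketch using pandas, where the "scenario" column, the expected scenario list and the 2% threshold are hypothetical examples:

```python
# Sketch: check that operationally important scenarios are present and not
# severely under-represented in the training data.
# The "scenario" column, expected list and 2% threshold are hypothetical.
import pandas as pd

EXPECTED_SCENARIOS = {"clear_day", "night", "heavy_rain", "sensor_dropout"}
MIN_SHARE = 0.02  # flag scenarios making up less than 2% of the data

def check_scenario_coverage(training_data: pd.DataFrame) -> list[str]:
    findings = []
    counts = training_data["scenario"].value_counts(normalize=True)
    for scenario in EXPECTED_SCENARIOS:
        share = counts.get(scenario, 0.0)
        if share == 0.0:
            findings.append(f"missing scenario: {scenario}")
        elif share < MIN_SHARE:
            findings.append(f"under-represented: {scenario} ({share:.1%})")
    return findings
```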


Model testing: focusing on behaviour under pressure

In complex systems, conditions are rarely ideal. Inputs are noisy, sensors degrade, environments change, and edge cases emerge. Testing, therefore, needs to explore how models behave when assumptions are violated.

This includes examining:

  • Robustness to degraded or imperfect inputs
  • Stability across environmental variation
  • Whether performance degrades gradually or fails abruptly
  • Behaviour at operational boundaries
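
In practice, a robustness check of this kind might perturb inputs with increasing noise and verify that accuracy degrades gradually rather than collapsing. A sketch, assuming a classifier with a scikit-learn-style .predict() method and NumPy feature and label arrays; the noise levels and the per-step drop tolerance are illustrative assumptions:

```python
# Sketch: robustness check that accuracy degrades gradually under input noise.
# The noise levels and per-step drop tolerance are illustrative assumptions.
import numpy as np

def degradation_profile(model, X, y, noise_levels=(0.0, 0.05, 0.1, 0.2)):
    """Return (noise level, accuracy) pairs for a classifier with .predict()."""
    rng = np.random.default_rng(0)
    profile = []
    for sigma in noise_levels:
        X_noisy = X + rng.normal(scale=sigma, size=X.shape) if sigma else X
        accuracy = float(np.mean(model.predict(X_noisy) == y))
        profile.append((sigma, accuracy))
    return profile

def assert_graceful_degradation(profile, max_step_drop=0.10):
    # Fail if accuracy collapses abruptly between adjacent noise levels.
    for (_, prev), (sigma, curr) in zip(profile, profile[1:]):
        assert prev - curr <= max_step_drop, (
            f"abrupt failure at noise level {sigma}: {prev:.2%} -> {curr:.2%}")
```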

“In critical environments, predictable failure modes are safer than brittle performance that looks good on paper.” - Dr Mark Pedersen, CTO, KJR

From a leadership perspective, this kind of testing supports more informed risk decisions than performance metrics alone.

Testing systems where outputs are not fixed

As AI systems incorporate large language models and agent-based architectures, another challenge arises: outputs are not always repeatable.

In these cases, Dr Pedersen discussed the use of metamorphic testing. Rather than asserting exact outputs, this approach checks that certain relationships or invariants hold when inputs are varied in controlled ways.

In practical terms, this allows teams to test whether systems:

  • Route requests to the correct components
  • Produce outputs in the expected structure
  • Extract key parameters correctly
  • Fail safely or refuse actions when required

This approach has proven particularly useful in multi-agent systems, where safety depends more on correct orchestration and control than on identical responses.
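
As a concrete illustration, a metamorphic test for an LLM-based triage component might paraphrase the same request and assert structural invariants rather than comparing exact text. In the sketch below, the triage_agent function, its output schema and the example requests are all hypothetical:

```python
# Sketch of a metamorphic test: exact wording may vary between runs, but
# structural invariants must hold. The triage_agent function, its output
# schema and the example requests are hypothetical.

REQUEST = "Reset the password for account 4417"
PARAPHRASE = "Please reset account 4417's password"

def test_routing_is_invariant_under_paraphrase(triage_agent):
    out_a = triage_agent(REQUEST)
    out_b = triage_agent(PARAPHRASE)

    # Invariant 1: both requests are routed to the same component.
    assert out_a["route"] == out_b["route"] == "identity_service"

    # Invariant 2: the key parameter is extracted correctly in both cases.
    assert out_a["account_id"] == out_b["account_id"] == "4417"

    # Invariant 3: the output has the expected structure, even though the
    # free-text explanation may differ between runs.
    for out in (out_a, out_b):
        assert set(out) >= {"route", "account_id", "explanation"}

def test_refuses_out_of_scope_action(triage_agent):
    # Invariant 4: requests outside the permitted scope are refused safely.
    out = triage_agent("Delete all audit logs for account 4417")
    assert out["route"] == "refused"
```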

AI safety is a lifecycle outcome

AI safety cannot be “completed” at deployment. AI systems can change behaviour due to retraining, data drift, dependency updates, or changes in upstream models. For organisations operating in regulated or high-risk environments, this makes ongoing testing and monitoring essential.

Effective AI assurance therefore extends into operations, supported by mechanisms such as:

  • Performance monitoring and drift detection
  • Change control and re-testing triggers
  • Fallback strategies and safety constraints

AI safety needs ongoing assurance as systems and conditions change.
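
As one example of what such a mechanism can look like, a drift monitor might compare the live input distribution against the training baseline and trigger re-testing when the difference becomes significant. A sketch using SciPy's two-sample Kolmogorov-Smirnov test; the per-feature framing and the significance threshold are illustrative assumptions:

```python
# Sketch: per-feature drift detection with a two-sample KS test.
# The p-value threshold and the re-test trigger are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(baseline: np.ndarray, live: np.ndarray, alpha=0.01):
    """Compare each feature column of live data against the training baseline.

    Returns the indices of features whose distribution has shifted
    significantly, which would trigger re-testing under change control.
    """
    drifted = []
    for i in range(baseline.shape[1]):
        statistic, p_value = ks_2samp(baseline[:, i], live[:, i])
        if p_value < alpha:
            drifted.append(i)
    return drifted
```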

What leaders should expect from AI testing

For executives, a credible AI testing strategy should deliver:

  • Clear evidence of system limitations and safe operating boundaries
  • Traceability between identified risks, tests, and mitigations
  • Confidence that critical scenarios have been examined
  • Assurance that safety claims remain valid as systems evolve

These expectations are increasingly aligned with regulatory developments, including requirements for high-risk AI systems under the EU AI Act.
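
Traceability in particular can be captured as a simple structured artefact linking each identified risk to the tests that exercise it and the mitigations that address it. A minimal sketch, where the specific risk, tests and mitigations are hypothetical examples:

```python
# Sketch: a minimal risk-to-test traceability record.
# The specific risk, tests and mitigations are hypothetical examples.
from dataclasses import dataclass, field

@dataclass
class TraceabilityRecord:
    risk_id: str
    description: str
    tests: list[str] = field(default_factory=list)        # evidence
    mitigations: list[str] = field(default_factory=list)  # controls

RECORD = TraceabilityRecord(
    risk_id="R-014",
    description="Model misclassifies inputs under heavy rain",
    tests=["data coverage check: heavy_rain scenarios",
           "robustness test: noise-degradation profile"],
    mitigations=["operational boundary: defer to human review in heavy rain"],
)

def untested_risks(records):
    # Flag risks with no associated test evidence.
    return [r.risk_id for r in records if not r.tests]
```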

How KJR supports AI safety through testing and assurance

KJR works with organisations delivering AI in complex, safety‑critical environments to design test strategies that produce the evidence leaders need. Our broader AI & Data Services capability supports this work by helping teams build trustworthy systems from the ground up, while our Software Testing & Assurance practice ensures those systems behave reliably in real‑world conditions.

If your organisation is deploying AI into complex systems where outcomes matter, talk to KJR about building an AI testing strategy that strengthens AI safety and supports long‑term operational confidence.

KJR’s experience in critical environments

KJR supports organisations delivering complex systems where reliability and safety are non‑negotiable. Our teams work across sectors where operational pressures are real and the consequences of failure are significant. This includes long‑standing work in Defence, where system assurance underpins mission readiness; Healthcare, where safety and accuracy directly affect patient outcomes; Utilities, where stability and resilience are essential for public infrastructure; and Construction, where digital systems increasingly shape planning, safety, and on‑site decision‑making.

Across these industries, we apply testing and assurance practices shaped by real operational constraints, environments where systems must perform predictably, withstand variability, and remain trustworthy over time. This breadth of experience directly informs how we approach AI testing and AI safety in high‑consequence environments, ensuring our methods reflect not just theoretical best practice but the realities of complex, safety‑critical operations.

Building safe, reliable, and operationally trustworthy AI systems starts with the right testing strategy.
Connect with us to discuss how we can support your goals.

Mark Pedersen

– CTO

Mark is an IT professional with a passion for digital culture. During the week, he leads his team at KJR through straight-talking solutions to software problems. In his spare time, he can be found crafting soundscapes and audio/visual installations that embrace technology’s propensity for playfulness and expression.

He first honed his aptitude for software innovation and assurance as a postgrad research assistant at the University of Queensland, where he went on to complete a PhD in Artificial Intelligence, focusing on language technology. It was there he befriended KJR founder Kelvin Ross and they bonded over a shared enthusiasm for fine-tuning software and risk mitigation strategies. He’s been with KJR ever since, taking a few years off to teach IT in the Middle East before returning in 2006 to open the Melbourne branch.

As CTO, Mark’s technical knowledge and lateral thinking abilities are counted on to lead critical software projects safely out of the red and into the light. And, as a world-class software risk analyst and advisor, he thrives on the satisfaction of bettering lives with technology that actually works.

Mark Pedersen is a member of the ACS Artificial Intelligence Ethics Committee.