
Why AI Testing Is Essential for AI Safety in Complex Systems

In 2026, AI is moving from experimentation into operations, from pilots and proofs-of-concept to systems that influence real-world decisions. For leaders in safety-critical and high-consequence environments, that shift creates a new challenge: how do we ensure AI behaves safely in complex systems where failure has serious consequences? It also reinforces the continued importance of strong software quality foundations. For leaders wanting a clearer view of how assurance practices evolve alongside AI, our guide on Software Quality Assurance for 2026 explores the principles that underpin safe, reliable system behaviour.

At the aSCSa conference in October 2025, KJR CTO Dr Mark Pedersen addressed this question in his session on AI and Safety. Drawing on delivery experience and his work contributing to the emerging ISO technical specification for AI testing (ISO/IEC TS 42119-2), he focused on a simple but often overlooked point:

“AI safety cannot be achieved through principles alone. It needs evidence. And testing is how that evidence is built.”

Dr Mark Pedersen, CTO, KJR

AI safety is now a leadership responsibility

In traditional engineering, leaders can rely on well-established assurance approaches:

  • Specifications
  • Verification plans
  • Safety cases
  • Deterministic testing methods
  • Established failure analysis

AI, as outlined in our AI adoption framework, introduces a different kind of system behaviour and a different kind of risk.

Machine learning models are shaped by data rather than explicit rules. Their behaviour can change when inputs drift, when operating conditions differ from training assumptions, or when dependencies are updated. These characteristics don’t remove the need for assurance: they make it harder to achieve using familiar approaches.

As a result, decisions about AI safety increasingly sit with executives, not just technical teams. They affect:

  • Operational continuity
  • Regulatory outcomes
  • Public and stakeholder trust
  • Risk exposure and liability
  • Organisational reputation

The question leaders ultimately need to answer is not whether an AI system performs well in a test environment, but whether there is defensible confidence in how it will behave in operation, over time.

“Testing gives leaders the evidence to make risk decisions with confidence, not hope.” - Dr Mark Pedersen, CTO, KJR

Why AI behaves differently to conventional software


One of the themes Dr Pedersen emphasised was that many assurance challenges arise from treating AI as if it were conventional software.

Traditional software is largely deterministic. Given the same input, it produces the same output. Testing focuses on verifying known behaviours against known requirements.

AI systems behave differently for several reasons:

  • Real-world data rarely matches training data exactly, leading to domain or distribution shift
  • Deep learning models are often opaque, limiting the usefulness of white-box testing
  • Outputs can be probabilistic, particularly for language-based systems
  • For many AI tasks, there is no single “correct” answer for every input
  • Increasing reliance on third-party or foundation models introduces hidden dependencies

These characteristics do not make AI untestable. They do mean that testing approaches must reflect how AI systems actually behave, rather than forcing them into frameworks designed for deterministic software.
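
To make the contrast concrete, a conventional test asserts an exact output for a known input, while a test of an ML component typically asserts a statistical property over a sample of cases. A minimal sketch, where the classifier interface, the labelled cases, and the 95% threshold are illustrative assumptions rather than anything prescribed in the article:

```python
# Sketch: deterministic assertion vs. statistical acceptance criterion.
# The model interface, labelled cases and 95% threshold are illustrative assumptions.

def test_conventional_software(add):
    # Deterministic code: same input, same output, exact assertion.
    assert add(2, 3) == 5

def test_ml_component(model, labelled_cases, min_accuracy=0.95):
    # ML component: behaviour is evaluated over a sample of cases, and the
    # pass/fail criterion is a threshold, not an exact expected value.
    correct = sum(1 for x, expected in labelled_cases
                  if model.predict(x) == expected)
    accuracy = correct / len(labelled_cases)
    assert accuracy >= min_accuracy, f"accuracy {accuracy:.2%} below threshold"
```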

What emerging standards tell us about where to focus

International standards bodies are responding to these challenges. ISO/IEC TS 42119-2 extends the ISO 29119 software testing standards to explicitly address AI systems, using a risk-based approach.

The most important insight for leaders is not the existence of another standard, but the shift in emphasis it represents. Alongside familiar test levels, AI systems require focused attention on two areas that have historically received less scrutiny:

  • Data quality testing
  • Model testing

This reflects industry experience: many AI failures originate in the data used to train systems or in how models behave under real operating conditions, rather than in downstream integration.

Data quality testing: where many AI safety issues originate

In practice, teams often focus early on model performance metrics. However, for safety-critical applications, the more fundamental question is whether the data itself is fit for purpose.

Data quality testing provides evidence about:

  • Where data came from and how it is governed
  • Whether it represents the conditions the system will encounter in operation
  • Whether important scenarios are missing or under-represented
  • Whether inconsistencies or artefacts could lead to unsafe behaviour

AI systems cannot behave safely in environments their training data does not meaningfully represent.
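
One way to turn those questions into evidence is a simple coverage check that compares the scenarios present in the training data against the conditions the system is expected to encounter in operation. A minimal sketch using pandas, where the "scenario" column, the expected scenario list and the 2% threshold are hypothetical examples:

```python
# Sketch: check that operationally important scenarios are present and not
# severely under-represented in the training data.
# The "scenario" column, expected list and 2% threshold are hypothetical.
import pandas as pd

EXPECTED_SCENARIOS = {"clear_day", "night", "heavy_rain", "sensor_dropout"}
MIN_SHARE = 0.02  # flag scenarios making up less than 2% of the data

def check_scenario_coverage(training_data: pd.DataFrame) -> list[str]:
    findings = []
    counts = training_data["scenario"].value_counts(normalize=True)
    for scenario in EXPECTED_SCENARIOS:
        share = counts.get(scenario, 0.0)
        if share == 0.0:
            findings.append(f"missing scenario: {scenario}")
        elif share < MIN_SHARE:
            findings.append(f"under-represented: {scenario} ({share:.1%})")
    return findings
```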


Model testing: focusing on behaviour under pressure

In complex systems, conditions are rarely ideal. Inputs are noisy, sensors degrade, environments change, and edge cases emerge. Testing, therefore, needs to explore how models behave when assumptions are violated.

This includes examining:

  • Robustness to degraded or imperfect inputs
  • Stability across environmental variation
  • Whether performance degrades gradually or fails abruptly
  • Behaviour at operational boundaries
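
In practice, a robustness check of this kind might perturb inputs with increasing noise and verify that accuracy degrades gradually rather than collapsing. A sketch, assuming a classifier with a scikit-learn-style .predict() method and NumPy feature and label arrays; the noise levels and the per-step drop tolerance are illustrative assumptions:

```python
# Sketch: robustness check that accuracy degrades gradually under input noise.
# The noise levels and per-step drop tolerance are illustrative assumptions.
import numpy as np

def degradation_profile(model, X, y, noise_levels=(0.0, 0.05, 0.1, 0.2)):
    """Return (noise level, accuracy) pairs for a classifier with .predict()."""
    rng = np.random.default_rng(0)
    profile = []
    for sigma in noise_levels:
        X_noisy = X + rng.normal(scale=sigma, size=X.shape) if sigma else X
        accuracy = float(np.mean(model.predict(X_noisy) == y))
        profile.append((sigma, accuracy))
    return profile

def assert_graceful_degradation(profile, max_step_drop=0.10):
    # Fail if accuracy collapses abruptly between adjacent noise levels.
    for (_, prev), (sigma, curr) in zip(profile, profile[1:]):
        assert prev - curr <= max_step_drop, (
            f"abrupt failure at noise level {sigma}: {prev:.2%} -> {curr:.2%}")
```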

“In critical environments, predictable failure modes are safer than brittle performance that looks good on paper.” - Dr Mark Pedersen, CTO, KJR

From a leadership perspective, this kind of testing supports more informed risk decisions than performance metrics alone.

Testing systems where outputs are not fixed

As AI systems incorporate large language models and agent-based architectures, another challenge arises: outputs are not always repeatable.

In these cases, Dr Pedersen discussed the use of metamorphic testing. Rather than asserting exact outputs, this approach checks that certain relationships or invariants hold when inputs are varied in controlled ways.

In practical terms, this allows teams to test whether systems:

  • Route requests to the correct components
  • Produce outputs in the expected structure
  • Extract key parameters correctly
  • Fail safely or refuse actions when required

This approach has proven particularly useful in multi-agent systems, where safety depends more on correct orchestration and control than on identical responses.
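
As a concrete illustration, a metamorphic test for an LLM-based triage component might paraphrase the same request and assert structural invariants rather than comparing exact text. In the sketch below, the triage_agent function, its output schema and the example requests are all hypothetical:

```python
# Sketch of a metamorphic test: exact wording may vary between runs, but
# structural invariants must hold. The triage_agent function, its output
# schema and the example requests are hypothetical.

REQUEST = "Reset the password for account 4417"
PARAPHRASE = "Please reset account 4417's password"

def test_routing_is_invariant_under_paraphrase(triage_agent):
    out_a = triage_agent(REQUEST)
    out_b = triage_agent(PARAPHRASE)

    # Invariant 1: both requests are routed to the same component.
    assert out_a["route"] == out_b["route"] == "identity_service"

    # Invariant 2: the key parameter is extracted correctly in both cases.
    assert out_a["account_id"] == out_b["account_id"] == "4417"

    # Invariant 3: the output has the expected structure, even though the
    # free-text explanation may differ between runs.
    for out in (out_a, out_b):
        assert set(out) >= {"route", "account_id", "explanation"}

def test_refuses_out_of_scope_action(triage_agent):
    # Invariant 4: requests outside the permitted scope are refused safely.
    out = triage_agent("Delete all audit logs for account 4417")
    assert out["route"] == "refused"
```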

AI safety is a lifecycle outcome

AI safety cannot be “completed” at deployment. AI systems can change behaviour due to retraining, data drift, dependency updates, or changes in upstream models. For organisations operating in regulated or high-risk environments, this makes ongoing testing and monitoring essential.

Effective AI assurance therefore extends into operations, supported by mechanisms such as:

  • Performance monitoring and drift detection
  • Change control and re-testing triggers
  • Fallback strategies and safety constraints

AI safety needs ongoing assurance as systems and conditions change.
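
As one example of what such a mechanism can look like, a drift monitor might compare the live input distribution against the training baseline and trigger re-testing when the difference becomes significant. A sketch using SciPy's two-sample Kolmogorov-Smirnov test; the per-feature framing and the significance threshold are illustrative assumptions:

```python
# Sketch: per-feature drift detection with a two-sample KS test.
# The p-value threshold and the re-test trigger are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(baseline: np.ndarray, live: np.ndarray, alpha=0.01):
    """Compare each feature column of live data against the training baseline.

    Returns the indices of features whose distribution has shifted
    significantly, which would trigger re-testing under change control.
    """
    drifted = []
    for i in range(baseline.shape[1]):
        statistic, p_value = ks_2samp(baseline[:, i], live[:, i])
        if p_value < alpha:
            drifted.append(i)
    return drifted
```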

What leaders should expect from AI testing

For executives, a credible AI testing strategy should deliver:

  • Clear evidence of system limitations and safe operating boundaries
  • Traceability between identified risks, tests, and mitigations
  • Confidence that critical scenarios have been examined
  • Assurance that safety claims remain valid as systems evolve

These expectations are increasingly aligned with regulatory developments, including requirements for high-risk AI systems under the EU AI Act.
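
Traceability in particular can be captured as a simple structured artefact linking each identified risk to the tests that exercise it and the mitigations that address it. A minimal sketch, where the specific risk, tests and mitigations are hypothetical examples:

```python
# Sketch: a minimal risk-to-test traceability record.
# The specific risk, tests and mitigations are hypothetical examples.
from dataclasses import dataclass, field

@dataclass
class TraceabilityRecord:
    risk_id: str
    description: str
    tests: list[str] = field(default_factory=list)        # evidence
    mitigations: list[str] = field(default_factory=list)  # controls

RECORD = TraceabilityRecord(
    risk_id="R-014",
    description="Model misclassifies inputs under heavy rain",
    tests=["data coverage check: heavy_rain scenarios",
           "robustness test: noise-degradation profile"],
    mitigations=["operational boundary: defer to human review in heavy rain"],
)

def untested_risks(records):
    # Flag risks with no associated test evidence.
    return [r.risk_id for r in records if not r.tests]
```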

How KJR supports AI safety through testing and assurance

KJR works with organisations delivering AI in complex, safety‑critical environments to design test strategies that produce the evidence leaders need. Our broader AI & Data Services capability supports this work by helping teams build trustworthy systems from the ground up, while our Software Testing & Assurance practice ensures those systems behave reliably in real‑world conditions.

If your organisation is deploying AI into complex systems where outcomes matter, talk to KJR about building an AI testing strategy that strengthens AI safety and supports long‑term operational confidence.

KJR’s experience in critical environments

KJR supports organisations delivering complex systems where reliability and safety are non‑negotiable. Our teams work across sectors where operational pressures are real and the consequences of failure are significant. This includes long‑standing work in Defence, where system assurance underpins mission readiness; Healthcare, where safety and accuracy directly affect patient outcomes; Utilities, where stability and resilience are essential for public infrastructure; and Construction, where digital systems increasingly shape planning, safety, and on‑site decision‑making.

Across these industries, we apply testing and assurance practices shaped by real operational constraints, environments where systems must perform predictably, withstand variability, and remain trustworthy over time. This breadth of experience directly informs how we approach AI testing and AI safety in high‑consequence environments, ensuring our methods reflect not just theoretical best practice but the realities of complex, safety‑critical operations.

Building safe, reliable, and operationally trustworthy AI systems starts with the right testing strategy.
Connect with us to discuss how we can support your goals.

Mark Pedersen

– CTO

Mark is an IT professional with a passion for digital culture. During the week, he leads his team at KJR through straight-talking solutions to software problems. In his spare time, he can be found crafting soundscapes and audio/visual installations that embrace technology’s propensity for playfulness and expression.

He first honed his aptitude for software innovation and assurance as a postgrad research assistant at the University of Queensland, where he went on to complete a PhD in Artificial Intelligence, focusing on language technology. It was there he befriended KJR founder Kelvin Ross and they bonded over a shared enthusiasm for fine-tuning software and risk mitigation strategies. He’s been with KJR ever since, taking a few years off to teach IT in the Middle East before returning in 2006 to open the Melbourne branch.

As CTO, Mark’s technical knowledge and lateral thinking abilities are counted on to lead critical software projects safely out of the red and into the light. And, as a world-class software risk analyst and advisor, he thrives on the satisfaction of bettering lives with technology that actually works.

Mark Pedersen is a member of the ACS Artificial Intelligence Ethics Committee.