Reliability Testing

Random failures undermine confidence. In high-volume IT solutions they can create significant handling costs and, in some cases, overwhelm support channels, causing a customer experience crisis. In real-time and embedded systems they can disrupt wider processes, causing shutdowns and loss of service.

Special attention

Reliability requires special attention. It is often neglected because it is hard to address on an ad hoc basis. That failure, or burst of failures, the one that gets reported every few weeks, is always hard to track down. There is never enough evidence at the scene, and the problem cannot be reproduced.

The problem becomes a running sore. The costs of dealing with the aftermath keep coming back, and a loss of confidence can lead to the solution being rejected.

“Hidden” Variability

Reliability issues are hard to prevent and diagnose because of “hidden” variability. Demands that are thought of as “identical” actually have differences that cause the system to behave erroneously, and variability in the state of internal components at the time of execution can change the way genuinely identical demands are processed. From the outside these failures look like purely random events: sometimes it works, sometimes it doesn’t.
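
To make this concrete, here is a minimal, purely hypothetical Python sketch of a handler whose result depends on internal state left behind by earlier demands; every name in it is invented for illustration.

```python
import random

class OrderHandler:
    """Hypothetical service whose result depends on hidden internal state
    (a cache) left behind by earlier, unrelated demands."""

    def __init__(self) -> None:
        self._cache: dict[int, str] = {}  # internal state callers never see

    def handle(self, order_id: int) -> str:
        # If an earlier run left a stale entry behind, this "identical"
        # demand is processed differently and returns the wrong answer.
        if order_id in self._cache:
            return self._cache[order_id]
        if random.random() < 0.01:  # rarely, bad state is left behind
            self._cache[order_id] = "stale"
        return f"processed-{order_id}"

handler = OrderHandler()
for _ in range(5):
    # Five identical demands; from the outside any failure looks random.
    print(handler.handle(42))
```

Run repeatedly, the same call usually succeeds, but occasionally an earlier demand poisons the cache and an “identical” demand fails.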

Latent Danger

Reliability defects are a latent danger that “normal” development and test practices miss. If they are triggered during testing they cannot be reproduced; it is too painful and too time consuming, so they are closed, recorded as “tester errors”, “data issues” or “environmental”. Then, once the system starts to be used in anger, they emerge: suddenly things that were known to work seem to have new issues.

Complexity

Complexity leads to unreliability. Unavoidable complexity comes in many forms, including:

  • Architectural - simple, single-threaded programs will be more reliable than distributed, multi-tier solutions with concurrent activity built from diverse technologies.
  • Solution - the more features and alternative ways of doing things, the more likely it is that something will trip up a downstream operation.
  • Domain - when inputs are highly variable and the handling rules have many conditions and outcomes, “edge cases” have “edge cases” of their own that real-world variation reveals.
  • Algorithmic - even trivial behaviours with simple explanations can be challenging to implement in code and vulnerable to small variations in data or sequence.
  • Timing - once time and relative timing become part of the requirement, a whole new world of opportunity for reliability failures emerges.

Reliability Assurance

Assuring the reliability of a solution involves identifying the reliability risks and their sources, and ensuring they are tackled in design, in implementation and in test. Different systems have different risks and therefore need different approaches: both the nature of the risks and the importance of reliability will shape the activities undertaken to address them.

Reliability Testing Solutions

The basic technique of reliability testing is repetition combined with careful observation. It has some similarities with load testing, but it is not the same thing; the two can be combined to increase the chance of finding some types of issue. Testing solutions need to be designed to cover the risks associated with the specific system under test.

At the heart of reliability testing lies repetition. Similar things are done many times and the outcome is validated. In its simplest form, demands that are identical (apart from data that cannot be used twice) are fired one at a time into the system and the functional correctness of each outcome is checked. Reliability is derived from the proportion of successful outcomes. A variation applies small deltas to the demand, while maintaining its fundamental nature, to look for edge-case failures. Historic “dirty data” risk can be addressed by executing demands against a wide range of real-world data migrated into a test system.
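
As an illustration of the basic repetition loop, here is a minimal Python sketch; submit_demand and check_outcome are hypothetical stand-ins for the real call into the system under test and the real functional check.

```python
import uuid

def submit_demand(payload: dict) -> dict:
    # Hypothetical stand-in for the system under test (e.g. an HTTP call).
    return {"status": "OK", "reference": payload["reference"]}

def check_outcome(response: dict) -> bool:
    # Hypothetical functional check of the outcome.
    return response.get("status") == "OK"

def reliability_run(runs: int = 1000) -> float:
    """Fire near-identical demands one at a time and return the
    proportion of functionally correct outcomes."""
    successes = 0
    for _ in range(runs):
        # Identical apart from data that cannot be used twice,
        # here a unique reference.
        payload = {"reference": str(uuid.uuid4()), "amount": 100}
        if check_outcome(submit_demand(payload)):
            successes += 1
    return successes / runs  # observed reliability

print(f"Observed reliability: {reliability_run():.2%}")
```

The delta and dirty-data variations fit the same loop: only the construction of the payload changes.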

Increased test rigour can be achieved using techniques such as running transactions concurrently, using a blended mix of transactions, and operating the system under load at the same time. Ultimately a full-scale operational simulation could be built. However, it is important to get the balance right: the cost and time required to develop and execute more sophisticated reliability tests can be significant, so the additional reliability risk they address has to justify the investment.
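
Building on the sketch above, and reusing its hypothetical submit_demand and check_outcome stubs, a concurrent variant might look like this; the repetition is unchanged, but demands now overlap in time.

```python
from concurrent.futures import ThreadPoolExecutor

def run_one(i: int) -> bool:
    # Reuses the hypothetical submit_demand/check_outcome stubs above.
    payload = {"reference": f"demand-{i}", "amount": 100}
    return check_outcome(submit_demand(payload))

def concurrent_reliability_run(runs: int = 1000, workers: int = 20) -> float:
    """The same repetition idea, but with demands overlapping in time so
    that concurrency-related defects have a chance to surface."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_one, range(runs)))
    return sum(results) / runs

print(f"Reliability under concurrency: {concurrent_reliability_run():.2%}")
```

In a real harness the worker count and transaction mix would be tuned to the specific risks being targeted.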

Track Record

Our experience with embedded industrial control systems has made us “reliability aware”. Unlike many organisations we do not equate non-functional testing with performance testing. We instinctively look at the reliability risk in a system.

Over the years we have built reliability testing solutions for IT systems and control systems. This has included:

  • Testing the architectural complexity risks in a telecommunications order management and provisioning chain.
  • Testing the reliability of web solutions.
  • Building a simulation based test solution for critical embedded software functions.
  • Testing the functional reliability of process control devices.

If you need an innovative approach to a specific reliability threat, or have a more general concern around reliability, get in touch and we can discuss how to solve your problem.