Reliability requires special attention: it is often neglected, and it is hard to address on an ad hoc basis. That failure, or burst of failures, reported every few weeks, is always hard to track down. There is never enough evidence at the scene, and the problem cannot be reproduced.
The problem becomes a running sore: the costs of dealing with the aftermath keep coming back, and a loss of confidence can lead to the solution being rejected.
Reliability issues are hard to prevent and diagnose because of “hidden” variability. Demands that are thought of as “identical” actually have differences that cause the system to behave erroneously, and variability in the state of internal components at the time of execution can change the way identical demands are processed. Seen from the outside, these failures look like purely random events: sometimes it works, sometimes it doesn’t.
Reliability defects are a latent danger that “normal” development and test practices miss. If they are triggered during testing they cannot be reproduced; doing so is too painful and too time-consuming, so they are closed, recorded as “tester errors”, “data issues” or “environmental”. Then, once the system starts to be used in anger, they emerge, and things that were known to work suddenly seem to have new issues.
Complexity leads to unreliability, and unavoidable complexity takes many forms.
Assuring the reliability of a solution involves identifying the reliability risks and their sources, and ensuring they are tackled in design, in implementation and in test. Different systems have different risks and therefore need different approaches; both the nature of the risks and the importance of reliability affect the activities undertaken to address them.
The basic technique of reliability testing is repetition combined with careful observation. It has some similarities with load testing but is not the same thing, although the two can be combined to increase the chance of finding some types of issue. Test solutions need to be designed to cover the risks associated with the system under test.
At the heart of reliability testing lies repetition: similar things are done many times and each outcome is validated. In its simplest form, identical demands (identical except where data cannot be used twice) are fired one at a time into the system and the functional correctness of each outcome is checked. Reliability is derived from the proportion of successful outcomes. A variation applies small deltas to the demand, while maintaining its fundamental nature, to look for edge-case failures. Historic “dirty data” risk can be addressed by executing demands against a wide range of real-world data migrated into a test system.
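As a minimal sketch of that repetition loop, assuming hypothetical `execute_demand` and `validate` functions standing in for the real system under test and its outcome checks:

```python
# Hypothetical stand-ins for the system under test: execute_demand sends one
# demand into the system; validate checks the functional correctness of the outcome.
def execute_demand(demand):
    return {"status": "ok", "payload": demand}

def validate(outcome):
    return outcome.get("status") == "ok"

def reliability_run(demand, repetitions):
    """Fire the same demand repeatedly; return the proportion of successful outcomes."""
    successes = 0
    for _ in range(repetitions):
        if validate(execute_demand(demand)):
            successes += 1
    return successes / repetitions

def delta_runs(base_demand, deltas):
    """Apply small named variations to a base demand, preserving its fundamental
    nature, to probe for edge-case failures. Returns {delta_name: passed?}."""
    return {
        name: validate(execute_demand({**base_demand, **delta}))
        for name, delta in deltas.items()
    }

print(reliability_run({"order_id": 42}, repetitions=1000))
print(delta_runs({"order_id": 42}, {"zero_qty": {"quantity": 0}}))
```

With real stand-ins replaced by calls into the actual system, the reliability figure is simply the observed success proportion over the run.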
Increased test rigour can be achieved using techniques such as running transactions concurrently, using a blended mix of transactions, and operating the system under load at the same time. Ultimately a full-scale operational simulation could be built. However, it is important to get the balance right: the cost and time required to develop and execute more sophisticated reliability tests can be significant, so the additional reliability risk they address has to justify them.
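One way to sketch the concurrent, blended-mix step, again with a hypothetical `run_transaction` standing in for the system under test:

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in: processes one transaction of the given kind and
# returns True if its outcome validates.
def run_transaction(kind, payload):
    return payload is not None

def blended_concurrent_run(mix, total, workers=8, seed=0):
    """Execute a weighted blend of transaction kinds concurrently.

    mix: {kind: weight} describing the blend, e.g. {"create": 3, "query": 7}.
    Returns the proportion of transactions that validated successfully.
    """
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    kinds = rng.choices(list(mix), weights=list(mix.values()), k=total)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda k: run_transaction(k, {"kind": k}), kinds))
    return sum(results) / total

print(blended_concurrent_run({"create": 3, "query": 7}, total=200))
```

Seeding the mix makes an otherwise non-deterministic blend replayable, which matters precisely because reliability failures are so hard to reproduce.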
Our experience with embedded industrial control systems has made us “reliability aware”. Unlike many organisations we do not equate non-functional testing with performance testing. We instinctively look at the reliability risk in a system.
Over the years we have built reliability testing solutions for both IT systems and control systems.
If you need an innovative approach to a specific reliability threat, or have a more general concern around reliability, get in touch and we will discuss how we can solve your problem.