Reliability, Redundancy and Backup Systems

When designing and building a complex space system, especially for crewed spaceflight, one aims for a reliability that is "as high as possible". In the NASA Commercial Crew Program, the required probability of loss of crew is set to 1/270 for the overall mission, with a 1/1000 probability for both the ascent and descent phases. The allowed probability of loss of mission is higher and set to 1/52. There is a significant difference between loss of mission and loss of crew. When the launch escape system is activated, as was the case with Soyuz MS-10, the mission is lost but the crew is not. This brings us to the distinction between three terms: reliability, redundancy and backup systems.
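The gap between the two requirements can be made concrete with a little arithmetic. The sketch below treats the quoted requirement values as if they were actual probabilities and assumes that every loss of crew also counts as a loss of mission; both are simplifying assumptions for illustration, not statements from the requirement itself. The function name is hypothetical.

```python
from fractions import Fraction

# Requirement values quoted in the text above.
P_LOSS_OF_CREW = Fraction(1, 270)
P_LOSS_OF_MISSION = Fraction(1, 52)


def crew_loss_given_mission_loss(p_loc, p_lom):
    """P(loss of crew | loss of mission).

    Assumption: every loss of crew is also a loss of mission,
    so P(crew lost and mission lost) equals P(crew lost).
    """
    if p_loc > p_lom:
        raise ValueError("loss of crew cannot be more likely than loss of mission")
    return p_loc / p_lom


p_cond = crew_loss_given_mission_loss(P_LOSS_OF_CREW, P_LOSS_OF_MISSION)
print(f"P(crew lost | mission lost)     = {p_cond} ~ {float(p_cond):.3f}")
print(f"P(crew survives | mission lost) ~ {float(1 - p_cond):.3f}")
```

Under these assumptions, roughly four out of five lost missions would still have to return the crew safely, which is exactly the role of abort and backup systems.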

Reliability is the overall probability that the system performs its mission. The higher the reliability, the higher the odds that a mission is performed successfully. High reliability can be achieved by keeping a system simple, performing many frequent and relevant tests, and including redundancy and backup systems where necessary. Generally, the simpler a system is, the less can go wrong. In addition, the more one reuses old, flight-proven hardware, the higher the reliability of that piece of hardware tends to be. It could be said that a properly engineered product is much like a good wedding: it includes something old, something new, and something borrowed.
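Why simplicity helps can be illustrated with the standard series-reliability model: if every component must work and failures are independent, the system reliability is the product of the component reliabilities. The component value of 0.99 and the component counts below are made-up numbers for illustration only.

```python
import math


def series_reliability(component_reliabilities):
    """Reliability of a chain in which every component must work.

    Assumption: component failures are independent of each other.
    """
    for r in component_reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"reliability must be within [0, 1], got {r}")
    return math.prod(component_reliabilities)


# Hypothetical systems built from equally reliable components.
simple_system = [0.99] * 5
complex_system = [0.99] * 50

print(f"5 components:  {series_reliability(simple_system):.3f}")
print(f"50 components: {series_reliability(complex_system):.3f}")
```

With identical parts, the system with ten times as many components in series ends up far less reliable, even though no individual component got worse.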

An example of this is the entry, descent and landing (EDL) system of the Mars 2020 rover. The system uses parachute flight data from the very first Mars landers and the PEPP project (something old), the sky crane borrowed from the Curiosity rover (something borrowed), and a new, updated landing system for a more accurate landing (something new). Another method for increasing reliability is frequent testing. The SpaceX Crew Dragon parachute system, for example, went through a large number of tests to prove that the system worked. Testing is a way to gather flight-like experience with hardware that has not yet flown.

Redundancy is the addition of extra hardware to remove a single point of failure. This can be done on any scale in a project. Redundancy can be added at PCB level, where the detection and actuation circuits are duplicated so that when one line fails, the other can take over. Redundant systems are in use during the flight. An example is the flight computers of the X-29 forward-swept-wing jet. The aircraft had three redundant digital computers and three redundant backup analogue computers. Each of the computers could fly the aircraft, but the redundancy allowed for an internal check. Forward-swept-wing aircraft are unstable and thus require many corrections by the computer. The X-29 made about 40 corrections per second. This is a case where redundancy makes sense: if the computer is wrong for just a second, the mission is lost. Having three computers allows for voting, and thus for filtering out a faulty computer. Another example of redundancy is a parachute cluster. Most clusters have one-parachute-out capability, meaning that a safe landing is still possible even if one canopy fails.
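Both examples can be sketched in a few lines. The voter below works on discrete commands and is only an illustration of the principle; it is not the X-29's actual voting logic, and real flight computers typically compare continuous values within a tolerance. The reliability function is the standard k-out-of-n formula for identical, independent units, which covers both two-out-of-three voting and a three-canopy cluster with one-parachute-out capability. The per-unit reliability of 0.99 is a made-up number.

```python
from collections import Counter
from math import comb


def majority_vote(outputs):
    """Return the value reported by a strict majority of channels,
    or None when no strict majority exists."""
    if not outputs:
        raise ValueError("at least one channel output is required")
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def k_of_n_reliability(k, n, r):
    """P(at least k of n units work) for identical, independent units
    that each work with probability r."""
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))


# One faulty channel is outvoted by the two healthy ones.
print(majority_vote([12, 12, 47]))

# Two-out-of-three arrangement versus a single unit of the same quality.
print(f"single unit: {0.99:.6f}")
print(f"2-out-of-3:  {k_of_n_reliability(2, 3, 0.99):.6f}")
```

Note that the independence assumption matters: a design flaw shared by all three channels, or a failed canopy that damages its neighbours, is not captured by this formula.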

Backup systems are systems that are present but are not used during a regular flight. One example is the backup engines on board the European Service Module (ESM) of Orion. The ESM has one main engine in the centre of the module and eight smaller auxiliary engines. These engines can take over the tasks of the main engine in case of a failure, so that the crew can still be brought back safely. Another example is the backup parachute of the Soyuz capsule. The backup parachute is smaller than the main parachute, so the landing velocity is higher. But when the choice is between a harder landing and a crash, the preference is clear.

Redundancy and backup systems are not free methods of increasing reliability. In some cases, they can even decrease the reliability of the overall system. One example is the Soyuz 23 mission, which landed in Lake Tengiz (Теңіз көлі). Due to a short in the electrical system, the backup parachute was deployed. The backup parachute and the other parachutes soaked up water and dragged the capsule down. The capsule might not have remained afloat with just the main parachute deployed either, but the backup parachute definitely did not help. The second drawback of redundancy and backup systems is the additional mass, cost and complexity. An engineer should be very careful when adding either redundancy or backup systems, and should carefully trade off whether the addition is necessary and worth it. As with anything in systems engineering, ask yourself what the requirements are. If the requirements call for a reliability of 0.95, then a system designed for a reliability of 0.99 is overdesigned.
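This trade-off can be sketched with a deliberately simple model. It assumes that the backup takes over whenever the main system fails, that failures are independent, and that there is a small probability `p_spurious` that the backup activates when it is not needed and thereby causes a loss. The model, the function names and all numbers are illustrative assumptions; mass and cost are not modelled at all.

```python
def reliability_with_spurious_backup(r_main, r_backup, p_spurious):
    """System reliability with a backup that can also activate by mistake.

    Assumptions: independent failures, perfect switch-over on a main
    failure, and a spurious activation always leads to a loss.
    """
    covered = r_main + (1.0 - r_main) * r_backup
    return (1.0 - p_spurious) * covered


def assess_backup(r_main, r_backup, p_spurious, requirement):
    """Classify a proposed backup against a reliability requirement."""
    r_with = reliability_with_spurious_backup(r_main, r_backup, p_spurious)
    if r_main >= requirement:
        return "not needed"    # main system alone already meets the requirement
    if r_with <= r_main:
        return "harmful"       # spurious activations outweigh the benefit
    if r_with >= requirement:
        return "justified"
    return "insufficient"      # helps, but the requirement is still not met


print(assess_backup(r_main=0.96, r_backup=0.90, p_spurious=0.001, requirement=0.95))
print(assess_backup(r_main=0.93, r_backup=0.90, p_spurious=0.001, requirement=0.95))
print(assess_backup(r_main=0.99, r_backup=0.90, p_spurious=0.02, requirement=0.995))
```

The three calls illustrate the three situations from the text: a backup that the requirements do not ask for, a backup that earns its place, and a backup whose failure modes make the overall system worse.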

Let's take a look at a case study: the Apollo parachute actuation system. A schematic can be found in the figure below. The top of the schematic shows the drogue parachute actuation. There are four baroswitches that can trigger an actuation. Besides the automatic switches, there is a crew backup switch. In other words, the baroswitches are implemented redundantly, and the manual override is a backup system. The heat shield and the two mortar systems are then actuated using redundant firing lines, so if a single line breaks, the system will still work. On the main parachute side, one can see a similar system, but with three mortars.
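The logic described here can be written down as a small sketch. It follows only the description in the text: any of the four baroswitches, or the crew switch, can command the drogue actuation, and each pyrotechnic device is fed by two firing lines. The actual circuit in NASA TN D-7437 may combine the switches differently (for example to guard against a premature closure), so the functions below are a hypothetical simplification, and the firing-line reliability of 0.99 is a made-up number.

```python
def drogue_fire_command(baroswitches_closed, crew_switch_closed):
    """True when drogue actuation is commanded.

    Simplifying assumption: any single closed baroswitch is sufficient,
    and the crew switch acts as an independent manual backup.
    """
    return any(baroswitches_closed) or crew_switch_closed


def parallel_reliability(r_line, n_lines=2):
    """Probability that at least one of n independent firing lines works."""
    return 1.0 - (1.0 - r_line) ** n_lines


# Three baroswitches failed open, one works: actuation is still commanded.
print(drogue_fire_command([False, False, True, False], crew_switch_closed=False))

# All automatic switches failed: the crew backup switch still fires the drogues.
print(drogue_fire_command([False] * 4, crew_switch_closed=True))

# Two redundant firing lines versus a single line of the same quality.
print(f"single line: {0.99:.4f}, redundant pair: {parallel_reliability(0.99):.4f}")
```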

Figure from: NASA TN D-7437