Have Redundancy, and Have a Backup

During toolbox meetings, I used to tell my team:

Rule number one, Have Redundancy.
Rule number two, Have a Backup.

This would sometimes be met with a light laugh from new participants, but to those who had been operating for a while it was a well-known saying. To a number of people the two rules sounded like the same thing: surely redundancy was the same as a backup, hence the laughs.

However, there is a subtle but crucial difference between the two. Better still, the distinction applies to operational concerns (such as flying drones) as well as technical areas (such as IT and technology development).

Redundancy: The First Line of Defence

Redundancy refers to the inclusion of extra components or systems that are not strictly necessary to functioning but are there to take over in case the primary component fails. Think of it as having multiple layers of protection. For example:

  • Operating Drones: When flying drones, having multiple propellers ensures that if one fails, the drone can still remain airborne and be safely landed, with a new prop being installed quickly.
  • Redundant Servers: In a data center, having multiple servers that can take over if one fails ensures that the system remains operational.
  • Redundant Power Supplies: In critical systems, having more than one power supply can prevent downtime in case one fails.
  • RAID: Some of the physical disks hold copies, or partial copies, of the data, so if a single disk fails, the remaining disks can continue to serve it.

Redundancy is about immediate failover. It ensures that there is no single point of failure and that operations can continue seamlessly even if one component fails.
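As a rough illustration of immediate failover, the sketch below tries a list of redundant components in order and uses the first one that responds. The names `call_with_failover`, `primary` and `secondary` are hypothetical stand-ins, not part of any real system, and a production setup would normally rely on a load balancer or cluster manager rather than a loop like this.

```python
from typing import Callable, Sequence


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each redundant replica in order; return the first successful result."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:
            # This replica failed, so remember why and move on to the next one.
            errors.append(exc)
    # Only reached when every replica has failed.
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


# Hypothetical components: the primary is down, the secondary is healthy.
def primary() -> str:
    raise ConnectionError("primary server is down")


def secondary() -> str:
    return "served by secondary"


result = call_with_failover([primary, secondary])
print(result)
```

The caller never sees the primary's failure, which is the point of redundancy: the operation continues without a restore step.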

Backup: The Safety Net

Backup, on the other hand, is about having a copy of data or a system that can be restored in case of failure or loss. It is not about immediate failover but about recovery. For example:

  • Operating Drones: Having a secondary drone ready to deploy in case the primary drone fails ensures that the mission can continue with minimal disruption.
  • Data Backups: Regularly saving copies of data to a separate location ensures that you can restore it in case of data corruption or loss.
  • System Backups: Creating images of entire systems that can be restored in case of a catastrophic failure.
  • Alternate System: A secondary system that allows the service to keep being delivered, possibly at the expense of some functions. This may be a smaller system or an identical one.

Backups are about long-term recovery. They ensure that even if something goes wrong, you can restore your system or data to a previous state.
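By contrast, recovery from a backup is a deliberate restore step rather than an automatic takeover. The minimal sketch below, using only the Python standard library, copies a file to a separate directory and later restores it after the original is damaged. The file name `mission_log.txt` and the helpers `back_up` and `restore` are assumptions made for illustration. A real backup would live on separate hardware or at a separate site, not in the same temporary folder.

```python
import shutil
import tempfile
from pathlib import Path


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy the source file into backup_dir and return the path of the copy."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / (source.name + ".bak")
    shutil.copy2(source, target)
    return target


def restore(backup: Path, destination: Path) -> None:
    """Overwrite the destination with the contents of the backup copy."""
    shutil.copy2(backup, destination)


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    data = work / "mission_log.txt"
    data.write_text("flight 1 complete\n")

    backup = back_up(data, work / "backups")

    # Simulate data corruption after the backup was taken.
    data.write_text("corrupted")

    # Recovery: bring the file back to its previous state.
    restore(backup, data)
    restored_text = data.read_text()

print(restored_text)
```

Anything written after the backup was taken is lost on restore, which is why backup frequency matters.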

Why Both Are Essential

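While redundancy provides immediate protection and ensures continuity, backups provide a way to recover from failures that redundancy cannot protect against. An accidental deletion, for example, is faithfully copied to every mirrored disk, and only a backup can bring the data back. Here’s why both are essential: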

  • Redundancy ensures that operations continue without interruption.
  • Backups ensure that you can recover from data loss or system failures.

In essence, redundancy is your first line of defence, while backups are your safety net. Together, they provide a comprehensive strategy for risk management and operational continuity.

This requires users to pause and think ahead, to ensure that the system they are deploying provides both, or that the risk of going without is being consciously accepted.

Another critical factor, which comes into play more when designing the technology (or the architecture), is alerting. If the system cannot tell you that something has failed, that it is running in failover mode, or that its redundancy has been activated, then it does not have the intended redundancy or backup. How else would you know to remediate the system?
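One way to picture this is a failover routine that reports every failure and every switch to a standby. The sketch below is illustrative only: `call_with_alerting` and the component names are hypothetical, and the `alert` callback simply appends to a list where a real system might page an operator or raise a monitoring event.

```python
from typing import Callable, List, Sequence, Tuple

# A replica is a (name, callable) pair so that alerts can say what failed.
Replica = Tuple[str, Callable[[], str]]


def call_with_alerting(replicas: Sequence[Replica],
                       alert: Callable[[str], None]) -> str:
    """Fail over between replicas, alerting on each failure and each switch."""
    for index, (name, replica) in enumerate(replicas):
        try:
            result = replica()
        except Exception as exc:
            alert(f"{name} failed: {exc}")
            continue
        if index > 0:
            # We are no longer on the primary, so someone needs to know.
            alert(f"now running on {name}; redundancy is degraded")
        return result
    raise RuntimeError("all replicas failed")


# Hypothetical components for the demonstration.
def failing_primary() -> str:
    raise OSError("power supply fault")


def healthy_secondary() -> str:
    return "ok from secondary"


alerts: List[str] = []
result = call_with_alerting(
    [("primary", failing_primary), ("secondary", healthy_secondary)],
    alert=alerts.append,
)
print(result)
print(alerts)
```

Without the alerts, the service would look healthy right up until the secondary also failed, with nobody aware that the spare had already been used.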