Defining the Common SPOF: A Critical Risk for Multiple Dependent Systems

A Single Point of Failure (SPOF) is a device, system, or process whose failure impacts numerous downstream systems. In general terms, it is the one thing that brings everything to a halt when it fails. It is generally impossible to remove every SPOF, but we often unwittingly have a common SPOF that multiple dependent services rely on, which makes the impact of a failure significantly larger.

In IT, this is commonly network equipment such as a switch, router, or cable. It is very common to have a chain of devices or systems that are each a single point of failure; together they can be viewed as a Single Point of Failure Chain (SPOFC). Typical risk management advice is to add a redundant data link or a fail-over, so that you can continue working when the data connection (assumed to be the most likely link in the chain to fail) goes down. That is the most common failure I see with clients in remote and regional areas who use a copper-based data connection. For clients with high-speed fibre, however, the fibre is often the most reliable part of the chain. Adding the second link is then usually seen as “we have completed our risk management”, which can lead to checklist risk management.

Consider the following simplified diagram (loosely based on a real deployment). Two NBN connections feed into a top-of-rack switch, which feeds into a router; that router feeds into a second switch, which connects the servers and a NAS. The servers run a variety of services.

graph TD
    A[NBN FTTP 1]:::cloud
    C[Router 1]:::router
    H[Switch 1]:::switch
    I[Router 2]:::router
    D[Switch 2]:::switch
    E[Server 1]:::server
    F[Server 2]:::server
    G[NAS]:::nas
    O[NBN FTTP 2]:::cloud
    P[Server 3]:::server
    Q[Server 4]:::server

    A --> H
    O --> H
    H --> C
    H --> I
    C --> D

    D --> E
    D --> F
    D --> G
    D --> P
    D --> Q

    E --> J[IIS]:::service
    E --> K[SQL]:::service
    F --> L[Timesheets]:::service
    F --> M[Accounting Package]:::service
    F --> N[CRM]:::service
    P --> R[PBX]:::service
    Q --> S[Backup Services]:::service

    E -.-> F

    classDef cloud fill:#f9f,stroke:#333,stroke-width:2px;
    classDef router fill:#bbf,stroke:#333,stroke-width:2px;
    classDef switch fill:#bfb,stroke:#333,stroke-width:2px;
    classDef server fill:#ffb,stroke:#333,stroke-width:2px;
    classDef nas fill:#fbb,stroke:#333,stroke-width:2px;
    classDef service fill:#fff,stroke:#333,stroke-width:1px;

If we assume that all users accessing this system are external to the network (i.e. they connect to the servers and services via the internet and are not on the local network), we can evaluate what the impact of a single point of failure would look like. But first, let’s simplify this diagram by looking at what contributes to a Single Point of Failure Chain.

Because both NBN FTTP (National Broadband Network, Fibre to the Premises) connections share the same underlying infrastructure (even though each may have a different ISP on top), we can treat them as a single link in this chain: a failure of the NBN carrier has the same effect regardless of the ISP.
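
To illustrate why two links on shared infrastructure count as one, here is a minimal sketch that compares two ways of modelling the pair. The availability figures and all names (`SHARED_ACCESS`, `ISP_LAYER`, `series`, `parallel`) are hypothetical and chosen only for illustration, and the model assumes that failures are otherwise independent.

```python
def series(*availabilities):
    """Availability when every component must work (a chain)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Availability when any one component is enough (independent redundancy)."""
    all_fail = 1.0
    for a in availabilities:
        all_fail *= 1.0 - a
    return 1.0 - all_fail


# Hypothetical figures, not measured values.
SHARED_ACCESS = 0.999  # the NBN access network common to both links
ISP_LAYER = 0.995      # each ISP's own service on top of it

# Optimistic model: pretend the two links have nothing in common.
link = series(SHARED_ACCESS, ISP_LAYER)
independent_model = parallel(link, link)

# Shared model: only the ISP layers are redundant; the access network is not.
shared_model = series(SHARED_ACCESS, parallel(ISP_LAYER, ISP_LAYER))

print(f"treated as independent: {independent_model:.6f}")
print(f"with shared access:     {shared_model:.6f}")
```

Under these assumptions the second link can never lift availability above that of the shared access network, which is why both connections are collapsed into one node in the chain.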

Because the system is accessed from outside the network, a user who needs the Timesheets or the PBX must traverse NBN -> Switch 1 -> Router 1 -> Switch 2 -> Server X. We can simplify this and ignore the second router. The updated diagram would look like this:

Grouping

graph TD
    subgraph SPOFC
        A[NBN FTTP 1]:::cloud
        O[NBN FTTP 2]:::cloud
        H[Switch 1]:::switch
        C[Router 1]:::router
        D[Switch 2]:::switch
    end

    E[Server 1]:::server
    F[Server 2]:::server
    G[NAS]:::nas
    P[Server 3]:::server
    Q[Server 4]:::server

    A --> H
    O --> H
    H --> C
    C --> D

    D --> E
    D --> F
    D --> G
    D --> P
    D --> Q

    E --> J[IIS]:::service
    E --> K[SQL]:::service
    F --> L[Timesheets]:::service
    F --> M[Accounting Package]:::service
    F --> N[CRM]:::service
    P --> R[PBX]:::service
    Q --> S[Backup Services]:::service

    E -.-> F

    classDef cloud fill:#f9f,stroke:#333,stroke-width:2px;
    classDef router fill:#bbf,stroke:#333,stroke-width:2px;
    classDef switch fill:#bfb,stroke:#333,stroke-width:2px;
    classDef server fill:#ffb,stroke:#333,stroke-width:2px;
    classDef nas fill:#fbb,stroke:#333,stroke-width:2px;
    classDef service fill:#fff,stroke:#333,stroke-width:1px;

graph TD
    D[SPOFC]:::spofc
    E[Server 1]:::server
    F[Server 2]:::server
    G[NAS]:::nas
    P[Server 3]:::server
    Q[Server 4]:::server

    D --> E
    D --> F
    D --> G
    D --> P
    D --> Q

    E --> J[IIS]:::service
    E --> K[SQL]:::service
    F --> L[Timesheets]:::service
    F --> M[Accounting Package]:::service
    F --> N[CRM]:::service
    P --> R[PBX]:::service
    Q --> S[Backup Services]:::service

    E -.-> F

    classDef spofc fill:#f00,stroke:#333,stroke-width:2px;
    classDef server fill:#ffb,stroke:#333,stroke-width:2px;
    classDef nas fill:#fbb,stroke:#333,stroke-width:2px;
    classDef service fill:#fff,stroke:#333,stroke-width:1px;

It is now pretty clear that a failure of the SPOFC will impact all of the services. With this clear view, there are several options to evaluate, taking into consideration:

- Criticality of the services,
- Availability requirements,
- Impact to stakeholders of the services,
- Severity of a failure on the stakeholders,
- Company risk profile,
- Risk management, and
- Monetary or budgetary constraints.

The options for mitigation at a high-level include:

- Accept the failure mode and leave the design as is.
- Relocate the services so that they do not share a common SPOFC:
  - Physically change the network topology.
  - Physically move the services to another location so they are not dependent on the same chain.
  - Evaluate whether the services could be taken on by a SaaS vendor.
- Reduce the length of the chain, and therefore the likelihood of a failure in the chain.
- Procure more reliable devices to install in the chain (expensive, and in small deployments may yield no better results).
- Configure the systems in high-availability (HA) modes.
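
The effect of chain length on the likelihood of failure can be sketched with simple arithmetic. The example below assumes that every device in the chain has the same availability and fails independently; the figure of 0.999 and the names `DEVICE_AVAILABILITY` and `chain_availability` are hypothetical, not values from the deployment above.

```python
# Hypothetical per-device availability; not a measured figure.
DEVICE_AVAILABILITY = 0.999

HOURS_PER_YEAR = 24 * 365


def chain_availability(length, per_device=DEVICE_AVAILABILITY):
    """Availability of `length` devices in series, assuming independent failures."""
    return per_device ** length


# Compare the four-device chain with shorter alternatives.
for length in (4, 2, 1):
    availability = chain_availability(length)
    hours_down = (1 - availability) * HOURS_PER_YEAR
    print(f"{length} device(s): {availability:.4%} available, "
          f"about {hours_down:.1f} hours down per year")
```

Because the availabilities multiply, every device removed from the chain raises the availability of the whole path, even when the remaining devices are no more reliable than before.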

I am a big fan of first looking at modes of distribution, such as handing sections over to a SaaS vendor or using other services (SharePoint/OneDrive) to distribute files, as a first pass to reduce the severity of a failure, and then looking at reducing the chain length.

A critical takeaway is that you need to evaluate where the users are in relation to the services they are accessing. If they are on the same network and connected to Switch 2, then there is only a single device in the Single Point of Failure Chain.
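
The dependence of the chain on user location can be checked mechanically. The sketch below transcribes the links from the first diagram, merges the two NBN connections into one `NBN` node, and removes each device in turn to see whether a user can still reach a server. The `Internet` and `LAN` entry nodes and the function names are assumptions added for illustration; they do not appear in the diagram.

```python
from collections import deque

# Downstream links from the first diagram. "Internet" and "LAN" are assumed
# entry points for external and local users respectively.
LINKS = {
    "Internet": ["NBN"],
    "LAN": ["Switch 2"],
    "NBN": ["Switch 1"],
    "Switch 1": ["Router 1", "Router 2"],
    "Router 1": ["Switch 2"],
    "Router 2": [],
    "Switch 2": ["Server 1", "Server 2", "NAS", "Server 3", "Server 4"],
}


def reachable(start, target, removed):
    """Breadth-first search that treats `removed` as a failed device."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in LINKS.get(node, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def spof_chain(start, target):
    """Devices whose individual failure cuts `start` off from `target`."""
    candidates = [node for node in LINKS if node not in (start, target)]
    return [node for node in candidates if not reachable(start, target, node)]


print("External user to Server 3:", spof_chain("Internet", "Server 3"))
print("Local user to Server 3:   ", spof_chain("LAN", "Server 3"))
```

For the external user the result contains NBN, Switch 1, Router 1, and Switch 2, while for the local user it contains only Switch 2. Router 2 never appears because nothing downstream depends on it in this diagram.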