Incident response & mitigation#
Our technology mitigates issues instantly and automatically. We don’t wait for customers to report problems to us – we report solutions and live mitigations to them.
By embracing the notion that everything fails, our technology targets those failures head on.
Because we have engineered our way out of traditional networking problems, we have had to re-engineer the rules on incident priority levels and response times.
These failures no longer represent the urgency and impact they used to, so traditional priority levels become meaningless.
Mitigation times are low, and so is the time to initial contact.
Unless a major service event is declared, responding to failures is therefore always our highest priority.
| Priority | Failure event | Mitigation | Initial contact |
| --- | --- | --- | --- |
| Priority 1 | Component link failure – complete service outage of any component circuit | <300ms | <600s |
| Priority 1 | Hardware failure in HA – complete service outage of primary EVX in High Availability cluster | <25s | <600s |
| Priority 1 | Datacentre failure – complete service outage of primary EVX core router or entire datacentre | <60s | <600s |
| Priority 1 | DDoS attack – Distributed Denial of Service attack on network infrastructure | - | <600s |
| Priority 1 | Hardware failure (single CPE) – complete service outage of EVX | - | <600s |
| Priority 1 | Complete site outage – complete site service outage | - | <600s |
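Sub-second mitigation figures like those above imply continuous health probing of every component link. As an illustration only (the probe interval, miss threshold, and function names are assumptions, not the vendor's actual implementation), a detector meeting a 300ms budget could probe each link on a short interval and fail over after a few consecutive misses:

```python
import time

# Hypothetical parameters: probe every 50ms and declare failure after
# 3 consecutive missed probes, giving a worst-case detection time of
# roughly 200ms -- inside the <300ms mitigation figure quoted above.
PROBE_INTERVAL_S = 0.05
MISS_THRESHOLD = 3

def monitor_link(probe, on_failure):
    """Run heartbeat probes until the link is declared failed.

    `probe` returns True if the link answered, False on a miss.
    `on_failure` triggers the mitigation (e.g. steering traffic onto
    the surviving links in the bonded set).
    """
    misses = 0
    while True:
        if probe():
            misses = 0  # a single good probe resets the counter
        else:
            misses += 1
            if misses >= MISS_THRESHOLD:
                on_failure()
                return
        time.sleep(PROBE_INTERVAL_S)
```

The consecutive-miss counter is what trades detection speed against false positives: a shorter interval or lower threshold detects failure faster but risks failing over on a single lost probe.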
Fix times for components vary depending on the issue. Link failures depend on technology type (Ethernet, cellular, broadband), carrier and location. Some faults can run into days or weeks depending on severity or complexity, but most are fixed in hours or days.
Datacentre issues are usually fixed within minutes or hours. However, because of our preventative mindset, we don’t rush connections back to a restored datacentre immediately after vectoring customer traffic away from a DC failure. Instead, we monitor its performance and perform a cautious, incremental return of connections to their primary DC. This could be days later depending on the original issue.
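The cautious return described above is essentially a staged ramp with health checks between stages. A minimal sketch of that idea (the batch size and the `healthy()` check are illustrative assumptions, and in practice each pause could span hours or days):

```python
def staged_return(connections, move_to_primary, healthy, batch_size=10):
    """Move connections back to a restored primary DC in batches,
    re-checking the DC's health before each batch; hold the ramp
    if the DC degrades again.
    """
    moved = []
    for i in range(0, len(connections), batch_size):
        if not healthy():
            return moved  # keep remaining traffic on the backup DC
        batch = connections[i:i + batch_size]
        for conn in batch:
            move_to_primary(conn)
        moved.extend(batch)
    return moved
```

The key design point is that the health check runs between batches, so a relapse in the restored datacentre strands only the connections already returned, not the whole estate.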
Hardware failures are swapped out on a next business day service as standard, with HA EVX pairs able to ensure customer connectivity is maintained at all times.
Our focus is to have customers choose their level of risk and bandwidth so they can survive component failures. Our technology is fault tolerant by design.
This allows us to fix all component issues with the same mindset – as soon as possible. How important a site or SD-WAN instance is remains the customer’s decision.
When every connection is installed, we make sure it’s working optimally, and then benchmark it – something ISPs just don’t do.
Our fault thresholds are then based on those individual benchmark figures, and continuously monitored to check for a change from this known baseline.
| Metric | Definition | Threshold | Time horizon |
| --- | --- | --- | --- |
| Latency | Round trip time from site to core – a sustained increase from benchmark | +30ms | 60m |
| Jitter | Variation in round trip time – a sustained increase from benchmark | +30ms | 60m |
| Loss¹ | Packet loss – a sustained increase from benchmark | +0.5% | <30m |
| Throughput | Achievable bandwidth – a sustained decrease from benchmark | -20% | 60m |
| Intermittency | Repeated short-lived drops or degradations | varies | any |
When our analytics and monitoring platform detects these changes from benchmark, support cases are automatically raised and investigated by our systems and engineers.
Degradation is often mitigated before the time horizons quoted but would not be classed as a fault needing carrier attention unless those time horizons are breached, or the frequency of intermittency has risen.
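This kind of baseline comparison can be sketched as a rolling window over recent measurements (the threshold and horizon values come from the latency row above; the windowing logic itself is an assumption, not the vendor's algorithm):

```python
from collections import deque

class LatencyWatch:
    """Flag a sustained rise above a per-circuit benchmark.

    Mirrors the latency row above: raise a case only when round trip
    time stays more than `threshold_ms` over benchmark for the whole
    `horizon` of samples (e.g. 60 one-minute samples ~ 60 minutes).
    """
    def __init__(self, benchmark_ms, threshold_ms=30.0, horizon=60):
        self.benchmark_ms = benchmark_ms
        self.threshold_ms = threshold_ms
        self.window = deque(maxlen=horizon)  # deviations from benchmark

    def sample(self, rtt_ms):
        """Record one measurement; return True if a case should be raised."""
        self.window.append(rtt_ms - self.benchmark_ms)
        return (len(self.window) == self.window.maxlen
                and all(d > self.threshold_ms for d in self.window))
```

Requiring the whole window to breach the threshold is what makes the increase "sustained": a single latency spike resets nothing, but also raises nothing.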
Major service events#
Rather than just target Major Service Outages (MSOs), we include all major network events in our major incident management policy.
An event that doesn’t cause downtime but does cause disruption or degradation in service, such as intermittent packet loss or high latency, can be just as debilitating as an outage.
We therefore define Major Service Events (MSEs) to include any degradation of service affecting multiple customers.
Our mantra guiding our communication strategy during these events is clear: keep talking.
On detecting an MSE, we declare Priority 0. A nominated senior member of staff is assigned as MSE lead and immediately puts us on Priority 0 footing:
- All engineers are immediately re-tasked with diagnostic and mitigation measures.
- Dedicated engineers are assigned as supplier liaisons to co-ordinate third parties.
- MSE lead briefs non-technical staff on communication message to customers and partners.
- Non-technical staff call every customer and partner, repeating every 30 minutes with updates.
Large scale network events usually arise from one or more of the following:
| Level | Major Service Event (MSE) | Contact | Updates |
| --- | --- | --- | --- |
| Priority 0 | Site transit carrier provider – servicing last mile circuits, includes their backhaul providers | Phone call | Every 30m |
| Priority 0 | Upstream carrier providers – servicing datacentre transit | Phone call | Every 30m |
| Priority 0 | National infrastructure issues – major DCs and peering facilitating large parts of global or national network transit for all ISPs | Phone call | Every 30m |
| Priority 0 | Datacentre infrastructure – datacentre partner core infrastructure issue or major incident (e.g. DC fire, flood) | Phone call | Every 30m |
During any MSE, we call all affected partners at minimum every 30 minutes until the issue is resolved, explaining what we know and what steps are being taken, if any, to mitigate the issues.
All major incidents affecting multiple customers are logged separately, and Reason For Event (RFE) documents are produced and circulated, again expanding upon the idea of a Reason For Outage (RFO).
Cellular services can exhibit more packet loss than other circuit types, and this does not necessarily indicate a fault. Where this is the case, options are available, including multiple cellular services and Forward Error Correction. ↩