Service Interruptions

Our goal is to deliver services that are always available to customers and users. In practice, 100% availability is not achievable, so there are two categories of interruptions we must address.

Note: We use the term interruption rather than downtime to account for cases where noticeable system degradation may occur. The term downtime tends to imply the system is entirely offline.

Planned Maintenance: Typically system updates and configuration changes that lead to either degraded performance or a complete outage of the system.

Unplanned Interruptions: Typically, the system fails to meet its performance goals; in extreme cases, the entire system may be offline.

Planned Maintenance Guide

Principles

  • Reduce customer impact as much as is economically possible.
  • Accept that 100% uptime is not possible.
  • Make rational economic decisions around engineering effort vs. downtime effects.
  • Reduce risk through small batch sizes.

Process

  • Plan the event as far out as is realistic. Lead time will most likely be measured in days, maybe weeks, but almost certainly not months or quarters.
  • By default, we use a minimum notification threshold of 2 business days. Exceptions to the default may include cases where the change is urgent, such as fixing a data corruption bug or applying a critical security patch.
  • By default, we aim to deploy changes requiring more than 15 minutes of interruption outside of business hours.
  • Notify concerned parties of the work. Planned interruption notifications are submitted as announcements on the support.xxxxxxxxx.io site.
    • Perform the work. Status updates are to be added to the announcement article so that customers are notified of progress.
    • Notify the concerned parties that the work is complete and the overall state of the system. Once the work is complete, any final notes relevant to customers are placed in a comment.
    • The completion of the work is noted by prepending ‘COMPLETE-’ to the article title.
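
The default thresholds above lend themselves to a simple pre-flight check. The following is a minimal Python sketch, not an official tool: the function names, the Monday-to-Friday 09:00-17:00 definition of business hours, and the omission of public holidays are all assumptions made for illustration. Only the 2-business-day notice and the 15-minute threshold come from the process above. For brevity, the check looks only at the start time of the window.

```python
from datetime import date, datetime, time, timedelta

MIN_NOTICE_BUSINESS_DAYS = 2                  # default from the process above
OFF_HOURS_THRESHOLD = timedelta(minutes=15)   # default from the process above
BUSINESS_START = time(9, 0)                   # assumption: 09:00-17:00, Mon-Fri
BUSINESS_END = time(17, 0)


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`, skipping weekends."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def in_business_hours(moment: datetime) -> bool:
    """True if `moment` falls on a weekday within the assumed business hours."""
    return moment.weekday() < 5 and BUSINESS_START <= moment.time() < BUSINESS_END


def check_plan(notified_on: date, start: datetime, duration: timedelta,
               urgent: bool = False) -> list[str]:
    """Return policy warnings; an empty list means the plan fits the defaults."""
    warnings = []
    earliest = add_business_days(notified_on, MIN_NOTICE_BUSINESS_DAYS)
    # Urgent changes are exempt from the notification threshold only.
    if start.date() < earliest and not urgent:
        warnings.append(f"notice too short: earliest start is {earliest.isoformat()}")
    if duration > OFF_HOURS_THRESHOLD and in_business_hours(start):
        warnings.append("interruption over 15 minutes should run outside business hours")
    return warnings
```

With these assumptions, a 45-minute change announced on a Thursday and scheduled for 22:00 the following Monday produces no warnings, while the same change scheduled for 10:00 on the Friday produces both.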

Notification Template

  • Rationale
    • For planned outages, effective communication will give concerned parties time to prepare in advance. If concerned parties understand the reason for the outage, you’re more likely to get their support and cooperation.
  • Include:
    • A brief, non-technical explanation of the proposed IT event/outage/upgrade – including the services involved
    • An explanation of why it’s being undertaken – what’s the benefit?
    • The start and end time that services will be affected
    • Next steps once the outage is complete
    • Contact details of at least one person

Template

Summary

What? [BRIEF DESCRIPTION OF THE WORK BEING DONE]

When? [DATE, TIME AND TIMEZONE]

Who? Anyone who uses [SERVICE OR SYSTEM] in [OFFICE OR LOCATION]

Details

On [DATE], scheduled maintenance will be performed on [PRODUCT] systems and services in [OFFICE OR LOCATION]. The outage or degradation in performance is expected to last from [START TIME AND TIMEZONE] until [END TIME AND TIMEZONE]. The [PRODUCT] Team will advise when all services are restored.

Services impacted: [LIST OF SERVICES]

What is not impacted? [LIST OF SERVICES]

Impact: [DESCRIPTION OF THE IMPACT]

Questions? Please contact the [PRODUCT] Service Desk by emailing [help](xxxx@xxxxxxxxx).
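
Since the template relies on bracketed upper-case placeholders, a small helper can fill them in and flag any that were left untouched before the announcement is published. This is a hypothetical sketch: `fill_template` and its regular expression are illustrative and not part of any existing tooling.

```python
import re

# Matches bracketed upper-case placeholders such as [DATE, TIME AND TIMEZONE].
# Lower-case link text such as [help] is deliberately not matched.
PLACEHOLDER = re.compile(r"\[([A-Z][A-Z ,/]*)\]")


def fill_template(template: str, values: dict[str, str]) -> tuple[str, list[str]]:
    """Replace known placeholders and return (text, names still unfilled)."""
    missing: list[str] = []

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key in values:
            return values[key]
        missing.append(key)
        return match.group(0)  # leave unknown placeholders visible

    return PLACEHOLDER.sub(substitute, template), missing


if __name__ == "__main__":
    draft = "What? [BRIEF DESCRIPTION OF THE WORK BEING DONE]\nWhen? [DATE, TIME AND TIMEZONE]"
    text, unfilled = fill_template(
        draft, {"BRIEF DESCRIPTION OF THE WORK BEING DONE": "Database upgrade"}
    )
    print(text)
    print("Unfilled:", unfilled)
```

Publishing only when the returned list is empty is one way to avoid sending an announcement with a raw placeholder still in it.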

Unplanned Interruptions Guide

Principles

  • DO NOT PANIC
  • Accept that Unplanned Interruptions occur.
  • Make rational economic decisions around recovery effort vs. downtime effects.
  • Remain Blameless.
  • Clear communication improves recovery.
  • Document timelines, actions, and metadata for future use.
  • Make a rational decision to either roll back or roll forward.

Process

  • Assess the interruption. Determine if we are interrupted, spewing errors, or down.
  • Establish an Incident Communication Channel.
  • Establish an Incident Commander.
  • Document Incident metadata and actions.
  • Notify engineering stakeholders and business stakeholders.
  • Review logs and monitoring resources. Identify the interruption timeline.
  • Make a rational decision to either roll back or roll forward.
  • Clearly communicate the engineering decision to stakeholders.
  • Execute engineering actions: rollback/rollforward.
    • Perform the work. Status updates are to be added to the Incident Document and the Communication Channel so that stakeholders are notified of progress.
    • Service Restoration
      • Notify the stakeholders that the work is complete and the overall state of the system.
      • Document timeline, actions, communication channel context, Incident metadata.
      • Schedule an Incident Review for Response Engineers only. NO stakeholders.
      • Schedule a Postmortem for Engineers and Stakeholders.

Rationale

  • For unplanned outages, timing is everything. Staff need outage alerts as soon as possible to avoid frustration and an influx of Helpdesk calls.
  • Inform them of alternative systems to use during the outage period so that they can keep working. Include:
    • Which services are affected – including whether the service is slow, intermittently down or totally down
    • A brief, non-technical explanation of what has happened (if known)
    • Next steps in diagnosing and fixing the cause
    • Anticipated timeframes for resolution (if known)
    • Contact details of at least one person

Template

Summary

What? Issue with [SERVICE]

When? [START TIME, TIMEZONE AND DATE]

Who? [SERVICE OR SYSTEM] has been experiencing problems since [START TIME, TIMEZONE AND DATE]. Users in [OFFICE OR LOCATION] are currently unable to access [SERVICE OR SYSTEM].

Details

On [DATE], an unscheduled interruption occurred on [PRODUCT] systems and services in [OFFICE OR LOCATION]. Users may experience an outage or degradation in performance and/or availability, beginning at [START TIME AND TIMEZONE]. The [PRODUCT] and [SRE] Teams are actively responding to the interruption and working on recovery and remediation. Updates will be posted in the appropriate communications channels.

Services impacted: [LIST OF SERVICES]

What is not impacted? [LIST OF SERVICES]

Impact: [DESCRIPTION OF THE IMPACT]

Questions? Please contact the [PRODUCT] Service Desk by emailing [help](xxxx@xxxxxxxxx).

The SRE Team is working to resolve this issue. An update will be sent at [TIME AND TIMEZONE].