This article was produced in collaboration with Statuspage as part of our supported channels program. We appreciate their support as it helps us invest in content for the greater Support Driven community.
Want to be part of the discussion? Join us in #incident-comms
Communicating to end users during an incident is a relatively simple concept, but far from easy in practice. Who should communicate? How should they communicate? What tools should you use to communicate, and should any of it be automated?
Olark is a company that fundamentally understands the power of communication. In fact, their mission is to make business human by helping teams communicate in an authentic way with live chat software. With a mission like that, it’s no wonder they also value transparent incident communication when something goes wrong.
Diving into incident communication
Olark has been a Statuspage customer for 3 years, and over time they’ve built up a pretty robust incident communication process. They’ve managed to build time-saving automation into this workflow, without losing the human touch, and they’ve done it all with a fully remote team. Color us impressed! There’s a lot to be learned from their process, so we sat down with Sarah Betts, a Customer Champion at Olark, to learn more.
Sink before you swim
Sarah is no stranger to incident management. In fact, she found herself helping handle a full-day outage her first day on the job. Talk about being thrown in the deep end! Although this sounds like a very stressful scenario (and it was), it was also a unique opportunity to sink her teeth into an important part of the role starting day one.
Olark has built a great process from the inside out with documentation, bots, integrations, and a great culture to boot. We focused our conversation on the way they’ve combined incident management tools with bots to automate certain processes and save time when it matters most.
Slack as a command center
Olark uses Slack as a sort of incident command center – a place where they can connect the right team members, tools, and bots together for a more seamless incident workflow. As Sarah described, “The interplay of tools is key for a great incident communication process…It’s especially critical to run everything through a chat tool so everyone can see what’s going on throughout an incident.”
Chat alone is helpful, but the real magic is in the custom bots they’ve built…
Give a bot a shot
The team at Olark used a Hubot script to build a downtime bot (who they lovingly refer to as Joelarkami after an Olark alum). The bot takes center stage in Slack as soon as a problem is detected. First, someone on the support team summons the bot in Slack.
Once the bot confirms that the reporter meant to trigger an incident, it sends a message to the Engineering, Support, and Marketing teams and to folks on the current on-call roster. At the same time, it opens a #downtime Slack channel with a note like: “Hold onto your hats, outage alerted!”
Now that the right people are in place, they need to jump into response mode. To help with this, the bot runs a Google script that opens their Support Documentation, which includes detailed information on incident response. With this guidance, the team on call can chat in the channel about their communication plan.
Since there is no substitute for face-to-face communication, especially in a time of crisis, they also set up a Zoom integration for their incident comms Slack channel. Just a quick “/zoom” and a video conference gets set up in the context of the chatter they’ve already had.
While the comms plan is being cooked up, the bot automatically opens a Jira ticket on a Jira board they’ve created to track downtime events. Additionally, a post-mortem document gets preemptively created in Google Docs.
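To make the sequence above concrete, here is a minimal sketch of the same flow in Python. It is not Olark’s Hubot script, and it makes no real Slack, Jira, or Google calls: each action is simply recorded in a log. The names `IncidentLog`, `declare_incident`, and the step labels are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class IncidentLog:
    """Records each step the bot takes so the sequence can be reviewed later."""
    steps: List[str] = field(default_factory=list)

    def record(self, step: str) -> None:
        self.steps.append(step)


def declare_incident(reporter: str,
                     confirm: Callable[[str], bool],
                     log: IncidentLog) -> bool:
    """Run the downtime workflow if the reporter confirms; return True if it ran."""
    # Step 1: make sure the reporter really meant to trigger an incident.
    if not confirm(reporter):
        log.record("cancelled: reporter did not confirm")
        return False

    # Step 2: alert the teams named in the article plus the on-call roster.
    for audience in ("Engineering", "Support", "Marketing", "on-call"):
        log.record(f"notify:{audience}")

    # Step 3: open the shared channel and surface the response documentation.
    log.record("open_channel:#downtime")
    log.record("open_docs:incident-response")

    # Step 4: create the tracking ticket and the post-mortem document up front.
    log.record("create_ticket:downtime-board")
    log.record("create_doc:post-mortem")
    return True


if __name__ == "__main__":
    log = IncidentLog()
    # A real bot would ask the reporter in chat; here we always confirm.
    declare_incident("reporter", confirm=lambda who: True, log=log)
    for step in log.steps:
        print(step)
```

Recording steps instead of calling real services also makes a sketch like this easy to use in a fire drill: the team can walk through the whole sequence without paging anyone.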
As soon as the incident is resolved, the team is accountable for revisiting what went wrong and how they can improve for next time, which makes duplicate incidents very rare.
This combo of human and bot response ensures that Olark’s customers are never left in the dark when something goes wrong. This type of transparency is rooted in their company mission and values, and makes for much happier and more loyal customers. “People know if something’s up and you’re not telling them,” Sarah told us. “It erodes trust very quickly.”
Transparency is also important internally, so all of Olark’s incident responders stay on the same page. This is especially crucial since they have a fully remote team who can’t rely on desk or water cooler chatter to stay in the loop.
A riveting remote culture
The team at Olark puts a lot of effort into creating a thriving remote culture. This helps them handle incidents better in times of crisis, and enjoy their work more when things are ‘business as usual’.
Every team has a Slack room where work and life chatter intertwine. Olark also has fun watercooler channels such as #cuteoverloard (cute animal pics, naturally!), #spoiler-alerts, #olark-to-eat, #reading-rainbow, #dj-roomba, #outdoors, and more. Every Friday there is a Show and Tell hour where they share learnings, and there’s a spreadsheet dubbed OlarkBnB that guides employees to fellow Olarkers’ homes when they are traveling the world. As Sarah put it, “We learned over many years that building solid relationships makes problem solving faster and easier.”
Try it yourself
Interested in adding some automation to your incident communication process? Tell us about your automation in #incident-comms. Here are some tips and links to help you get started:
- Decide as a team how much automation you’re comfortable with. We recommend starting small and then increasing as you see fit.
- No matter how much automation you employ, make sure to keep your comms human to build rapport and trust with your customers.
- Before you use new automation and/or bots during an incident, run a couple of incident fire drills with the team so everyone is used to the new process before something is actually on fire.
- Run internal post-incident reviews (PIRs) to assess how well your incident response processes are working (including your communications). Here’s a helpful playbook on running an incident communication-focused PIR.