Olark: Textbook Service Recovery

Part of doing business in the cloud computing industry is handling outages. Salesforce, Twitter, and Google have all had their share. Service disruptions affect customers, with severity ranging from minor annoyance to halting the gears of business. How companies handle these situations can make or break customer retention and loyalty. Olark’s outage and response last week exemplify how to do it right.

Impact is Always Personal

When cloud software fails, the loss isn’t just in computing capability. Humans are wired to react strongly to loss. When we abruptly lose the power to do something, our brains trigger a threat response in the primitive, subconscious mind. The resulting chemical surge flares our emotions, readying us for “fight or flight.” During an outage we may also experience fear and anger from cascading losses: if we chose the vendor, our boss and fellow employees may blame us, straining our relationships and lowering our social status.

Since loss affects humans on such a fundamental, emotional level, customers experience a moment of truth whenever there’s a service failure. People know logically that “stuff happens,” but how they’re treated along the way leaves a lasting impression. “It’s not how far you fall, it’s how high you bounce,” my friend and former CEO was fond of saying.

Five Steps to Service Recovery

High-performing organizations use formal, repeatable methods to deal effectively with service failures. Managers frequently follow this five-step method:

  1. Apologize. Take responsibility and say, “I’m sorry.” It’s essential to hear, but often not said.
  2. Empathize. Connect with customer feelings and validate them. Show you understand and care. Doing so defuses emotional energy.
  3. Restore Service Quickly. Time is of the essence. Pull out all the stops and fix the problem immediately.
  4. Communicate. Customers will grow antsy the longer they wait. Be transparent and deliver regular status messages throughout the incident.
  5. Follow up. Offer restitution, describe what happened and how you’ll prevent recurrence, and reconnect after resolution to rebuild trust.

Handling difficult situations well increases customer trust and loyalty. In a 1992 study, for example, IBM found that when customers had no problems, their repurchase rate was about 84%. If they had a problem and were satisfied with how it was handled, repurchase rate increased to 92%. If handled badly, only 46% would buy again. There’s a silver lining in these tense situations.

Olark’s Example

What follows is a textbook application of the five service-recovery steps. Customer chat provider Olark went down on January 25 and 26, and the company’s CEO sent this message after the incident (my comments inserted in bold):

Hello –

I’m emailing to update you on our service outages on Monday and Tuesday this week.

Service was fully restored at about 11:00am PST Tuesday and all systems are still stable as of this morning.

I know this has been a very frustrating and trying time for you as an Olark customer (Empathize), and for that I apologize (Apologize). Please know that, since Monday, our team has been working through the night to resolve two different incidents. The post mortems on these incidents are here and here. (links to incident pages where Olark maintained regular status Communication).

This has been a tough two days knowing that we’ve let you down, (Empathize) and we want to make amends. (Follow Up)

We failed to provide you with the service you deserve. I wish I could tell you this outage was unpredictable, or that it was all an external party’s fault, but it wasn’t.

On Monday night, our upstream service provider experienced an unexpected outage caused by maintenance of its entire data center, which lasted for hours. By 9:06pm PST, the Olark team identified the network outage. At 10:44pm PST, the service provider acknowledged its routine maintenance had problems and was affecting its customers, including Olark. Once the issue was resolved on their end, we began to restart our servers at around midnight.

We have been aware that it was possible that a cascading reboot of Olark’s system could lead to an outage. This is the kind of exceptionally rare event that could only happen during a major data center disruption like the one on Monday night. We have in fact been working on hardening our system to this kind of risk for months.

That’s why we know it was preventable. In the end, we did not execute quickly enough to prevent these two issues from affecting you.

We feel no great irony in the fact that the specific component that led to this outage was scheduled to be replaced this week. The positive news is that we spent the last months rewriting how the particular servers affected today are set up. Had the servers been using this new setup, it would have helped avoid this issue. These updates are still due to be released imminently, as they were scheduled regardless of this particular outage. (Follow up: prevention of recurrence)

You can rest assured, we are taking this seriously.

I realize that doesn’t make up for lost business Monday and Tuesday though.

As a mea culpa, we are issuing you 2 days’ worth of credit on your account. You should see that reflected in the next few days. (Follow up: restitution)

If you feel this isn’t sufficient, please let me know and we can discuss further – benc@olark.com (Follow up: reconnect to rebuild trust)

Please let me know if there’s anything else I can do to help,

Ben Congleton,

Chief Executive Olarker, Olark


Clearly Olark dealt with a tough situation, one facing everyone in the cloud computing business. But unlike many others, the company followed the essential steps to recover gracefully. Time will tell whether its actions rebuilt trust and retained customers, but Olark indeed bounced high after falling far.

Kudos to Mr. Congleton and the rest of the Olark team.