Life is full of learning experiences, and we had one yesterday.
A minor patch to our environment exposed underlying database corruption, which resulted in our internal social platform being unavailable for almost a full business day.
The backups? They were corrupt as well.
Thanks to the exceptional effort of everyone involved, nothing significant was lost.
Sure, there are lessons to be learned on proper support practices for important applications (and our social platform is now one of those), but there are other lessons to be learned as well.
#1 — The Impact Was Stunning
Len wrote a great post on “The Air That I Breathe”, and I think that’s a great analogy.
All day long, it was hard for many of us to get business done, simply because the platform wasn’t available. It was pretty much in the same league as “email unavailable”.
So, at what point did this social platform go from “nice to have” to “need to have”? There wasn’t a defined point that I can see; it just kind of snuck up on us.
People were resilient, and adapted — that’s what we all do anyway. But it had a huge impact on a lot of people’s workdays, and did nothing to build confidence in the platform.
#2 — At Some Point, Declare Your Social Platform As Mission Critical
We didn’t do that.
As a result, we didn’t get the same operational procedures that EMC’s top-tier applications get. I’m *not* blaming the IT guys — they have a scheme for how they categorize applications, and ours wasn’t in the appropriate tier.
Why does that matter? More scrutiny and extra effort are applied to make sure that the application is always available — usually at significant additional cost.
Some of the investments that top-tier applications get include:
advanced test, dev and staging environments to allow quick roll-back if there’s a problem
snapping off disk copies of your database and running consistency checks before they go to tape or another backup device
HA failover of servers, storage — or even physical locations!
maintenance at off-hours, rather than prime time
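That second item — snap a copy, verify it, and only then treat it as a backup — is exactly what would have caught our corrupt backups. Here’s a minimal sketch of the idea; it assumes a SQLite database purely for illustration (our platform’s actual database and tooling were different), but the principle is the same for any engine with a consistency check:

```python
import shutil
import sqlite3
from pathlib import Path

def snapshot_and_verify(db_path: Path, backup_dir: Path) -> Path:
    """Copy the database, run a consistency check on the copy,
    and only keep the copy as a backup if the check passes."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    snapshot = backup_dir / (db_path.name + ".snap")

    # "Snap off" a disk copy of the database file.
    shutil.copy2(db_path, snapshot)

    # Run the consistency check against the copy, not the live database.
    conn = sqlite3.connect(str(snapshot))
    try:
        result = conn.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        conn.close()

    if result != "ok":
        # A corrupt backup is worse than no backup: discard it and alarm.
        snapshot.unlink()
        raise RuntimeError(f"Backup integrity check failed: {result}")

    # Only a verified snapshot goes on to tape or other backup media.
    return snapshot
```

The point isn’t the specific code — it’s that the verification happens *before* the copy is declared a backup, so corruption is caught on day one instead of discovered during a restore.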
Well, now we have a case to elevate the category, so to speak.
And probably a willingness to spend more $$$ to keep this from happening again.
#3 — Vendors In This Space Will Need To Revisit Their Processes
EMC sells mission-critical hardware and software for a living. We know what top-tier customer support looks like — it’s an integral part of our business.
You never can get good enough at this stuff, trust me.
Now, we’re not blaming anyone here, but I think it’s safe to say that we were exercising our software vendor’s support processes in a unique and unexpected manner. We had 10,000+ users down, and things were pretty bleak there for a while.
Everyone pitched in and helped once an emergency was declared, but it was pretty clear that it was an immature process, relatively speaking.
If you’re a vendor in this space, and you’re convincing customers that your product is essential to their business, and your customer does what you told them to do and now has their entire company running on your stuff, you’re going to have to start thinking like a mission-critical vendor, and invest appropriately.
Everything breaks now and then — it’s what technology does.
What can’t break are the service and support processes: problem escalation, expert triage, advanced notice of potential problem areas, proactive preventative fixes … the whole ball of wax.