
Data Integrity Priority One

June 26, 2009

Posted by stewsutton in Information Technology, Knowledge Management.

Sometimes we have the opportunity to bear witness to a collection of failures that all arrive at the same time.  It is during those rare and unfortunate times that we can rank, value, and prioritize where we wish to put our mitigation efforts to minimize risk in the future.  I had the opportunity to do a bit of risk ranking this past week when our corporate wiki decided to take a holiday.

Corporate wikis of the “COMPANYpedia” variety can quickly become a tool that many people depend on.  Some use the service as a sort of bookmarking index to content items distributed across the network.  Others plan projects and track the current state of activity for coordination between team members, while still others take the Wikipedia approach and author articles (pages) in an encyclopedia-style format to define the knowledge of the organization.  We have a mix of all of that in our CORPpedia service, and it has been growing in popularity for more than three years.  And then we had our “Black Monday” event this past week.

After an update to the wiki’s operating system configuration, the administrative console was acting a bit strange, so the decision was made to reboot the system.  Here is where an additional piece of complexity comes into the equation: the wiki runs on our virtual infrastructure.  It lives on a virtual server that is part of a larger server collection, so rebooting the platform amounts to sending a signal from the virtual infrastructure’s system console to that machine identity.  After the reboot command was issued, the virtual server disappeared.
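In hindsight, a simple sanity check before the signal was sent would have surfaced trouble sooner.  The sketch below is only an illustration (in Python, with invented inventory and snapshot helpers standing in for whatever the virtual infrastructure tooling actually provides, which the post does not name), but it captures the kind of pre-reboot guard that would have helped: confirm the machine identity is still registered and recently snapshotted before rebooting it.

```python
# Hypothetical pre-reboot guard for a virtual server.  The inventory and
# snapshot structures below are placeholders for whatever API the virtual
# infrastructure actually exposes; nothing here reflects the vendor tooling
# involved in the incident described above.

from datetime import datetime, timedelta


def vm_is_registered(vm_name, inventory):
    """Check that the VM still appears in the infrastructure inventory."""
    return vm_name in inventory


def has_recent_snapshot(vm_name, snapshots, max_age_hours=24):
    """Check that a snapshot newer than max_age_hours exists for the VM."""
    latest = max((s["taken_at"] for s in snapshots.get(vm_name, [])), default=None)
    return latest is not None and datetime.utcnow() - latest < timedelta(hours=max_age_hours)


def safe_to_reboot(vm_name, inventory, snapshots):
    """Raise rather than allow a reboot of an unregistered or unsnapshotted VM."""
    if not vm_is_registered(vm_name, inventory):
        raise RuntimeError(f"{vm_name} is missing from inventory; do not reboot")
    if not has_recent_snapshot(vm_name, snapshots):
        raise RuntimeError(f"{vm_name} has no recent snapshot; take one first")
    return True
```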

Not to worry, we just need to exercise the administrative tool set and locate that image.  It’s out there somewhere.  The staff looks and can’t find anything.  It’s as if the image of that machine (every virtual server is actually a software image that runs as if it were a physical machine) was never there.  Not only had Elvis left the building, but there was no record that he ever made an appearance in our data center.  Things are getting a bit weird, so while the staff continues to look for the missing image, another part of the support group begins to hunt down the most recent backup.  Here is where you get to imagine some folks getting pretty concerned.  While one team is looking for, and not finding, the virtual image that was running just a few minutes ago, another team begins a quest through the daily backup store to start the process of recovery.  And things get worse.  The logs of the system that should contain the virtual image are shipped over the Internet to the vendor support staff for analysis.  While that is happening, the staff in charge of locating the backup determines that no successful backup of this system has been executed since it moved to the virtual infrastructure.  A question begins to form in people’s minds: how far back will we have to go in time to reconstruct the corporate wiki?  The answer turns out to be two months.  That’s two months during which hundreds of users were updating, using, and cross-linking material within this collaborative service.
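The deeper lesson here is that backups have to be verified, not assumed.  Something as small as the daily freshness check sketched below would have flagged the gap within a day rather than after two months.  This is only a sketch under assumptions of my own: the backup directory and file naming are invented for illustration, and the actual environment presumably relied on its own backup infrastructure rather than a script like this.

```python
# Sketch of a daily backup-freshness check.  The backup location and the
# archive naming convention are assumptions for illustration only.

import sys
from datetime import datetime, timedelta
from pathlib import Path

BACKUP_DIR = Path("/backups/corppedia")   # hypothetical location
MAX_AGE = timedelta(days=1)               # alert if the newest backup is older


def newest_backup_age(backup_dir):
    """Return the age of the most recent archive in backup_dir, or None if empty."""
    files = [p for p in backup_dir.glob("*.tar.gz") if p.is_file()]
    if not files:
        return None
    newest = max(files, key=lambda p: p.stat().st_mtime)
    return datetime.now() - datetime.fromtimestamp(newest.stat().st_mtime)


if __name__ == "__main__":
    age = newest_backup_age(BACKUP_DIR)
    if age is None or age > MAX_AGE:
        print(f"ALERT: no fresh backup in {BACKUP_DIR} (age: {age})", file=sys.stderr)
        sys.exit(1)
    print(f"OK: newest backup is {age} old")
```

Run from cron (or any scheduler) and wired to an alert, a check like this turns a silent procedural failure into a noisy one.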

Now comes the time to evaluate the two failures.  There was a process and technology failure associated with the disappearance of the virtual server, and there was a major procedural failure in the backup process for that service.  While there was a confidence hiccup surrounding the virtual server technology, the bigger confidence gap emerged around the two-month data loss.  That was truly unacceptable for an enterprise IT group, and it will take time to rebuild confidence in the service.  We have already seen that slow return to confidence this week.  What used to be hundreds to thousands of edits per day in the wiki has trickled to tens of edits.  There is an element of “where do I begin?” and “that didn’t just happen, did it?” still floating around the service.  We have established tech support blogs, mail lists, and direct contact numbers to address staff concerns, and we have initiated both enterprise-wide communications and targeted communications for known users of the service.  All of this is aimed at restoring confidence.

The winner of the largest-confidence-gap contest is the data loss due to the procedural and technology failure in the backup process.  It beats the virtual server technology failure by a landslide, which confirms that data integrity is priority one.  That probably does not come as a surprise, but in this case we have a parallel failure to measure it against.

So we now begin the reconstruction of an interconnected, multi-author segment of corporate memory.  It’s like trying to restore memory with only a light, fuzzy image (if any image at all) of what once occupied that space.  In a year this will be just an unpleasant historical event, but right now it’s quite a bit more than that.
