Not long ago in a land not far away, a manufacturing company was poised to
perform a major version upgrade of its most mission-critical application. The
company relied on that application to manage the very core of its business.
Without it, business could not be done. No orders could be shipped. The company
wouldn't know a loyal customer from a stranger.
So the company invested serious time and effort into testing the new version.
Users had been cycling through training rooms full of test workstations for
months before the go-live date. The application vendor, which was on hand to
witness the first large-scale deployment of the new version and fix bugs where
necessary, had been an excellent partner throughout the process. The new
functionality had been tested thoroughly and the user base was very excited
about it.
The go-live planning was fairly complicated. The back end of the application was
running on a database cluster using some of the best servers money could buy.
The database itself lived on a highly redundant storage array that had been
implemented during the last application upgrade a few years earlier.
The new application version introduced significant structural changes to the way
data was stored in the database, which entailed a complicated data conversion
process that had to be performed offline. Fortunately, this process had been
tested several times, with production mirrors restored from backups into the test
environment and then converted. In addition, several libraries on the
workstations had to be upgraded. Unfortunately, once those library upgrades were
complete, it would be impossible to use the old client version.
To mitigate the failback risk, the migration had been planned down to the
smallest detail. It was decided that the system would be brought down after
second shift on a Saturday evening. Database backups would be made and the
conversion process would begin. Perhaps eight to twelve hours later, testing
would be performed by a group of power users. If everything looked good at that
point, policies would be pushed to the network to initiate the client upgrades.
Soon after, Sunday second-shift employees would be able to get into the new live
system and the migration would be complete.
Finally, it was go-live time. Like clockwork, the system came down at 7 p.m. on
Saturday and the conversion process began. Early the following morning, the
migration team had completed its work. The database looked ready to go, and the
upgraded application servers had been deployed. User testing kicked into
gear again and everything looked excellent. Then came the order to push out the
new client software -- and roughly 1,000 workstations started to receive the
client updates.
The first sign that something was amiss came in the form of a help desk call
from a power user on the shop floor. He was trying to update an order he was
working on, but it was taking a long time to get from one screen to the next. It
wasn't a big deal; it just wasn't anywhere near as snappy as he remembered it
being from training. At this point only about 150 users had received the upgrade
and logged in to the system. Members of the application team looked at each
other with a growing sense of dread. It was as if the temperature in the
datacenter had suddenly dropped 20 degrees.