One of my customers experienced a storage outage this weekend that was exceptional enough to be worth mentioning here:
Together with the storage team, I had designed a four-director, dual-fabric configuration, deemed the safest SAN topology you can think of. As each server is connected to both fabrics, a “logical” problem in one fabric (a DSN bug, for example) would not impact server availability. Each fabric is spread over two sites, and each database server is clustered between the two sites. Each director has a number of high-availability features, such as dual AC power.
A site failure triggers cluster failovers.
The risks, threats and vulnerabilities had been assessed, but… no one (myself included, of course) ever considered this category of bug on the Brocade switches:
http://sunsolve.sun.com/search/document.do?assetkey=1-26-101607-1
Sun(sm) Alert Notification
Sun Alert ID: 101607 (formerly 57687)
Synopsis: A Switch or Director With Fabric OS 4.2.x or Earlier May Panic if it Has Been Up For More Than 497.1 Days
Category: Availability
Product: Brocade SilkWorm 3850 Fabric Switch, Brocade SilkWorm 24000 Director, Brocade SilkWorm 3250 Fabric Switch, Brocade 12000 2 GB Switch, SAN Brocade 12000 2 GB 64-port Switch, Brocade SilkWorm 3900 Switch, Brocade 2400 1 GB 8-Port Switch
BugIDs: 6197589
Avoidance: Patch, Workaround
State: Resolved
Date Released: 24-Nov-2004, 30-Mar-2005
Date Closed: 30-Mar-2005
Date Modified: 30-Mar-2005, 01-Mar-2006
The “may panic” should actually read “will panic”! All the switches had been powered on exactly 497 days earlier, on the day they were installed. As the startup times were approximately identical on each site, both pairs of switches panicked at the same time that weekend, causing a first failover from site A to site B, followed one hour later by a failback from B to A.
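The Sun Alert does not spell out the root cause, but the 497.1-day figure is the classic signature of a 32-bit uptime counter ticking every 10 ms and wrapping around after 2^32 ticks (the same arithmetic as the old Windows 95 hang after 49.7 days, which used 1 ms ticks). The tick interval here is my assumption, not something stated in the alert, but the arithmetic checks out:

```python
# Hypothesis: a 32-bit counter of 10 ms ticks wraps after 2**32 ticks.
TICK_MS = 10  # assumed tick interval; not confirmed by the Sun Alert
wrap_ms = 2**32 * TICK_MS
wrap_days = wrap_ms / (1000 * 60 * 60 * 24)
print(f"Counter wraps after {wrap_days:.1f} days")  # → 497.1 days
```

With 1 ms ticks the same formula gives 49.7 days, which is why this family of overflow bugs keeps reappearing at those two uptimes.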
Moral of the story: just as with Windows in the old days, reboot your SAN directors every x months (just joking – only proactive patching could have prevented this chain of incidents).