Christian Bilien’s Oracle performance and tuning blog

October 22, 2007

Tale of a not-so-acute risk assessment

Filed under: Storage — christianbilien @ 9:41 am

One of my customers experienced a storage outage this weekend that was unusual enough to be worth mentioning here:

Together with the storage guys, I designed a four-director, dual-fabric configuration, deemed to be the safest configuration you can think of in terms of SAN topology. As each server is connected to two fabrics, a “logical” problem (a DSN bug for example) in one fabric would not impact server availability. Each fabric is spread over two sites, and each database server is clustered between the two sites. Each director has a number of high availability features such as dual AC power.

A site failure triggers cluster failovers.

The risks, threats and vulnerabilities were assessed but… no one (including me of course) ever considered this category of bug on the Brocade switches:

Sun(sm) Alert Notification

Sun Alert ID: 101607 (formerly 57687)

Synopsis: A Switch or Director With Fabric OS 4.2.x or Earlier May Panic if it Has Been Up For More Than 497.1 Days

Category: Availability

Product: Brocade SilkWorm 3850 Fabric Switch, Brocade SilkWorm 24000 Director, Brocade SilkWorm 3250 Fabric Switch, Brocade 12000 2 GB Switch, SAN Brocade 12000 2 GB 64-port Switch, Brocade SilkWorm 3900 Switch, Brocade 2400 1 GB 8-Port Switch

BugIDs: 6197589

Avoidance: Patch, Workaround

State: Resolved

Date Released: 24-Nov-2004, 30-Mar-2005

Date Closed: 30-Mar-2005

Date Modified: 30-Mar-2005, 01-Mar-2006

The “may panic” should actually be read as “will panic”! All switches were powered on exactly 497 days ago, when they were installed. As the switch startup times were approximately identical on each site, the pair of switches on each site panicked at the same time this weekend, causing a first failover from site A to site B, followed one hour later by a failback from B to A.
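The alert does not say where the 497.1 days come from. The figure matches the time an unsigned 32-bit counter takes to wrap when it is incremented 100 times per second, so the sketch below works under that assumption. The tick rate, the counter width, the switch names and the boot times are all hypothetical and are not taken from the alert or from the customer's SAN. It shows the arithmetic, and how switches booted close together would reach the limit close together.

```python
from datetime import datetime, timedelta
from itertools import combinations

# Assumptions (not stated in the Sun alert): a 10 ms tick stored in an
# unsigned 32-bit counter that starts at zero when the switch boots.
TICKS_PER_SECOND = 100
COUNTER_BITS = 32


def rollover_period() -> timedelta:
    """Time needed for the assumed tick counter to wrap around."""
    return timedelta(seconds=2 ** COUNTER_BITS / TICKS_PER_SECOND)


def rollover_date(boot_time: datetime) -> datetime:
    """Date at which a switch booted at boot_time reaches the limit."""
    return boot_time + rollover_period()


def simultaneous_rollovers(boot_times, window=timedelta(hours=1)):
    """Return the pairs of switches whose rollover dates fall within
    `window` of each other. `boot_times` maps a name to a boot datetime."""
    pairs = []
    for (a, t_a), (b, t_b) in combinations(sorted(boot_times.items()), 2):
        if abs(rollover_date(t_a) - rollover_date(t_b)) <= window:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    days = rollover_period().total_seconds() / 86400
    print(f"counter wraps after {days:.1f} days")

    # Hypothetical boot times, for illustration only.
    boots = {
        "siteA-fabric1": datetime(2006, 6, 12, 9, 0),
        "siteA-fabric2": datetime(2006, 6, 12, 9, 20),
        "siteB-fabric1": datetime(2006, 6, 12, 10, 30),
    }
    print(simultaneous_rollovers(boots, window=timedelta(minutes=30)))
```

If the assumption holds, redundancy across fabrics offers no protection when all the directors are started within the same hour: the deadline is a fixed offset from boot time, so the failures are as correlated as the power-on dates.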

Moral of the story: just like Windows in the old days, reboot the SAN directors every x months (just joking – only proactive patching could have prevented this chain of incidents).



