July 17, 2007

Hooray for 11g ASM: “Fast” Mirror Resync at last!

Forgive me if I sound over-enthusiastic: I already mentioned in RAC geo clusters on HP-UX and in RAC geo clusters on Solaris how annoying the absence of an incremental mirror rebuild is for ASM-based RAC geo clusters. In fact, the need for a full rebuild of the mirror rules this architecture out for databases larger than a few hundred GB. Here is a situation I see very often: you have a two-site RAC/ASM-based geo cluster made of one storage array and one node on each site. Site A has to be taken down for cooling maintenance (sites are actually taken down at least 5 times a year), but the surviving site B must be kept up to preserve weekend operations. The ASM instances on site A are gracefully taken down, but all of the mirrored ASM failure groups still have to be fully rebuilt when they are brought back up. The more databases, the more terabytes have to move around.

An actual outage is an even more dramatic opportunity for trouble: assuming the clusters transfer the application load to your backup site, you nonetheless have to wait for the cause of the outage to be fixed, plus the ASM rebuild time, before you are back to normal. Just pray you don’t have a second outage on the backup site while the first site is busy rebuilding its ASM failure groups.

The consequences of this single ASM weakness reach as far as having to use a third-party cluster on a SAN just to be able to use VxVM Dirty Region Logging for large databases. Having to make a “strategic” decision (third-party cluster or not) with such far-reaching consequences based solely on this is, to me, a major annoyance.

There are a few promising words about a “Fast Mirror Resync” in the Oracle 11g new features area posted on the OTN site, in Automatic Storage Management New Features Overview. This should be the long-awaited “DRL”-style ASM rebuild feature: ASM can now resynchronize only the extents that were modified during the outage. It also looks like a disk group has a DISK_REPAIR_TIME attribute that defines a window in which the failure can be repaired and the storage array holding the mirrored failure group brought back online, after which an “ALTER DISKGROUP DISK ONLINE” starts the resynchronization. What happens if you exceed DISK_REPAIR_TIME is not said.
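For the planned maintenance case described above, the sequence might look roughly like the sketch below. It follows the 11g ASM syntax as I understand it and has not been run against a real geo cluster. The disk group name DATA, the failure group name SITE_A and the 8 hour window are made-up values for illustration only.

```sql
-- Hypothetical names: disk group DATA, failure group SITE_A (the array on site A).
-- Assumption: the disk group compatibility attributes are already at 11.1 or higher.

-- Set the repair window to cover the expected maintenance duration.
ALTER DISKGROUP data SET ATTRIBUTE 'disk_repair_time' = '8h';

-- Before the shutdown of site A, take its failure group offline.
ALTER DISKGROUP data OFFLINE DISKS IN FAILGROUP site_a;

-- Once the array on site A is reachable again, bring the disks back online.
-- This is the step that should resynchronize only the changed extents.
ALTER DISKGROUP data ONLINE DISKS IN FAILGROUP site_a;

-- Watch the disk status and the remaining repair time while disks are offline.
SELECT name, failgroup, mode_status, repair_timer
  FROM v$asm_disk;
```

The statements would be issued from the surviving ASM instance on site B. The window has to be chosen generously, since the whole benefit depends on the array coming back before it expires.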


