Christian Bilien’s Oracle performance and tuning blog

July 17, 2007

Hooray for the 11g ASM: “Fast” Mirror Resync at last!

Filed under: Oracle,RAC — christianbilien @ 9:05 pm

Forgive me if I sound over-enthusiastic: I already mentioned in RAC geo clusters on HP-UX and in RAC geo clusters on Solaris how annoying the absence of incremental mirror rebuild was to ASM-based RAC geo clusters. The fact that a full rebuild of the mirror structure is needed effectively writes off this architecture for databases over a few hundred GB. Here is a situation very common to me: you have a two-site RAC/ASM geo cluster made of one storage array and one node on each site. Site A has to be taken down for cooling maintenance (sites are actually taken down at least 5 times a year), but the surviving site B must be kept up to preserve the weekend operations. The ASM instances on site A are gracefully taken down, but all of the mirrored ASM failure groups have to be fully rebuilt when they are brought back up. The more databases, the more terabytes have to move around.

An actual outage is an even more dramatic opportunity for trouble: assuming the cluster transfers the application loads to your backup site, you nonetheless have to wait for the outage cause to be fixed plus the ASM rebuild time before being back to normal. Just pray you don’t have a second outage on the backup site while the first site is busy rebuilding its ASM failure groups.

The consequences of this single ASM weakness reach as far as having to use a third-party cluster on a SAN just to be able to make use of VxVM Dirty Region Logging (DRL) for large databases. Having to make a “strategic” decision (third-party cluster or not) on such a far-reaching choice solely based on this is to me a major annoyance.

There are a few promising words in the Oracle 11g new features area posted on the OTN site about a “Fast Mirror Resync” in Automatic Storage Management New Features Overview, which should be just the long-awaited DRL-like ASM rebuild feature. ASM can now resynchronize only the extents that have been modified during the outage. It also looks like a disk group has a DISK_REPAIR_TIME attribute that defines a window within which the failure can be repaired and the mirrored failure group storage array brought back on-line, after which an “ALTER DISKGROUP … ONLINE” starts the resynchronization process. What happens if you exceed DISK_REPAIR_TIME is not stated.
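Going by the overview document, the syntax should look something like the sketch below (the diskgroup name DATA and the failgroup name FG_SITE_A are made up for illustration; the repair window value is arbitrary):

-- Widen the repair window on the diskgroup before planned maintenance
ALTER DISKGROUP data SET ATTRIBUTE 'disk_repair_time' = '8h';

-- Once the failed site/array is back, bring its disks online and let
-- ASM resynchronize only the stale extents
ALTER DISKGROUP data ONLINE DISKS IN FAILGROUP fg_site_a;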


December 24, 2007

11g ASM preferred reads for RAC extended clusters

Filed under: Oracle,RAC — christianbilien @ 10:14 am

I expressed this summer my delight over the 11g fast resync option for extended (geographical) RAC clusters, in which the mirrored failure groups have to be on different sites, e.g. one failure group on array 1 on site 1 and the mirrored copy on array 2 on site 2. The major 10g limitations are twofold in this area:

  • The loss of an array means the mandatory reconstruction of the whole database.
  • Failgroups have to be reconstructed serially (no more than one rebalance activity per node, per ASM disk group).

This is really painful, as any scheduled shutdown of one array while the other one stays active is analogous to an outage. Imagine what it takes to rebuild 100TB.
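The serial rebalance can at least be throttled or accelerated per operation, and its progress monitored; a quick sketch (the diskgroup name DATA is made up):

-- Raise the rebalance power for this reconstruction (1-11 in 10g/11gR1)
ALTER DISKGROUP data REBALANCE POWER 11;

-- Follow the single rebalance operation per group
SELECT group_number, operation, state, power, est_minutes
FROM v$asm_operation;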

Normal redundancy (mirrored) reconstruction is much slower than either the SAN or the storage layout allows (there is room for “research” here: the best I have seen so far is 150 GB/hour, i.e. about 43 MB/s, on 64 disks at 15,000 rpm on the highest-end DMX storage array; the FC ports were 2x 4Gb/s).

The second problem with RAC 10g extended clusters is linked to the way the ASM storage allocator creates primary and mirrored extents. In a disk group made of two failgroups, there is no notion of a “primary failgroup” and a “secondary failgroup”: each failgroup contains an equal number of primary and mirrored copies. Let’s take the example of a diskgroup made of two failgroups, each on a different storage array. For each of the storage arrays, here is the number of primary and mirrored extents for a given group (it holds just one datafile here):

We’ll use the X$KFFXP table. The columns of interest here are:

GROUP_KFFXP: disk group number
LXN_KFFXP: 0 = primary extent, 1 = mirrored extent
DISK_KFFXP: disk where the extent has been created
NUMBER_KFFXP: ASM file number
XNUM_KFFXP: virtual extent number (the value 65534 is filtered out in the query below)

select disk_kffxp, lxn_kffxp, count(*)
from x$kffxp x, v$asm_file v
where group_kffxp = 1
  and x.number_kffxp = v.file_number
  and xnum_kffxp != 65534
group by disk_kffxp, lxn_kffxp;

DISK_KFFXP  LXN_KFFXP   COUNT(*)
---------- ---------- ----------
         1          0       1050
         1          1       1050
         2          0       1050
         2          1       1050

The number of primary and mirrored extents is balanced across the disks of group 1. The logic is that while writes equally hit both types of extents, reads only access the primary extents. It makes sense to spread primary extents over the two failure groups to maximize throughput when the two failure groups are on an equal footing: same throughput and same I/O response time.

But what happens when the boxes are in different locations? As each instance will read from both arrays, the distance factor will introduce delays on half of the reads. The delay is a function of the distance, but we also need to take into account that the number of storage hops the reads must traverse increases when a remote site has to be reached. Even worse, the inter-site links are prone to congestion as they must accommodate oversubscription and link contention.

Whenever possible, clever use of RAC data locality will lighten the RAC interconnect burden and reduce reliance on cache fusion. In 11g, it can also reduce the inter-site link throughput requirements and possibly speed up reads by reading data locally, assuming the local array is not itself overloaded.

The preferred read is an instance-wide initialization parameter:

ASM_PREFERRED_READ_FAILURE_GROUPS = diskgroup_name.failure_group_name,…
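In an extended cluster, each ASM instance would point at its local failure group. A sketch (the diskgroup name DATA, the failgroup names FG_SITE1/FG_SITE2 and the instance SIDs are made up; this assumes an spfile-managed ASM instance):

-- On the site 1 ASM instance: read from the local failgroup
ALTER SYSTEM SET asm_preferred_read_failure_groups = 'DATA.FG_SITE1' SID = '+ASM1';

-- On the site 2 ASM instance: read from its own local failgroup
ALTER SYSTEM SET asm_preferred_read_failure_groups = 'DATA.FG_SITE2' SID = '+ASM2';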

ASM extent layout reference: Take a look at Luca Canali’s work: a peek inside Oracle ASM metadata

— Updated: check the comment below for Luca’s updated links – very interesting indeed —
