Christian Bilien’s Oracle performance and tuning blog

February 10, 2007

Join the BAARF Party (or not) (1/2)

Filed under: Oracle,Storage — christianbilien @ 6:38 pm

This is the title of the last chapter of “Oracle Insights, Tales of the Oak Table”, which I have just finished reading. The “Battle Against Any Raid Five” was also mentioned in “Oracle Wait Interface: A Practical Guide to Performance Diagnostics & Tuning”. So is RAID 5 really so bad? I’ll start with generalities, and in a second post I’ll cover some tests of stripe aggregation I did. The points below assume a RAID 5 4+1 group, i.e. four data disks plus parity.

1. Large reads, i.e. reads of a full stripe group, can be done in parallel, by reading from the four disks that contain the stripe group.

2. Small reads, i.e. reads of one stripe unit, exhibit good performance because they only tie up one disk, therefore allowing other small reads to other disks to proceed in parallel.

3. Large writes, i.e. writes of a full stripe group, require writing to five disks: the four data disks and the disk holding the parity for that stripe. The RAID 5 MR3 optimization is implemented, for example, in EMC Clariions such as the CX700: it delays flushing the write cache until a RAID 5 stripe has filled, at which time a modified RAID 3 write (MR3) is performed. The steps are: the entire stripe is XORed in memory to produce a parity segment, then the entire stripe, including the new parity, is written to the disks. Compared with a mirrored stripe scheme (RAID 10), when writing a full stripe the RAID 5 engine writes to N+1 disks and the RAID 10 engine to 2*N disks. Two mechanisms lead to an MR3 write:

– Large I/O stripe detection: if a large I/O is received, the RAID engine detects whether it can fill a stripe, and if so writes the stripe out in MR3 fashion.

– Sequential write detection: the RAID engine detects that data being written to the LUN is sequential, and delays flushing the cache pages for a stripe until the sequential writes have filled that stripe.
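To make the full stripe write concrete, here is a minimal Python sketch of the parity computation and of the N+1 versus 2*N disk write counts. It only illustrates the arithmetic and is not the Clariion implementation; the function names and the tiny 4-byte stripe units are made up for the example.

```python
from functools import reduce

def xor_parity(units):
    """Byte-wise XOR of equally sized stripe units (illustrative helper)."""
    if not units:
        raise ValueError("need at least one stripe unit")
    size = len(units[0])
    if any(len(u) != size for u in units):
        raise ValueError("stripe units must all have the same size")
    # zip(*units) walks the units column by column, one byte from each
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))

def full_stripe_disk_writes(n_data_disks):
    """Disks written for one full stripe, returned as (RAID 5, RAID 10)."""
    return n_data_disks + 1, 2 * n_data_disks

# Hypothetical 4+1 stripe with 4-byte units, far smaller than any real unit
data = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]),
        bytes([9, 10, 11, 12]), bytes([13, 14, 15, 16])]
parity = xor_parity(data)

# XORing the parity with the surviving units rebuilds a lost unit
rebuilt = xor_parity([parity, data[0], data[1], data[3]])
raid5_writes, raid10_writes = full_stripe_disk_writes(len(data))
```

With four data disks, the full stripe costs five disk writes on RAID 5 and eight on RAID 10, which is why full stripe writes are the one case where RAID 5 writes to fewer disks.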

4. Small writes, i.e. writes to one stripe unit, require that the parity block for the entire stripe be recomputed. Thus, a small write requires reading the old stripe unit and the old parity block (hopefully in parallel), computing the new parity block, and writing the new stripe unit and the new parity block in parallel. On a RAID 5 4+1, a write to one stripe unit therefore requires four I/Os: two reads and two writes. This is known as the small write penalty for RAID 5.
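The read-modify-write sequence of a small write can be sketched as follows. This is a toy in-memory model, with hypothetical names, in which each counted "I/O" stands for one disk access; it relies on the XOR identity new parity = old parity XOR old data XOR new data, so the other three data units never have to be read.

```python
def xor_bytes(*blocks):
    """Byte-wise XOR of equally sized blocks (illustrative helper)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def small_write(stripe, parity, index, new_unit):
    """Overwrite one stripe unit; return (new_parity, disk_io_count)."""
    ios = 0
    old_unit = stripe[index]      # I/O 1: read the old stripe unit
    ios += 1
    old_parity = parity           # I/O 2: read the old parity block
    ios += 1
    new_parity = xor_bytes(old_parity, old_unit, new_unit)
    stripe[index] = new_unit      # I/O 3: write the new stripe unit
    ios += 1
    ios += 1                      # I/O 4: write the new parity block
    return new_parity, ios

# Hypothetical 4+1 stripe with 2-byte units
stripe = [b"\x01\x02", b"\x03\x04", b"\x05\x06", b"\x07\x08"]
parity = xor_bytes(*stripe)
parity, io_count = small_write(stripe, parity, 2, b"\xaa\xbb")
```

After the update, the parity still equals the XOR of all four data units, and the counter shows the four disk I/Os behind a single host write.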

5. If the cache is saturated, RAID 1+0 can sustain more writes before cache flushing increases response time, since a small write costs two disk I/Os on RAID 1+0 against four on RAID 5.
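A back-of-the-envelope sketch of what this means for the disks behind the cache. It assumes every host write is a small write with no coalescing into full stripes, which is a worst case, and the disk count and per-disk I/O rate are hypothetical figures chosen only for the illustration.

```python
# Disk I/Os generated by one small host write:
# RAID 5 reads and rewrites data plus parity, RAID 10 writes both mirrors.
# Assumption: no cache coalescing and no full stripe (MR3) writes.
BACKEND_IOS_PER_SMALL_WRITE = {"raid5": 4, "raid10": 2}

def backend_write_ios(host_writes_per_sec, raid_level):
    """Disk I/Os per second needed to destage the given host write rate."""
    return host_writes_per_sec * BACKEND_IOS_PER_SMALL_WRITE[raid_level]

def max_host_writes(disk_count, ios_per_disk, raid_level):
    """Host writes per second the disks can absorb once the cache is full."""
    return disk_count * ios_per_disk // BACKEND_IOS_PER_SMALL_WRITE[raid_level]

# Hypothetical figures: 10 disks, 150 I/Os per second each
raid5_limit = max_host_writes(10, 150, "raid5")
raid10_limit = max_host_writes(10, 150, "raid10")
```

Under these assumptions, the same set of disks absorbs twice as many small host writes in RAID 1+0 as in RAID 5, at the price of less usable capacity.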

6. Alignment: it is always desirable (but seldom feasible) to align I/Os on stripe element boundaries to avoid disk crossings (RAID stripe misalignment). A single I/O split across two disks actually incurs two I/Os, one on each stripe element. This is even more costly with RAID 5, as there is an additional stripe’s worth of parity to calculate. On Intel architecture systems (at least the Xeon; it is likely to be different for the Itanium), the placement of the Master Boot Record (MBR) at the beginning of each logical device causes the subsequent data structures to be misaligned by 63 sectors of 512 bytes (31.5 KB). The LUN alignment offset must be specified in Navisphere to overcome this problem.
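The effect of the 63-sector offset can be checked with a little arithmetic. The sketch below assumes a stripe element of 128 sectors (64 KB), which is a value picked for the illustration and not taken from any particular array configuration; the function name is hypothetical as well.

```python
SECTOR_BYTES = 512

def elements_touched(io_start, io_sectors, element_sectors=128, partition_offset=63):
    """Number of stripe elements one I/O spans.

    io_start and io_sectors are in sectors, relative to the partition start.
    partition_offset is where the partition begins on the LUN, in sectors.
    element_sectors=128 (64 KB) is an assumed stripe element size.
    """
    first = (partition_offset + io_start) // element_sectors
    last = (partition_offset + io_start + io_sectors - 1) // element_sectors
    return last - first + 1

# A 64 KB I/O at the start of the partition
misaligned = elements_touched(0, 128)                     # default 63-sector offset
aligned = elements_touched(0, 128, partition_offset=128)  # offset on an element boundary

# Eight consecutive 8 KB I/Os (16 sectors each): how many cross a boundary?
crossings = sum(elements_touched(k * 16, 16) > 1 for k in range(8))
```

With the 63-sector offset, every 64 KB I/O that would otherwise fit one element is split across two disks, and under the same assumptions one 8 KB I/O out of eight crosses a boundary. Moving the start of the data to a multiple of the element size removes both effects.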



  1. […] Before jumping on the stripe aggregation tests, you may find it useful to read the first post I wrote on striping and RAID. […]

    Pingback by Join the BAARF Party (or not) (2/2) « Christian Bilien’s Oracle performance and tuning blog — April 21, 2007 @ 7:23 pm

  2. […] I/O/s per disk charged to the writes. You may remember the “small write penalty” from Join the BAARF party..: one OS write will generate two physical reads and two physical writes on disks. Hence, we have 44 […]

    Pingback by Where is the SAN admin ? « Christian Bilien’s Oracle performance and tuning blog — December 3, 2007 @ 10:51 pm
