Christian Bilien’s Oracle performance and tuning blog

March 21, 2007

Storage array service processors (and controllers) performance

Filed under: Storage — christianbilien @ 9:30 pm

Depending on the hardware maker, service processors (SPs) may either be bound to specific volumes or be accessible through any of the array's SPs. While the second category balances the incoming load across SPs, the first requires careful planning: although LUNs may be accessed through both controllers, there are performance differences depending on which path is used. Within the array, each virtual disk is owned by one of the controllers, which provides the most direct path to the virtual disk. EMC's Clariion and HP's EVA both belong to this category, while higher end arrays such as the EMC DMX and HP XP do not require a volume to SP binding at configuration time.

Read I/Os to the owning controller are executed and the data is returned directly to the host. Read I/Os to the non-owning controller must be passed to the owning controller for execution. The data is read and then passed back to the non-owning controller for return to the host.

Because write I/Os always involve both controllers for cache mirroring, write performance will be the same regardless of which controller receives the I/O.

Ownership transitions

Both the Clariion and the EVA allow explicit ownership transitions from one controller to the other. However, an implicit transfer may also occur if one of the following conditions is met:

  • EVA only: the array detects a high percentage of read I/Os processed by the non-owning controller.
  • Both arrays: a controller fails and the remaining controller assumes ownership of its LUNs.

 

Clariion LUN to SP/port mapping:

For a given Clariion WWN, you can find the SP and port number through which the LUN is accessed:

fcinfo on a Solaris host gives for LUN 0:

LUN: 0
Vendor: DGC
Product: RAID 10
OS Device Name: /dev/rdsk/c5t5006016839A0166Ad0s2

The special file name contains the Clariion WWN. Ignore 5006016, which always appears at the beginning of a Clariion WWN, and take the 8th digit (8 here). It is a hex digit from 0 to F: values 0 through 7 are for SP A, and 8 through F are for SP B. Values below 8 (SP A) give the port number directly, while 8 must be subtracted from values between 8 and F: 5 is SP A port 5, c is SP B port 4, and the 8 in the example above is SP B port 0.
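
For illustration, here is a minimal Python sketch of that digit arithmetic (the helper name is mine, not part of any Navisphere or Solaris tool):

def clariion_sp_port(wwn):
    """Map a Clariion port WWN such as '5006016839A0166A' to (SP, port)."""
    digit = int(wwn[7], 16)        # 8th hex digit, 0 through F
    if digit <= 7:
        return "SP A", digit       # 0-7: SP A, port number used as such
    return "SP B", digit - 8       # 8-F: SP B, subtract 8 to get the port

print(clariion_sp_port("5006016839A0166A"))   # LUN 0 above -> ('SP B', 0)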

 


February 16, 2007

Service time at Disk Arrays

Filed under: Models and Methods,Storage — christianbilien @ 7:20 pm

Reliability and increased performance in disk access are obtained by using RAID arrays. Many options and trade-offs exist when it comes to organizing those disk arrays, and models can easily be built to avoid common pitfalls, justify RAID 10, and so on. As I could not write a regular post entry because of the math figures, I went for a note instead: it is intended to present the basic issues of I/O bottleneck removal and the fundamental laws on which the I/O models are based: http://www.enezis.com/Disk_maths.pdf
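
As a flavour of the kind of law the note relies on (this example is mine, not taken from the PDF), here is the classic single-server queueing approximation for a disk: response time inflates sharply as utilization approaches 1.

def disk_response_time_ms(service_time_ms, io_per_s):
    """M/M/1-style approximation: R = S / (1 - U), with U = arrival rate * S."""
    utilization = io_per_s * (service_time_ms / 1000.0)
    assert utilization < 1.0, "the disk is saturated"
    return service_time_ms / (1.0 - utilization)

# A 6 ms disk serving 100 I/O/s is 60% busy and responds in about 15 ms.
print(round(disk_response_time_ms(6.0, 100), 1))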

February 11, 2007

Join the BAARF Party (or not) (2/2)

Filed under: Oracle,Storage — christianbilien @ 9:33 pm

The mere fact that full stripe aggregation can be done by modern storage arrays (as claimed by vendors) would remove the RAID 5 write overhead, making RAID 10 less attractive (or not attractive at all) and thereby dramatically reducing storage costs.

Is it true?

Before jumping on the stripe aggregation tests, you may find it useful to read the first post I wrote on striping and RAID.

I tried to prove, with pure sequential writes on an EMC CX600, that full stripe aggregation exists for small stripes (5*64K = 320K) on a RAID 5 5+1 (i.e. almost no small write penalty has to be paid). However, once you increase the stripe size (up to a RAID 5 10+1), the full stripe aggregation just vanishes.

Test conditions:

The Navisphere Performance Analyzer is used for measuring disk throughputs. It does not provide any metric that shows whether full stripe aggregation is performed or not, so I just generated write bursts and did some maths (I'll develop this in another blog entry), based on the reads that RAID 5 writes are expected to generate.

Number of reads generated by writes on a RAID 5 4+1 array, where:

n: number of stripe units modified by a write request

r: number of stripe units read as a result of a request to write n stripe units.

n   r   Explanation
1   2   Read one stripe unit and the parity block
2   2   Read the two remaining stripe units to compute the parity
3   1   Read the one remaining stripe unit to compute the parity
4   0   No additional reads needed

In this way, you can compute the number of reads (r) as a function of the number of stripe units written (n).
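
One way to reproduce the table for an arbitrary RAID 5 group (my own formulation, assuming the controller picks the cheaper of read-modify-write and reconstruct-write):

def raid5_reads_per_write(n, data_disks=4):
    """Reads generated when n stripe units of a data_disks-wide RAID 5 stripe are written."""
    assert 1 <= n <= data_disks
    read_modify_write = n + 1              # re-read the old data units and the old parity
    reconstruct_write = data_disks - n     # re-read the untouched data units instead
    return min(read_modify_write, reconstruct_write)

for n in range(1, 5):
    print(n, raid5_reads_per_write(n))     # 1->2, 2->2, 3->1, 4->0, as in the table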

The stripe unit is always 64K; the number of columns in the RAID group will be 5+1 in test 1 and 11+1 in test 2.

Test 1:

Small stripes (320K) on a RAID 5 5+1

I/O generator: 64 writes/s for an average throughput of 4050 KB/s. The average I/O size is therefore 64 KB. No reads occur. The operating system buffer cache has been disabled.

Assuming write aggregation, writes should be performed as 320KB (full stripe) units. Knowing the operating system I/O rate (64/s), and assuming that 5 OS I/Os are aggregated into one, we can calculate that the RAID group I/O rate should be 64/5 = 12.8 I/O/s. Analyzer gives an average throughput of 13.1 writes/s for a particular disk. Write aggregation also means that no reads should be generated; the analyzer reports 2.6 reads/s on average. Although very low, this shows that write aggregation may not always be possible depending on particular cache situations.
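
The same back-of-the-envelope check in a few lines of Python (my own arithmetic, using only the figures quoted above):

os_writes_per_s = 64                 # 64 KB host writes per second
data_disks = 5                       # RAID 5 5+1, 64 KB stripe unit
full_stripes_per_s = os_writes_per_s / data_disks
# With aggregation, every disk (data and parity alike) is written once per full
# stripe, so each disk should see about 12.8 writes/s and essentially no reads.
print(full_stripes_per_s)            # 12.8, against 13.1 writes/s and 2.6 reads/s measured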

Test 2:

Large stripes (768K)

82MB are sequentially written every 10s in bursts.

1. Write throughput at the disk level is 12.6 writes/s and read throughput is 10.1 reads/s. Knowing that no reads are being sent from the OS, those figures alone show that full stripe write aggregation cannot be done by the Clariion.

2. However, some aggregation (above 5 stripe units per write) is done, as read throughput would otherwise be even higher: no aggregation at all would mean an extra 25 reads per second. The small write penalty does not vary greatly anyway once the number of stripe units per write is above 4.

February 10, 2007

Join the BAARF Party (or not) (1/2)

Filed under: Oracle,Storage — christianbilien @ 6:38 pm

This is the title of the last chapter of “Oracle Insights, Tales of the Oak Table”, which I just finished reading. The “Battle Against Any Raid Five” (http://www.baarf.com) was also mentioned in “Oracle Wait Interface: A Practical Guide to Performance Diagnostics & Tuning”. So is RAID 5 really so bad? I'll start with generalities, and I'll put some stripe aggregation tests I did in a second post.

1. Large reads, i.e. reads of a full stripe group, can be done in parallel by reading from the four data disks that contain the stripe group (assuming a RAID 5 4+1 group, as in the rest of this list).

2. Small reads, i.e. reads of one stripe unit, exhibit good performance because they only tie up one disk, therefore allowing other small reads to other disks to proceed in parallel.

3. Large writes, i.e. writes of a full stripe group, require writing to five disks: the four data disks and the parity disk for the stripe. A RAID 5 MR3 optimization is implemented, for example, in EMC Clariions such as the CX700: it delays flushing the cache until a RAID 5 stripe has filled, at which time a modified RAID 3 write (MR3) is performed. The steps are: the entire stripe is XORed in memory to produce a parity segment, then the entire stripe, including the new parity, is written to the disks. In comparison to a mirrored stripe scheme (RAID 10), when writing a full stripe the RAID 5 engine writes to N+1 disks, while RAID 10 writes to 2*N disks.

– Large I/O stripe detection: if a large I/O is received, the RAID engine detects whether it can fill a stripe, and writes the stripe out in MR3 fashion.

– Sequential write detection: the RAID engine detects sequential data being written to the LUN and delays flushing the cache pages for a stripe until the sequential writes have filled it.

4. Small writes, i.e. writes to one stripe unit, require that the parity block for the entire stripe be recomputed. Thus, a small write requires reading the stripe unit and the parity block (hopefully in parallel), computing the new parity block, and writing the new stripe unit and the new parity block in parallel. On a RAID 5 4+1, a write to one stripe unit therefore actually requires four I/Os. This is known as the small write penalty of RAID 5 (see the sketch after this list).

5. If the cache is saturated, RAID 1+0 allows more writes to the system before cache flushing increases response time.

6. Alignment: it is always desirable (but seldom feasible) to align I/Os on stripe element boundaries to avoid disk crossings (RAID stripe misalignment). A single I/O split across two disks will actually incur two I/Os on two different stripes. This is even more costly with RAID 5, as there is an additional stripe's worth of parity to calculate. On Intel architecture systems (at least the Xeon; it is likely to be different for the Itanium), the placement of the Master Boot Record (MBR) at the beginning of each logical device causes the subsequent data structures to be misaligned by 63 sectors (512-byte blocks). The LUN alignment offset must be specified in Navisphere to overcome this problem.
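
Putting points 3 and 4 together, here is a small Python sketch (my own summary, not a Clariion algorithm) of the back-end I/Os generated per host write on a RAID 5 N+1 group versus a RAID 10 group built from N mirrored pairs:

def backend_ios_per_write(kind, data_disks=4):
    """kind is 'small' (one stripe unit) or 'full' (a whole stripe); returns (RAID 5, RAID 10)."""
    if kind == "small":
        raid5 = 4                     # read data, read parity, write data, write parity
        raid10 = 2                    # write the two mirrored copies
    else:                             # full stripe, parity computed in cache (MR3-style)
        raid5 = data_disks + 1        # N data writes plus one parity write
        raid10 = 2 * data_disks       # every data write is mirrored
    return raid5, raid10

print(backend_ios_per_write("small"))   # (4, 2): the RAID 5 small write penalty
print(backend_ios_per_write("full"))    # (5, 8): full stripe writes favour RAID 5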

Storage array cache sizing

Filed under: Oracle,Storage — christianbilien @ 11:28 am

I often have to argue or even fight with storage providers about cache sizing. Here is a set of rules I apply in proofs of concept and disk I/O modeling.

Write cache size:

1. Cache sizing: the real issue is not the cache size, but how fast the cache can flush to disk. In other words, assuming sustained I/O, the cache will fill and I/O will bottleneck if this condition is met: the rate of incoming I/O is greater than the cache's ability to flush.

2. The size of a write cache matters when the array must handle a burst of write I/O. A larger cache can absorb bigger bursts of writes, such as database checkpoints, so that the burst can be contained without hitting a forced flush (the sketch after this list puts rough numbers on this).

3. Write caching from one SP to the other is normally activated for redundancy and single point of failure removal. That is, the write cache in each SP contains both the primary cached data for the logical units it owns and a secondary copy of the cache data for the LUNs owned by its peer SP. In other words, SP1's write cache holds a copy of SP2's write cache and vice versa. Overall, the real write cache size (seen from a performance point of view) is half the configured write cache.

4. Write caching is used for RAID 5 full stripe aggregation (when the storage firmware supports this feature) and parity calculation, a very useful feature for many large environments. Lack of space must not force the cache to destage.
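
A minimal sketch of rules 1 to 3 (my own model and numbers, not a vendor formula): how long a write burst can be absorbed before the mirrored cache fills and forced flushing starts.

def seconds_before_forced_flush(cache_mb, incoming_mb_s, flush_mb_s):
    usable_mb = cache_mb / 2.0              # rule 3: the cache is mirrored across SPs
    net_fill_mb_s = incoming_mb_s - flush_mb_s
    if net_fill_mb_s <= 0:
        return float("inf")                 # rule 1: the cache never fills up
    return usable_mb / net_fill_mb_s        # rule 2: bigger caches absorb longer bursts

# 4 GB of write cache, a 300 MB/s checkpoint burst, disks able to flush 180 MB/s
print(round(seconds_before_forced_flush(4096, 300, 180)))   # about 17 seconds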

Read cache size:

1. Random reads have little chance of being in cache: the storage array read cache is unlikely to provide much value on top of the SGA and possibly the file system buffer cache (depending on FILESYSTEMIO_OPTIONS and file system direct I/O mount options).

2. However, array read caches are very useful for prefetches. I can see three cases where this occurs:

  • Oracle direct reads (or direct path reads) are operating system asynchronous reads (I'll write a new post on this later). Assuming a well designed back end, you will be able to use a large disk bandwidth and take advantage of large numbers of parallel disk I/Os.
  • Buffered reads (imports for example) will take advantage both of file system and storage array read ahead operations.

Also take a look at http://www.oracle.com/technology/deploy/availability/pdf/oow2000_same.pdf
