# Christian Bilien’s Oracle performance and tuning blog

## February 12, 2008

### The “log file sync” wait event is not always spent waiting for an I/O

Filed under: Oracle,Storage — christianbilien @ 4:41 pm

I went through a rather common problem some time ago, which shows how easily one can get fooled by the “log file sync” wait event. Here is a log file sync wait histogram which was interpreted by the DBAs as a poor storage array response time, often a good way to get off the hook of a performance problem:

This looks like a Gaussian distribution with a mean value of maybe 30ms. The same event recorded some weeks earlier shows a mean value located between 8 and 16 ms.

The average log file sync wait reached 101ms over an hour:

Event                        Avg wait Time (ms)
------------------------------------------
log file sync                       101

The storage array write destination is a rather loaded EMC DMX1000, and the multiplexed log files are SRDF mirrored. SRDF is the EMC protocol which replicates I/Os from one array to another one. The replication is synchronous here, meaning the remote cache array has to acknowledge each write to the local one before returning a completion status to the LGWR. The SRDF write time must be added to the average write time to reflect the array response time. Some overhead also exists in the OS layer and possibly at the SAN level. Both redo log groups have similar write access patterns. The SRDF write time is close to 30ms over this period of time, a very bad figure. The average write time on the array is 4ms.

This is where a bit of digging may be interesting:

All calls are emitted from a single program which basically does some small transactions and commits. We’ll consider that no condition other than a commit triggers a write into the redo log.

The average log file parallel write wait over the same hour is not good either, but far from the 101ms log file sync:

Event                          Avg wait Time (ms)
------------------------------------------
log file parallel write              37

This fits with the array I/O times.

A quick recap on the ‘log file sync’ wait event may be necessary here:

The log file sync wait event is the time the foreground process spends waiting for the redo to be flushed. It can be assumed here that the LGWR is always active: it does not have to be awakened by the system dispatcher. The log file sync wait can be broken into:

1. The ‘redo write time’: the total elapsed time of the write from the redo log buffer to the current redo log file (reported in centiseconds).
2. The “log file parallel write”: basically the time for the log write I/O to complete.
3. The LGWR may have some post-processing to do, then signals the waiting foreground process that the write has completed. The foreground process is finally woken up by the system dispatcher. This completes the ‘log file sync’ wait.

I also extracted two other log write statistics from the awr report:

Statistic          Total     per Second     per Trans
-----------------------------------------------------------------------------
redo write time  275,944           77.6           0.2
user commits   1,163,048          327.1           1.0

The “redo write time” statistic is reported in centiseconds, so 77.6 cs/s = 776 ms/s: 776/327 = 2.3ms of “redo write time” per user commit.

Step 3 therefore accounts for 101 − 2.3 − 37 = 61.7ms. The most likely cause of this missing time, the main contributor to the log file sync, can be found here:

sar -q -f /var/adm/sa/sa07
         runq-sz %runocc swpq-sz %swpocc
05:00:01   15.9      48     0.0      0
05:05:01   24.1      42     0.0      0
05:10:01   25.2      46     0.0      0
05:15:00   15.5      38     0.0      0
05:20:01   15.6      43     0.0      0
05:25:00   14.4      38     0.0      0
05:30:00   17.2      36     0.0      0
05:35:00   18.0      47     0.0      0
05:40:00   16.9      38     0.0      0
05:45:01   12.3      33     0.0      0
05:50:00   10.1      41     0.0      0
05:55:00   11.4      39     0.0      0
06:00:00   11.1      47     0.0      0

Look at the first column, which shows the CPU run queue size. Many processes are competing for the CPUs. Although we could carry on the investigation by measuring how much time foreground processes are spending in the run queue, we can at this point conclude that the processes are spending about two thirds of the log file sync wait times waiting for the CPU, not for the I/O subsystem.
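The whole breakdown can be condensed into a few lines. This is just a sketch replaying the AWR figures quoted above; the small difference from the 61.7ms figure comes from rounding the per-commit redo write time:

```python
# Break the average "log file sync" wait into its components, using the
# AWR figures quoted above (all times in milliseconds).
log_file_sync = 101.0            # avg foreground wait per commit
log_file_parallel_write = 37.0   # avg LGWR I/O wait

redo_write_cs_per_sec = 77.6     # "redo write time" is in centiseconds
commits_per_sec = 327.1

# 77.6 cs/s = 776 ms of redo write time per second of workload
redo_write_ms_per_commit = redo_write_cs_per_sec * 10 / commits_per_sec

# what is left is mostly step 3: post-processing and run queue time
scheduling_ms = log_file_sync - redo_write_ms_per_commit - log_file_parallel_write

print(round(redo_write_ms_per_commit, 1))  # 2.4
print(round(scheduling_ms, 1))             # 61.6, about two thirds of the wait
```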

It would also be interesting for the sake of completeness to understand why the %runocc is relatively small compared to the average run queue size.

## December 3, 2007

### Where is the SAN admin ?

Filed under: Storage — christianbilien @ 10:07 pm

Many performance assessments start with the unpleasantness of having to guess a number of configuration items for lack of available information and/or knowledge. Whilst the server which hosts the database usually quickly delivers its little secrets, the storage configuration information is frequently more difficult to obtain from a remote administrator who has to manage thousands of LUNs. Many databases suffer from other I/O contributors to the storage network and arrays, not to mention absurdities in the DB to storage array mapping.

Here are some little tricks which may be of interest to the information hungry analyst.

This case involves a slow Hyperion database. Of course this may seem quite remotely related to Oracle technologies (although Hyperion – the company – was part of the Oracle buying frenzy) but it still brings the thoughts and ideas I’d like to share.

The database is a 2GB “cube” which only stores aggregations. The server is a 4-core rp3440 with 8GB of memory running HP-UX 11iv2; the storage box is an EMC Clariion CX400. The cube is stored on a single 4GB (!) LUN, RAID 5 9+1 (nine columns plus one parity stripe unit). The storage is outsourced: no throughput or I/O/s calibration was done in the first place, and the outsourcer probably gave the customer an available LUN without further consideration (I’m sure this sounds familiar to many readers). There is no performance tool on the Clariion. We know at least that the raid group is not used by other LUNs. As a Hyperion consultant has already been through the DB elements, we’ll only focus on providing the required I/O bandwidth to the DB.

We’ll try to answer the following: if we knew what the requested DB I/O rate was, which RAID configuration should we ask the outsourcer for?

The figures I got at a glance on the server are stable over time:

From sar –u:

System: 30%
User: 20%
Wait for I/O: 50%
Idle (and not waiting for I/O): 0%

The memory freelist length shown by vmstat indicates that less than half the memory is in use.

From sar –d and Glance UX (an HP performance tool)

%busy: 100
average lun queue length: 40
average time spent in the lun queue: 25ms
average lun read service time: 6ms

As the LUN is obviously going at its maximum I/O/s (for the given mix of reads and writes), we can get a maximum read rate per disk of 1/6ms = 167 reads/s. Here I made the assumption that reads do not spend a significant time being serviced by the array processors and that the SAN does not introduce any additional delay.

We can also derive from the 800 OS reads/s an average read rate of about 80 I/O/s per disk, which leaves 167 − 80 = 87 I/O/s per disk charged to the write calls. You may remember the “small write penalty” from Join the BAARF party..: one OS write generates two physical reads and two physical writes on disk. Hence, we have 44 disk reads and 44 disk writes per disk generated by the OS writes (I rounded up 43.5 to 44). This is approximately consistent with 20 OS writes/s = 40 disk reads + 40 disk writes if no stripe aggregation exists. The 10% margin of error can be ignored here.
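The small write penalty arithmetic can be checked in a few lines (a sketch using the measured figures above: 6ms read service time, 800 OS reads/s over the 10 disks of the raid group):

```python
# Check the RAID 5 small write penalty arithmetic, per disk of the
# 9+1 raid group (10 disks).
max_iops_per_disk = 1000 / 6           # 6 ms read service time -> ~167 I/O/s
os_reads_per_disk = 800 / 10           # 800 OS reads/s spread over 10 disks

left_for_writes = max_iops_per_disk - os_reads_per_disk   # ~87 I/O/s

# one OS write = read old data + read old parity + write both back,
# i.e. 2 disk reads + 2 disk writes
implied_os_writes_per_disk = left_for_writes / 4

print(round(left_for_writes))             # 87
print(round(implied_os_writes_per_disk))  # 22, close to the ~20 quoted above
```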

Knowing the maximum I/O rate is about 167/s (this should not be taken too literally – it could be rounded to 150 for example), we can now play with various configurations by computing theoretical disk I/O rates. We’ll rule out the configurations for which the disk I/O rate exceeds a 150 I/O/s threshold:

Per-disk I/O rate by number of RAID 5 columns:

| Lun I/O rate | 10 columns | 12 columns | 14 columns | 16 columns |
|---|---|---|---|---|
| 1000 | 167 | 139 | 119 | 104 |
| 2000 | 333 | 278 | 238 | 208 |
| 3000 | 500 | 417 | 357 | 313 |
| 4000 | 667 | 556 | 476 | 417 |
| 5000 | 833 | 694 | 595 | 521 |

Per-disk I/O rate by number of RAID 10 disks:

| Lun I/O rate | 20 disks | 24 disks | 28 disks | 32 disks |
|---|---|---|---|---|
| 1000 | 61 | 51 | 44 | 38 |
| 2000 | 122 | 102 | 87 | 76 |
| 3000 | 183 | 153 | 131 | 115 |
| 4000 | 244 | 204 | 175 | 153 |
| 5000 | 306 | 255 | 218 | 191 |

| Lun I/O rate | Raid 5 5 col + 4 LVM cols | Raid 5 10 col (*) + 4 LVM cols |
|---|---|---|
| 1000 | 83 | 42 |
| 2000 | 167 | 83 |
| 3000 | 250 | 125 |
| 4000 | 333 | 167 |
| 5000 | 417 | 208 |

(*) 10 columns=20 disks
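The tables above can be regenerated with a short script. This is only a sketch: the 80/20 read/write mix is an assumption derived from the measurements above, so the computed figures land within a few percent of the published tables (which were presumably built with a slightly different mix):

```python
# Theoretical per-disk I/O rate for candidate layouts, assuming an
# 80% read / 20% write mix (an assumption, see lead-in).
def raid5_disk_rate(lun_iops, columns, read_frac=0.8):
    # each write costs 2 disk reads + 2 disk writes (small write penalty)
    disk_ios = lun_iops * read_frac + 4 * lun_iops * (1 - read_frac)
    return disk_ios / columns

def raid10_disk_rate(lun_iops, disks, read_frac=0.8):
    # each write costs 2 disk writes (one per mirror side)
    disk_ios = lun_iops * read_frac + 2 * lun_iops * (1 - read_frac)
    return disk_ios / disks

for lun_iops in (1000, 2000, 3000, 4000, 5000):
    print(lun_iops,
          round(raid5_disk_rate(lun_iops, 10)),
          round(raid10_disk_rate(lun_iops, 20)))
```

Any row where the per-disk rate exceeds the ~150 I/O/s threshold is ruled out.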

## October 22, 2007

### Tale of a not-so-acute risk assessment

Filed under: Storage — christianbilien @ 9:41 am

One of my customers experienced a storage outage this weekend, exceptional enough to be worth mentioning here:

I designed with the storage guys a 4-director, dual-fabric configuration, deemed to be the safest configuration you can think of in terms of SAN topology. As each server is connected to two fabrics, a “logical” problem (a DSN bug for example) in one fabric would not impact server availability. Each fabric is spread over two sites, and each database server is clustered between the two sites. Each director has a number of high availability features such as dual AC power.

A site failure triggers cluster failovers.

The risks, threats and vulnerabilities were assessed but… no one (including me of course) ever considered this category of bug on the Brocade switches:

http://sunsolve.sun.com/search/document.do?assetkey=1-26-101607-1

Sun Alert ID: 101607 (formerly 57687)

Synopsis: A Switch or Director With Fabric OS 4.2.x or Earlier May Panic if it Has Been Up For More Than 497.1 Days

Category: Availability

BugIDs: 6197589

Avoidance: Patch, Workaround

State: Resolved

Date Released: 24-Nov-2004, 30-Mar-2005

Date Closed: 30-Mar-2005

Date Modified: 30-Mar-2005, 01-Mar-2006

The “may panic” should actually be read as “will panic”! All switches had been powered on exactly 497 days earlier, when they were installed. As the switch startup times were approximately identical on each site, both pairs of switches panicked at nearly the same time this weekend, causing a first failover from site A to site B, followed one hour later by a failback from B to A.

Moral of the story: just like Windows in the old days, reboot the SAN directors every x months (just joking – only proactive patching could have prevented this chain of incidents).

## August 16, 2007

### Workload characterization for the uninformed capacity planner

Filed under: HP-UX,Models and Methods,Oracle,Solaris,Storage — christianbilien @ 7:32 pm

Doug Burns initiated an interesting thread a while ago about user or application workloads, their meanings and the difficulties associated with their determination. But workload characterization is both essential and probably the hardest and most error-prone part of the whole forecasting process. Models that fail to validate (i.e. are not usable) most of the time fall into one of these categories:

• The choice of characteristics and parameters is not relevant enough to describe the workloads and their variations
• The analysis and reduction of performance data was incorrect
• Data collection errors, misinterpretations, etc.

Unless you already know the business environment and the applications, or some previous workload characterization is already in place, you are facing a blank page. You can always try to do the smart workload partition along functional lines, but this effort is unfortunately often preposterous and doomed to failure because of time constraints. So what can be done?

I find clustering analysis a good compromise between time to deliver and fidelity to the business transactions. Caveat: this method ignores any data cache (storage array, Oracle and file system cache, etc.) and locks/latches or any other waits unrelated to resource waits.

A simple example will explain how it works:

Let’s assume that we have a server with a single CPU and a single I/O path to a disk array. We’ll represent each transaction running on our server by a couple of attributes: the service time each of these transactions requires from the two physical resources. In other words, each transaction will require in absolute terms a given number of seconds of presence on the disk array and another number of seconds on the CPU. We’ll call a required service time a “demand on a service center” to avoid confusion. The sum of those two values would represent the response time on an otherwise empty system, assuming no interaction occurs with any other external factor. As soon as you start running concurrent transactions, you introduce on one hand waits on locks, latches, etc. and on the other hand queues on the resources: the sum of the demands is no longer the response time. Any transaction may of course visit each resource several times: the sum of the times spent using each service center will simply equal the demand.

Let us consider that we are able to collect the demands each single transaction j requires from our two resource centers. We’ll name
${D}_{j1}$ the CPU demand and ${D}_{j2}$ the disk demand of transaction j. Transaction j can now be represented by a two-component workload: ${w}_{j}=({D}_{j1},{D}_{j2})$. Let’s now start the collection. We’ll collect over time every ${w}_{j}$ that goes through the system. Below is a real 300-point collection on a Windows server. I cheated a little bit because there are four CPUs on this machine, but we’ll just say a single queue represents the four CPUs.

The problem is now obvious: there is no natural grouping of transactions with similar requirements. Another attempt can be made using natural (Napierian) logarithms to distort the scales:

This is not good enough either to identify meaningful workloads.

The Minimum Spanning Tree (MST) method can be used to perform successive fusions of data until the wanted number of representative workloads is obtained. It begins by considering each workload point to be a cluster. Next, the two clusters with the minimum distance are fused to form a single cluster. The process iterates until the desired number of clusters is reached.

• Distance: let’s assume two workloads represented by ${w}_{i}=({D}_{i1},{D}_{i2},...,{D}_{iK})$ and ${w}_{j}=({D}_{j1},{D}_{j2},...,{D}_{jK})$. I moved from just two attributes per workload to K attributes, which correspond to service times at K service centers. The Euclidean distance between the two workloads is $d=\sqrt{\sum_{n=1}^{K}({D}_{in}-{D}_{jn})^{2}}$.
• Each cluster is represented at each iteration by its centroid whose parameter values are the means of the parameter values of all points in the cluster.

Below is a 20-point reduction of the 300 initial points. In real life, thousands of points are used to avoid outliers and average the transactions.
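For the curious, the fusion process described above can be sketched in a few lines of Python (centroid-linkage agglomeration on a handful of hypothetical (CPU demand, disk demand) pairs; a real reduction would run on thousands of collected points):

```python
import math

# Start with every workload point as its own cluster, then repeatedly fuse
# the two closest clusters (by centroid distance) until the desired number
# of representative workloads remains.
def centroid(cluster):
    k = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(k))

def distance(a, b):
    # Euclidean distance between two workloads
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fuse(points, target):
    clusters = [[p] for p in points]
    while len(clusters) > target:
        # pick the pair of clusters with the minimum centroid distance
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: distance(centroid(clusters[ab[0]]), centroid(clusters[ab[1]])),
        )
        clusters[i] += clusters.pop(j)
    return [centroid(c) for c in clusters]

# Hypothetical (CPU demand, disk demand) pairs, in seconds
workloads = [(0.1, 0.2), (0.12, 0.22), (1.5, 0.3), (1.6, 0.28), (0.5, 2.0)]
print(fuse(workloads, 3))
```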

## July 2, 2007

### Asynchronous checkpoints (db file parallel write waits) and the physics of distance

Filed under: HP-UX,Oracle,Solaris,Storage — christianbilien @ 5:15 pm

The first post (“Log file write time and the physics of distance”) devoted to the physics of distance targeted log file writes and “log file sync” waits. It assumed that:

• The percentage of bandwidth occupied by all the applications sharing the pipe was negligible
• No other I/O subsystem waits were occurring.
• The application streams writes, i.e. it is able to issue an I/O as soon as the channel is open.

This set of assumptions is legitimate if the application is indeed “waiting” (i.e. not consuming CPU) on log file writes but not on any other I/O related events, and the fraction of available bandwidth is large enough for a frame not to be delayed by other applications sharing the same pipe, such as an array replication.

Another common Oracle event is the checkpoint completion wait (db file parallel write). I’ll try to explore in this post how the replication distance factor influences checkpoint durations. Streams of small transactions make the calling program synchronous with the log file writes, but checkpoint writes are much less critical by nature because they are asynchronous from the user program perspective. They only negatively influence the response time when “db file parallel write” waits start to appear. The checkpoint I/Os are in fact doubly asynchronous, because the I/Os are also asynchronous at the DBWR level.

1. Synchronous writes: relationship of I/O/s to throughput and percent bandwidth

We did some maths in figure 3 of “Log file write time and the physics of distance” aimed at calculating the time to complete a log write. Let’s do the same with larger writes over a 50km distance on a 2Gb/s FC link. We’ll also add a couple of columns: the number of I/O/s and the fraction of used bandwidth. 2Gb/s = 200MB/s because each byte is encoded as 10 bits on the wire (8b/10b encoding).

Figure 1: throughput and percent bandwidth as a function of the I/O size (synchronous writes)

| I/O size (KB) | Time to load (ms) | Round trip latency (ms) | Overhead (ms) | Time to complete an I/O (ms) | I/O/s | Throughput (MB/s) | Percent bandwidth |
|---|---|---|---|---|---|---|---|
| 2 | 0.054 | 0.5 | 0.6 | 1.154 | 867 | 1.7 | 0.8% |
| 16 | 0.432 | 0.5 | 0.6 | 1.532 | 653 | 10.2 | 5.1% |
| 32 | 0.864 | 0.5 | 0.6 | 1.964 | 509 | 15.9 | 8.0% |
| 64 | 1.728 | 0.5 | 0.6 | 2.828 | 354 | 22.1 | 11.1% |
| 128 | 3.456 | 0.5 | 0.6 | 4.556 | 219 | 27.4 | 13.7% |
| 256 | 6.912 | 0.5 | 0.6 | 8.012 | 125 | 31.2 | 15.6% |
| 512 | 13.824 | 0.5 | 0.6 | 14.924 | 67 | 33.5 | 16.8% |
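The figure can be reproduced with a few lines. This sketch assumes the measured 0.027 ms/KB load slope from the June post, the 0.6 ms fixed overhead, 5 µs/km in fiber, and 200 MB/s of link bandwidth:

```python
# Synchronous writes over 50 km on a 2 Gb/s FC link: one I/O in flight
# at a time, so IOPS = 1000 / (time per I/O in ms).
SLOPE_MS_PER_KB = 0.027
OVERHEAD_MS = 0.6
LINK_MB_S = 200.0

def sync_io(size_kb, distance_km=50):
    load = SLOPE_MS_PER_KB * size_kb
    round_trip = 2 * distance_km * 0.005        # 5 µs/km each way
    total = load + round_trip + OVERHEAD_MS     # ms per I/O
    iops = 1000.0 / total
    throughput = iops * size_kb / 1024.0        # MB/s
    pct = 100.0 * throughput / LINK_MB_S        # fraction of the pipe used
    return total, iops, throughput, pct

for size in (2, 16, 32, 64, 128, 256, 512):
    total, iops, mb, pct = sync_io(size)
    print(size, round(total, 3), round(iops), round(mb, 1), round(pct, 1))
```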

So what change should we expect to the above results if we change from synchronous writes to asynchronous writes?

2. Asynchronous writes

Instead of firing one write at a time and waiting for completion before issuing the next one, we’ll stream writes one after the other, leaving no “gap” between consecutive writes.

Three new elements will influence the expected maximum number of I/O streams in the pipe:

• Channel buffer-to-buffer credits
• Number of outstanding I/Os (if any) the controller can support. This is for example 32 for an HP EVA
• Number of outstanding I/Os (if any) the system, or a SCSI target, can support. On HP-UX, the default number of I/Os that a single SCSI target will queue up for execution is 8; the maximum is 255.

Over 50 km, and knowing that the speed of light in fiber is about 5 microseconds per kilometer, the relationship between the I/O size and the packet length in the pipe is shown in figure 2:

Figure 2: relationship between the I/O size and the packet length in the fibre channel pipe

| I/O size (KB) | Time to load (µs) | Packet length (km) |
|---|---|---|
| 2 | 10.24 | 2 |
| 32 | 163.84 | 33 |
| 64 | 327.68 | 66 |
| 128 | 655.36 | 131 |
| 256 | 1310.72 | 262 |
| 512 | 2621.44 | 524 |

Filling the 50km pipe with 2KB writes requires a capacity of 25 outstanding I/Os, but only one I/O can be active at a time for 128KB packet streams. Again, this statement only holds true if the “space” between frames is negligible.

Assuming a zero gap between 2KB frames, an observation post would see an I/O pass through every 10µs, which corresponds to 100,000 I/O/s. The replication link then stops being the bottleneck, as other limiting factors such as the storage arrays and the computers at both ends take precedence. However, a single 128KB packet will be in the pipe at any given time: the next has to wait for the previous to complete. Sounds familiar, doesn’t it? When the packet size exceeds the window size, replication won’t give any benefit to asynchronous writes, because asynchronous writes behave synchronously.
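The pipe-filling arithmetic above can be sketched as follows (assuming a zero gap between frames and the simplified 200 MB/s link rate, i.e. 5 µs to load 1 KB):

```python
# How many outstanding I/Os does it take to keep a 50 km pipe full?
LIGHT_US_PER_KM = 5.0
LOAD_US_PER_KB = 5.0      # 200 MB/s = 0.2 KB/µs, i.e. 5 µs per KB
DISTANCE_KM = 50.0

def packet_length_km(io_kb):
    load_us = io_kb * LOAD_US_PER_KB      # time for the packet to pass a point
    return load_us / LIGHT_US_PER_KM      # the packet's "length" in the fiber

def streams_to_fill_pipe(io_kb):
    # outstanding I/Os needed so the link never goes idle
    return max(1, round(DISTANCE_KM / packet_length_km(io_kb)))

print(packet_length_km(2))        # 2.0 km
print(streams_to_fill_pipe(2))    # 25 outstanding 2 KB I/Os
print(streams_to_fill_pipe(128))  # 1: a 128 KB packet is longer than the pipe
```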

## June 26, 2007

### Log file write time and the physics of distance

Filed under: HP-UX,Oracle,Solaris,Storage — christianbilien @ 7:46 pm

I already wrote a couple of notes about the replication options available when a production environment is made of different storage arrays (see “Spotlight on Oracle replication options within a SAN (1/2)” and “Spotlight on Oracle replication options within a SAN (2/2)”).

These posts came from a real life experience, where both storage arrays were “intuitively” close enough to each other to ignore the distance factor. But what if the distance is increased? The trade-off seems obvious: the greater the distance, the lower the maximum performance. But what is the REAL distance factor? Not so bad in theory.

I’m still interested in the first place by synchronous writes, namely log file writes and the associated “log file sync” waits. I want to know how distance influences the log file write time in a volume manager (HP-UX LVM, Symantec VxVM, Solaris VM or ASM) mirroring configuration. EMC SRDF and HP’s Continuous Access (XP or EVA) synchronous writes could also be considered, but their protocol seems to need 2 round trips per host I/O. I’ll leave this alone pending some more investigation.

The remote cache must in both cases acknowledge the I/O to the local site to allow the LGWR’s I/O to complete.

1. Load time and the zero distance I/O completion time.

The speed of light in fiber is about 5 microseconds per kilometer, which means 200km costs 1ms one way. The load time is the time for a packet to completely pass any given point in a SAN. A wider pipe allows a packet to be delivered faster than a narrow pipe.

The load time can also be thought of as the length of the packet in kilometers: the greater the bandwidth, the smaller the packet length, and the smaller the packet load time. At 2Gb/s, a 2KB packet (the typical log write size) is about 2 km long, but it would be 2600 km long on a 1.5Mb/s slow link.

Zero distance I/O completion time

The zero distance I/O completion time is made of two components:

• A fixed overhead, commonly around 0.5 ms (the tests made and reproduced below in fig. 1 corroborate the fact that the I/O time on a local device only increases by 10% when the packet size more than doubles). This represents storage array processor time and any delay on the host ports for the smallest packet.
• The load time, a linear function of the packet size.

At the end of the day, the zero distance I/O completion time is :

Slope × packet size + overhead

Here is one of the measurements I reported in the “Spotlight on Oracle replication post” :

Figure 1 : Measured I/O time as a function of the write size for log file writes

| Write size (KB) | I/O time (ms) |
|---|---|
| 2 | 0.66 |
| 5 | 0.74 |

A basic calculation gives:

Slope = (0.74 − 0.66)/(5 − 2) = 0.027 ms/KB

Figure 2 : Effect of the frame size on zero distance I/O completion time :

| Frame size (KB) | Zero distance I/O time (ms) |
|---|---|
| 2 | 0.65 |
| 16 | 1.03 |
| 32 | 1.46 |
| 64 | 2.33 |
| 128 | 4.06 |

A small frame such as a log write will heavily depend upon the overhead, while the slope (itself inversely proportional to the throughput) is predominant for large frames.
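The slope and overhead can be fitted from the two measured points and replayed against figure 2. A sketch: the computed values land within a few hundredths of the table, which was built with the rounded 0.027 slope:

```python
# Fit the zero distance model (slope x packet size + overhead) from the
# two measured points of figure 1, then replay it for various frame sizes.
sizes = [2, 5]           # KB
times = [0.66, 0.74]     # ms

slope = (times[1] - times[0]) / (sizes[1] - sizes[0])  # ~0.027 ms/KB
overhead = times[0] - slope * sizes[0]                 # ~0.61 ms

for frame_kb in (2, 16, 32, 64, 128):
    print(frame_kb, round(slope * frame_kb + overhead, 2))
```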

2. Synchronous I/O time

The transfer round trip (latency) is the last component of the time to complete a single I/O write over distance. It is equal to

2 × distance (km) × 5 µs/km

Figure 3: Time to complete a 2K synchronous write (in ms)

| km | Round trip latency | Time to load | Overhead | Time to complete the log write |
|---|---|---|---|---|
| 10 | 0.1 | 0.654 | 0.6 | 1.354 |
| 20 | 0.2 | 0.654 | 0.6 | 1.454 |
| 30 | 0.3 | 0.654 | 0.6 | 1.554 |
| 40 | 0.4 | 0.654 | 0.6 | 1.654 |
| 50 | 0.5 | 0.654 | 0.6 | 1.754 |
| 60 | 0.6 | 0.654 | 0.6 | 1.854 |
| 70 | 0.7 | 0.654 | 0.6 | 1.954 |
| 80 | 0.8 | 0.654 | 0.6 | 2.054 |
| 90 | 0.9 | 0.654 | 0.6 | 2.154 |
| 100 | 1.0 | 0.654 | 0.6 | 2.254 |
| 110 | 1.1 | 0.654 | 0.6 | 2.354 |
| 120 | 1.2 | 0.654 | 0.6 | 2.454 |
| 130 | 1.3 | 0.654 | 0.6 | 2.554 |
| 140 | 1.4 | 0.654 | 0.6 | 2.654 |
| 150 | 1.5 | 0.654 | 0.6 | 2.754 |

This is quite interesting, as the log writes are only about twice as slow when the distance is multiplied by 15.
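Figure 3 reduces to a one-line model (a sketch reusing figure 3's own 0.654 ms load time for a 2 KB write and its 0.6 ms overhead, with 5 µs/km each way in fiber):

```python
# Time to complete a 2 KB synchronous log write as a function of distance:
# round trip latency + load time + overhead.
LOAD_MS = 0.654
OVERHEAD_MS = 0.6

def log_write_ms(distance_km):
    round_trip = 2 * distance_km * 0.005   # 5 µs/km each way
    return round_trip + LOAD_MS + OVERHEAD_MS

print(round(log_write_ms(10), 3))                      # 1.354
print(round(log_write_ms(150), 3))                     # 2.754
print(round(log_write_ms(150) / log_write_ms(10), 1))  # 2.0: 15x the distance, ~2x slower
```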

## June 19, 2007

### Spotlight on Oracle replication options within a SAN (2/2)

Filed under: Oracle,Solaris,Storage — christianbilien @ 7:57 pm

This post is a follow up to “Spotlight on Oracle replication options within a SAN (1/2)”. This first post was about the available replication options.

I will address in this post a specific performance aspect about which I am very concerned for one of my customers. This is an organization where many performance challenges come down to the commit wait time: the applications trade at the millisecond level, which translates into database log file syncs expressed in hundreds of microseconds. It is a basic DRP requirement that applications must be synchronously replicated over a 2.5 km (1.5 miles) Fibre Channel network between a local and a remote EMC DMX1000 storage array. The multipathing software is Powerpath; the DMX1000 volumes may be mirrored from the local array to the remote one by either VxVM, ASM or SRDF.

Two options may be considered:

• Host based (Veritas VxVM, Solaris Disk Suite or ASM) replication
• Synchronous SRDF replication

All options may not always be available, as RAC installations over the two sites will require host based replication. On the other hand, simple replication with no clustering may use either SRDF or a volume manager replication.

I made some unitary tests aimed at qualifying the SRDF protocol vs. a volume manager replication. Let us just recall that an SRDF mirrored I/O goes into the local storage array cache, and is acknowledged to the calling program only when the remote cache has been updated. A VM mirror is no different in principle: the Powerpath policy dictates that both storage arrays must acknowledge the I/O before the calling program considers it complete.

Test conditions:

• This is a unitary test. It is not designed to reflect an otherwise loaded or saturated environment. The conclusion will however shed some light on what’s happening under stress.
• This test is specifically designed to show what’s happening when performing intense log file writes. The log file write size is usually 2k, but I saw it going up to 5k.
• The test is a simple dd if=/dev/zero of=<target special file> bs=<block size> count=<count>. Reading from /dev/zero ensures that no read waits occur.

Baseline: Local Raw device on a DMX 1000
Throughput=1 powerpath link throughput x 2

| Block size (KB) | I/O/s | I/O time (ms) | MB/s |
|---|---|---|---|
| 2 | 1516 | 0.66 | 3.0 |
| 5 | 1350 | 0.74 | 6.6 |

Test 1: Distant Raw device on a DMX
Throughput=1 powerpath link throughput x 2

| Block size (KB) | I/O/s | I/O time (ms) | MB/s |
|---|---|---|---|
| 2 | 1370 | 0.73 | 2.7 |
| 5 | 1281 | 0.78 | 6.3 |

The distance degradation is less than 10%. This is the I/O time and throughput I expect when I mirror the array volumes by VxVM or ASM.

Test 2: Local raw device on a DMX, SRDF mirrored
Throughput=1 powerpath link throughput x 2

| Block size (KB) | I/O/s | I/O time (ms) | MB/s |
|---|---|---|---|
| 2 | 566 | 1.77 | 1.1 |
| 5 | 562 | 1.78 | 2.7 |

This is where it gets interesting: SRDF doubles the I/O time and halves the throughput.

Conclusion: when you need log file write performance in order to minimize log file sync wait times, use a volume manager (including ASM) rather than SRDF. I believe this kind of result can also be expected under either the EVA or XP Continuous Access. The SRDF mirrored I/Os are bound to be even more impacted by an increasing write load on the storage arrays, as mirroring is usually performed via dedicated ports, which bear the load of all of the writes sent to the storage array. This bottleneck does not exist for the VM replication.

## June 14, 2007

### Spotlight on Oracle replication options within a SAN (1/2)

Filed under: HP-UX,Oracle,Solaris,Storage — christianbilien @ 7:40 pm

Some interesting issues face the many sites wishing to implement replication for databases between two distant sites. One of the major decisions to be taken is HOW the replication will be performed; in other words, what are the options and their pros and cons? I’ll start with generalities and then present some unitary tests performed in a Solaris/ASM/VxVM/EMC DMX environment.

1. The initial consideration is synchronous vs. asynchronous replication.

Synchronous

• Synchronous means that the I/O has to be posted on the remote site for the transaction to be validated. Array based replications, such as HP’s Continuous Access or EMC’s SRDF, will post the I/O from the local array cache to the remote one, then wait for the ack to come back before acknowledging the I/O to the calling program. The main component in the overall response is the time it takes to write from the local cache to the remote cache and for the acknowledgment to come back. This latency is of course not felt by read accesses, but write time is heavily impacted (see the tests at the bottom of this post). The applications heavily waiting on “log file sync” events are the most sensitive to the synchronous write mechanism. I am preparing a post about the distance factor, i.e. how distance impacts response times.
• Another aspect of synchronous replication is the bottleneck the replication will go through. Assuming a couple of 2Gb/s replication ports, the replication bandwidth will be 4Gb/s. It will need to accommodate the whole storage array write throughput, thereby potentially increasing the latency because processors will be busier, I/Os will wait on array cache flushes and on other latches, etc.

Asynchronous

To preserve consistency, asynchronous replication must implement some sequence-stamping that ensures that write operations at the remote node occur in the correct order. Data loss may thus occur with EMC SRDF/A (A stands for asynchronous) or HP’s CA asynchronous, but no data corruption should be experienced.

2. Host based vs. array based replication

Data Guard and volume managers (including the ASM) can be used to mirror the data base volumes from one array to the other one.

Data Guard

Data Guard works over TCP/IP.

Pro:

• IP links are common, relatively cheap and easy to set up.

Cons:

• Synchronous replication over IP means QOS (Quality Of Service) procedures to avoid other services clogging the links.
• The commits must wait for the writes in the remote log file. The remote data base is asynchronously loaded from the remote log files. The more DML intensive the primary data base is, the wider the potential gap.

Volume management

Volume management is the only available option for some geographical clusters. RAC over Sun Cluster, RAC over ASM without 3rd party clusters, and MC/ServiceGuard with the Cluster File System do not offer any other alternative (take a look at “RAC geographical clusters and 3rd party clusters (HP-UX)” for a discussion of RAC on geo clusters).

ASM is also a volume manager here, as it is used for mirroring from one storage array to the other.

Pro:

• Fast (see the unitary tests). They also work better in aggregate: all of the storage array replicated writes go through a set of dedicated ports, which end up bottlenecking on some array processors while others are mostly idle. VM writes are spread over all the array processors. So both scalability and unitary write speed are in favor of volume management mirroring.

Cons:

• Harder to manage and to maintain. Say that you want to configure an ASM with a lot of raid groups. Assuming the power limit is set to 0 to prevent the automatic rebuild of the mirrored raid group (because the rebuild would otherwise occur locally), you’ll have to add each newly created raid group to the rebuild script. Worse, you may forget it and realize one raid group is not mirrored the day the primary storage array fails. The most classic way to fail a cluster switchover is to forget to reference newly created file systems or tablespaces.
• Usually works over Fiber Channel, although FC-IP can be used to extend the link distance.
• No asynchronous replication except for the Veritas Volume Replicator which is to my knowledge the only VM able to perform async writes on a remote array.

Array based replication

Pro:

• Usually easier to manage. The maintenance and switchover tasks may also be offloaded onto the storage team. Host based replication management puts the ball either in the DBA camp (if using ASM) or in the sysadmins’ (for other VMs).
• Asynchronous replication
• Vendors offer remote monitoring
• Snapshots can be made on the distant sites for development, report or other purposes.

Cons:

• Performance as seen above.
• Same limitations with the Fiber Channel.

## March 23, 2007

### Storage array bottlenecks

Filed under: Storage — christianbilien @ 8:39 pm

Even if the internal communications of a switch may be “oversubscribed” when the aggregate speed of the ports exceeds the internal bandwidth of the device, many switch devices (at least the director class) are “non-blocking”, meaning that all ports can operate at full speed simultaneously. I’ll write a post one day on SAN bottlenecks, but for now here is a view of the main hardware bottlenecks encountered in storage arrays:

Host ports:

The available bandwidth is 2 or 4Gb/s (200 or 400MB/s: each byte is encoded as 10 bits on the wire) per port. As load balancing software (Powerpath, MPxIO, DMP, etc.) is most of the time used both for redundancy and load balancing, I/Os coming from a host can take advantage of the aggregated bandwidth of two ports. However, reads can use only one path, while writes are duplicated, i.e. a host write ends up as one write on each host port.

Below is an example of a couple of host ports on an EMC DMX1000 (2Gb/s host ports).

Thanks to Powerpath, the load is well spread over the two ports. Both ports are running at about half of the bandwidth (but queuing theory shows that queues start to be non-negligible when utilization reaches 50% of the available bandwidth).

Array service processors

Depending on the hardware maker, service processors may either be bound to specific volumes, or any of the array SPs may service any volume. I wrote a blog entry some time ago about SP binding. Higher end arrays such as the DMX and XP do not bind LUNs to SPs, whilst the Clariion and EVA do.

Back end controllers

Back end controllers are the access points for the disk FC loops. They also have a given throughput, usually limited anyway by the fact that at a given point in time, the dual ported disks within the FC-AL loop only allow 2 senders and 2 receivers. Below is a DMX1000 controller utilization chart, where almost all disks are running at a minimum of 60% of their available bandwidth, with 30 RAID 10 disks in each loop.

Disks

From a host standpoint, disks can sustain a much higher utilization rate when they sit behind a cache than when they are accessed directly: remember that a disk running at a utilization rate of 50% will queue on average one I/O out of two (seen from the host, the I/O response time will then be on average 50% higher than the disk service time). It is not uncommon to measure disk utilization rates of nearly 100%. This only becomes a problem when the array cache stops buffering the I/Os because space is exhausted.
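The utilization claim can be illustrated with the two classic open-queue approximations (a sketch: M/D/1 matches the "50% higher" figure for constant service times, while M/M/1 is more pessimistic; a real array cache muddies the picture considerably):

```python
# Response time inflation behind a utilized disk (single server,
# no cache in front).
def mm1_response(service_ms, util):
    # M/M/1: exponentially distributed service times
    return service_ms / (1 - util)

def md1_response(service_ms, util):
    # M/D/1: constant service times
    return service_ms * (1 + util / (2 * (1 - util)))

s = 6.0  # ms average disk service time
print(round(md1_response(s, 0.5), 1))   # 9.0  -> 50% higher than the bare service time
print(round(mm1_response(s, 0.5), 1))   # 12.0 -> doubled with exponential service times
print(round(md1_response(s, 0.95), 1))  # 63.0 -> near saturation the queue explodes
```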

## March 22, 2007

### A handy Solaris 10 command: fcinfo

Filed under: Solaris,Storage — christianbilien @ 4:14 pm

One of the most useful new commands I found in Solaris 10 is fcinfo, a command line interface that displays information on the HBA ports of a host, but also many useful bits of information on the connected storage: remote port WWNs, RAID type, link status, etc.

root # fcinfo hba-port -l

HBA Port WWN: 10000000c957d408 ==> Local HBA1

OS Device Name: /dev/cfg/c4

Manufacturer: Emulex

Model: LP11000-E

Type: N-port

State: online

Supported Speeds: 1Gb 2Gb 4Gb

Current Speed: 2Gb

Node WWN: 20000000c957d408

Loss of Sync Count: 37

Loss of Signal Count: 0

Primitive Seq Protocol Error Count: 0

Invalid Tx Word Count: 32

Invalid CRC Count: 0

HBA Port WWN: 10000000c957d512==> Local HBA2

OS Device Name: /dev/cfg/c5

Manufacturer: Emulex

Model: LP11000-E

Type: N-port

State: online

Supported Speeds: 1Gb 2Gb 4Gb

Current Speed: 2Gb

Node WWN: 20000000c957d512

Loss of Sync Count: 41

Loss of Signal Count: 0

Primitive Seq Protocol Error Count: 0

Invalid Tx Word Count: 32

Invalid CRC Count: 0

root # fcinfo remote-port -sl -p 10000000c957d512 ==> Which luns are seen by HBA2 ?

Remote Port WWN: 5006016839a0166a

Active FC4 Types: SCSI

SCSI Target: yes

Node WWN: 50060160b9a0166a

Loss of Sync Count: 1

Loss of Signal Count: 11

Primitive Seq Protocol Error Count: 0

Invalid Tx Word Count: 510

Invalid CRC Count: 0

LUN: 0

Vendor: DGC

Product: RAID 10

LUN: 1

Vendor: DGC

Product: RAID 5

LUN: 2

Vendor: DGC

Product: RAID 10