Christian Bilien’s Oracle performance and tuning blog

June 26, 2007

Log file write time and the physics of distance

Filed under: HP-UX,Oracle,Solaris,Storage — christianbilien @ 7:46 pm

I already wrote a couple of notes about the replication options available when a production environment is spread over different storage arrays (see “Spotlight on Oracle replication options within a SAN (1/2)” and “Spotlight on Oracle replication options within a SAN (2/2)”).

These posts came from a real life experience, where both storage arrays were “intuitively” close enough to each other to ignore the distance factor. But what if the distance is increased? The trade-off seems obvious: the greater the distance, the lower the maximum performance. But what is the REAL distance factor? Not as bad as you might think, at least in theory.

I’m still interested in the first place in synchronous writes, namely log file writes and the associated “log file sync” waits. I want to know how distance influences the log file write time when a volume manager (HP-UX LVM, Symantec VxVM, Solaris Volume Manager or ASM) mirrors the volumes across the two arrays. EMC SRDF and HP’s Continuous Access (XP or EVA) synchronous writes could also be considered, but their protocol seems to need two round trips per host I/O. I’ll leave this alone pending some more investigation.

The remote cache must in both cases acknowledge the I/O to the local site to allow the LGWR’s I/O to complete.

1. Load time and the zero distance I/O completion time.

Load time:

The speed of light in fiber is about 5 microseconds per kilometer, which means that 200 km cost 1 ms one way. The load time is the time it takes for a packet to completely pass any given point in the SAN: a wider pipe delivers a packet faster than a narrow one.

The load time can also be thought of as the length of the packet in kilometers: the greater the bandwidth, the shorter the packet, and the smaller the packet load time. At 2 Gb/s, a 2 KB packet (a typical log write size) is about 2 km long, but it would be about 2,600 km long on a 1.5 Mb/s slow link.
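To put numbers on this, here is a minimal back-of-the-envelope sketch (mine, not from the original post); the only assumptions are 5 µs/km for light in fiber and roughly 10 bits per byte on the wire (8b/10b encoding):

```python
# Rough load time / "packet length" calculator (a sketch, not from the post).
# Assumptions: ~10 bits per byte on the wire (8b/10b) and 5 us/km in fiber.

LIGHT_US_PER_KM = 5          # propagation delay in fiber
WIRE_BITS_PER_BYTE = 10      # 8b/10b encoding assumption

def load_time_us(packet_bytes, link_bits_per_s):
    """Time for the packet to completely pass a given point in the SAN."""
    return packet_bytes * WIRE_BITS_PER_BYTE / link_bits_per_s * 1e6

def packet_length_km(packet_bytes, link_bits_per_s):
    """The same load time expressed as kilometers of fiber."""
    return load_time_us(packet_bytes, link_bits_per_s) / LIGHT_US_PER_KM

print(round(packet_length_km(2048, 2e9), 1))   # 2 KB at 2 Gb/s   -> about 2 km
print(round(packet_length_km(2048, 1.5e6)))    # 2 KB at 1.5 Mb/s -> ~2,700 km, the order of magnitude quoted above
```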

Zero distance I/O completion time

The zero distance I/O completion time is made of two components:

  • A fixed overhead, commonly around 0.5 ms (the tests made in “Spotlight on Oracle replication options within a SAN (2/2)” and reproduced below in fig. 1 corroborate the fact that the I/O time on a local device only increases by about 10% when the packet size more than doubles). This represents storage array processor time and any delay on the host ports for the smallest packet.
  • The load time, a linear function of the packet size.

At the end of the day, the zero distance I/O completion time is:

Slope × Packet size + Overhead

Here is one of the measurements I reported in the “Spotlight on Oracle replication options within a SAN (2/2)” post:

Figure 1: Measured I/O time as a function of the write size for log file writes

Write size (KB)   I/O time (ms)
2                 0.66
5                 0.74

A basic calculation gives:

Slope = (0.74 - 0.66) / (5 - 2) ≈ 0.027 ms per KB
Overhead = 0.66 - 0.027 × 2 ≈ 0.6 ms
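As a quick sanity check, here is a small sketch (mine, not from the post) that recomputes these two constants from the Figure 1 measurements and then applies the linear model; the output matches Figure 2 below:

```python
# Fit "I/O time = slope * write size + overhead" to the two Figure 1 points,
# then apply the (rounded) constants to the frame sizes of Figure 2.
sizes_kb, times_ms = (2, 5), (0.66, 0.74)

slope = (times_ms[1] - times_ms[0]) / (sizes_kb[1] - sizes_kb[0])  # ~0.027 ms/KB
overhead = times_ms[0] - slope * sizes_kb[0]                       # ~0.6 ms
slope, overhead = round(slope, 3), round(overhead, 1)

for size_kb in (2, 16, 32, 64, 128):
    print(f"{size_kb:3d} KB -> {slope * size_kb + overhead:.2f} ms")
```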

Figure 2: Effect of the frame size on zero distance I/O completion time

Frame size (KB)   Time to load (ms)
2                 0.65
16                1.03
32                1.46
64                2.33
128               4.06

A small frame such as a log write is dominated by the fixed overhead, while the load time (whose slope is inversely proportional to the link throughput) predominates for large frames.

2. Synchronous I/O time

The transfer round trip (latency) is the last component of the time needed to complete a single write I/O over distance. It is equal to:

2 × Distance (km) × 5 µs/km
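Figure 3 below simply adds the three components for a 2 KB write. A minimal sketch (mine, reusing the constants measured above: 0.654 ms of load time and 0.6 ms of overhead, as in the table) reproduces its arithmetic:

```python
# Reproduces the Figure 3 arithmetic for a 2 KB synchronous log write
# (a sketch of the model above, not code from the original post).
FIBER_US_PER_KM = 5       # speed of light in fiber, ~5 us per km
TIME_TO_LOAD_MS = 0.654   # zero-distance load time for a 2 KB write
OVERHEAD_MS = 0.6         # array overhead, as in the table below

def log_write_ms(distance_km):
    round_trip_ms = 2 * distance_km * FIBER_US_PER_KM / 1000.0
    return round_trip_ms + TIME_TO_LOAD_MS + OVERHEAD_MS

for km in range(10, 151, 10):
    print(f"{km:3d} km -> {log_write_ms(km):.3f} ms")
```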

Figure 3: Time to complete a 2 KB synchronous write (in ms)

km    Round trip latency   Time to load   Overhead   Time to complete the log write
10    0.1                  0.654          0.6        1.354
20    0.2                  0.654          0.6        1.454
30    0.3                  0.654          0.6        1.554
40    0.4                  0.654          0.6        1.654
50    0.5                  0.654          0.6        1.754
60    0.6                  0.654          0.6        1.854
70    0.7                  0.654          0.6        1.954
80    0.8                  0.654          0.6        2.054
90    0.9                  0.654          0.6        2.154
100   1.0                  0.654          0.6        2.254
110   1.1                  0.654          0.6        2.354
120   1.2                  0.654          0.6        2.454
130   1.3                  0.654          0.6        2.554
140   1.4                  0.654          0.6        2.654
150   1.5                  0.654          0.6        2.754

This is quite interesting: the log writes are only about twice as slow when the distance is multiplied by 15.

June 19, 2007

Spotlight on Oracle replication options within a SAN (2/2)

Filed under: Oracle,Solaris,Storage — christianbilien @ 7:57 pm

This post is a follow up to “Spotlight on Oracle replication options within a SAN (1/2)”. This first post was about the available replication options.

I will address in this post a specific performance aspect I am very concerned about for one of my customers. This is an organization where many performance challenges come down to the commit wait time: the applications trade at the millisecond level, which translates into database log file syncs measured in hundreds of microseconds. It is a basic DRP requirement that applications must be synchronously replicated over a 2.5 km (1.5 mile) Fiber Channel network between a local and a remote EMC DMX 1000 storage array. The multipathing software is PowerPath, and the DMX 1000 volumes may be mirrored from the local array to the remote one by either VxVM, ASM or SRDF.

Two options may be considered:

  • Host based (Veritas VxVM, Solaris Disk Suite or ASM) replication
  • Synchronous SRDF replication

All options may not always be available: RAC installations spanning the two sites require host based replication. On the other hand, simple replication with no clustering may use either SRDF or a volume manager replication.

I made some unitary tests aimed at comparing the SRDF protocol with a volume manager replication. Let us just recall that an SRDF mirrored I/O goes into the local storage array cache, and is acknowledged to the calling program only when the remote cache has been updated. A VM mirror is no different in principle: the PowerPath policy dictates that both storage arrays must acknowledge the I/O before the calling program considers it complete.

Test conditions:

  • This is a unitary test. It is not designed to reflect an otherwise loaded or saturated environment. The conclusion will however shed some light on what’s happening under stress.
  • This test is specifically designed to show what happens when performing intense log file writes. The log file write size is usually 2 KB, but I have seen it go up to 5 KB.
  • The test is a simple dd if=/dev/zero of=<target special file> bs=<block size> count=<count>. Reading from /dev/zero ensures that no read waits occur (a small timing harness built around this command is sketched right after this list).
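Here is a hypothetical Python harness around that dd command (the device path and counts are placeholders, not values from the original tests); it derives the three columns reported in the tables below:

```python
# Hypothetical harness around the dd test above: time `count` sequential
# writes of `block_kb` KB to a raw device and derive I/O/s, average I/O
# time and MB/s (the three columns of the tables below).
import subprocess
import time

def dd_test(target, block_kb, count):
    start = time.time()
    subprocess.run(
        ["dd", "if=/dev/zero", f"of={target}",
         f"bs={block_kb}k", f"count={count}"],
        check=True, capture_output=True)
    elapsed = time.time() - start
    iops = count / elapsed
    return iops, 1000.0 / iops, iops * block_kb / 1024.0  # I/O/s, ms, MB/s

# Example with a placeholder raw device path:
# print(dd_test("/dev/rdsk/c2t0d1s2", block_kb=2, count=10000))
```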

Baseline: Local raw device on a DMX 1000
Throughput = 1 PowerPath link throughput × 2

Block size (KB)   I/O/s   I/O time (ms)   MB/s
2                 1516    0.66            3.0
5                 1350    0.74            6.6

Test 1: Distant raw device on a DMX
Throughput = 1 PowerPath link throughput × 2

Block size (KB)   I/O/s   I/O time (ms)   MB/s
2                 1370    0.73            2.7
5                 1281    0.78            6.3

The distance degradation is less than 10%. This is the I/O time and throughput I expect when I mirror the array volumes by VxVM or ASM.

Test 2: Local raw device on a DMX, SRDF mirrored
Throughput = 1 PowerPath link throughput × 2

Block size (KB)   I/O/s   I/O time (ms)   MB/s
2                 566     1.77            1.1
5                 562     1.78            2.7

This is where it gets interesting: SRDF more than doubles the I/O time and cuts the throughput by half or more.

 

Conclusion: when you need log file write performance in order to minimize the log file sync wait times, use a volume manager (including ASM) rather than SRDF. I believe this kind of result can also be expected with either EVA or XP Continuous Access. The SRDF mirrored I/Os are bound to be even more impacted by an increasing write load on the storage arrays, as mirroring is usually performed via dedicated ports, which bear the load of all of the writes sent to the storage array. This bottleneck does not exist for the VM replication.

 

 

June 14, 2007

Spotlight on Oracle replication options within a SAN (1/2)

Filed under: HP-UX,Oracle,Solaris,Storage — christianbilien @ 7:40 pm

Some interesting issues face the many sites wishing to implement database replication between two distant sites. One of the major decisions to be taken is HOW the replication will be performed; in other words, what are the options and their pros and cons? I’ll start with generalities and then present some unitary tests performed in a Solaris/ASM/VxVM/EMC DMX environment.

1. The initial consideration is synchronous vs. asynchronous replication.

Synchronous

  • Synchronous means that the I/O has to be posted on the remote site for the transaction to be validated. Array based replications, such as HP’s Continuous Access or EMC’s SRDF, will post the I/O from the local array cache to the remote one, then wait for the acknowledgment to come back before acknowledging the I/O to the calling program. The main component in the overall response time is the time it takes to write from the local cache to the remote cache and for the acknowledgment to come back. This latency is of course not felt by read accesses, but write time is heavily impacted (see the tests at the bottom of this post). The applications heavily waiting on “log file sync” events are the most sensitive to the synchronous write mechanism. I am preparing a post about the distance factor, i.e. how distance impacts response times.
  • Another aspect of synchronous replication is the bottleneck the replication will go through. Assuming a couple of 2 Gb/s replication ports, the replication bandwidth will be 4 Gb/s (roughly 400 MB/s). It will need to accommodate the whole storage array write throughput, thereby potentially increasing the latency because processors will be busier, I/Os will wait on array cache flushes and on other latches, etc. (a rough calculation is sketched right after this list).
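As a rough illustration (my own numbers, not from the post), here is how that bandwidth budget can be checked; the array write load figure is purely hypothetical:

```python
# Back-of-envelope check of replication port headroom (illustrative only).
# Assumes ~10 bits per byte on the wire (8b/10b), i.e. ~100 MB/s per Gb/s.
def replication_load(n_ports, port_gb_s, array_write_mb_s):
    usable_mb_s = n_ports * port_gb_s * 100.0   # replication bandwidth in MB/s
    return array_write_mb_s / usable_mb_s       # fraction of that bandwidth used

# Two 2 Gb/s ports (~400 MB/s) against a hypothetical 250 MB/s array write load:
print(f"{replication_load(2, 2, 250):.0%} of the replication bandwidth is used")
```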

Asynchronous

To preserve consistency, asynchronous replication must implement some sequence-stamping that ensures that write operations at the remote node occur in the correct order. Some data loss may thus occur with EMC SRDF/A (A stands for asynchronous) or HP’s CA asynchronous, but no data corruption should be experienced.

2. Host based vs. array based replication

Data Guard and volume managers (including ASM) can be used to mirror the database volumes from one array to the other.

Data Guard

Data Guard works over TCP/IP.

Pro:

  • IP links are common, relatively cheap and easy to set up.

Cons:

  • Synchronous replication over IP means QOS (Quality Of Service) procedures to avoid other services clogging the links.
  • The commits must wait for the writes in the remote log file. The remote database is asynchronously loaded from the remote log files. The more DML intensive the primary database is, the wider the potential gap.

Volume management

Volume management is the only available option for some geographical clusters. RAC over Sun Cluster, RAC over ASM without a 3rd party cluster, and MC/ServiceGuard with the Cluster File System do not offer any other alternative (take a look at “RAC geographical clusters and 3rd party clusters (HP-UX)” for a discussion of RAC on geo clusters).

ASM also acts as a volume manager here, as it is used to mirror from one storage array to the other.

Pro:

  • Fast (see the unitary tests). Volume manager mirroring also scales better in aggregate: with array based replication, all of the replicated writes go through a set of dedicated ports, which ends up bottlenecking some array processors while others are mostly idle, whereas VM writes are spread over all of the array processors. So both scalability and unitary write speed are in favor of volume manager mirroring.

Cons:

  • Harder to manage and to maintain. Say you configure ASM with a lot of raid groups. Assuming the ASM power limit is set to 0 to prevent the automatic rebuild of the mirrored raid group (because the rebuild would otherwise occur locally), you’ll have to add each newly created raid group to the rebuild script. Worse, you may forget it and realize one raid group is not mirrored the day the primary storage array fails. The most classic way to fail a cluster switchover is to forget to reference newly created file systems or tablespaces.
  • Usually works over Fiber Channel, although FC-IP can be used to extend the link distance.
  • No asynchronous replication except for the Veritas Volume Replicator which is to my knowledge the only VM able to perform async writes on a remote array.

Array based replication

Pro:

  • Usually easier to manage. The maintenance and switchover tasks may also be offloaded to the storage team. Host based replication management puts the ball either in the DBA camp (if using ASM) or in the sysadmins’ camp (for other VMs).
  • Asynchronous replication
  • Vendors offer remote monitoring
  • Snapshots can be made on the distant site for development, reporting or other purposes.

Cons:

  • Performance, as seen above.
  • Same Fiber Channel limitations as for host based replication.

 
