Christian Bilien’s Oracle performance and tuning blog

December 13, 2007

One of Pavlov’s dogs (2/2)

Filed under: Oracle,RAC,Solaris — christianbilien @ 9:15 pm

I didn't have much luck with the Pavlov's dogs challenge: not even a single attempt at an explanation. Maybe it was too weird a problem (or, more embarrassingly, it did not interest anyone!).

A quick reminder about the challenge: netstat shows almost no activity on an otherwise loaded interconnect (500Mb/s to 1Gb/s inbound and similar values outbound, as seen on the Infiniband switches and as calculated from the product of the 'PX remote messages recv'd/sent' statistics and parallel_execution_message_size).

Well, anyway, here is what I think is the answer: the key information I gave was that the clusterware was running RDS over Infiniband. Infiniband HCAs have an inherent advantage over standard Ethernet network interfaces: they embed RDMA, which means that transfers are handled without interrupting the CPUs. The sending node reads and writes directly into the receiving node's user-space memory, without going through the usual I/O path. TCP/IP NICs, on the other hand, generate a stream of interrupts the CPUs have to process, because TCP segments have to be reassembled while other threads are running.

The most likely cause of the netstat blindness is just that it cannot see the packets because the CPUs are unaware of them.

To quote the Wikipedia Pavlov’s dogs article, “the phrase “Pavlov’s dog” is often used to describe someone who merely reacts to a situation rather than use critical thinking”. That’s exactly what I thought of myself when I was trying to put the blame on the setup instead of thinking twice about the “obvious” way of measuring a network packet throughput.

December 9, 2007

One of Pavlov’s dogs (1/2)

Filed under: Oracle,RAC,Solaris — christianbilien @ 7:44 pm

That’s what I thought of myself after spending an hour trying to figure out what I had done wrong in my setup.

So here is a little challenge I derived from this experience:

1. I set up a 5-node 10gR2 RAC on a mix of 5 four-core Xeon and Opteron servers. The interconnect is RDS over Infiniband HCAs. The setup was made on OpenSolaris. I know, RAC is not supported on OpenSolaris, but we were advised by Sun Microsystems that the Infiniband layer was significantly faster on OpenSolaris than it was on Solaris 10. Some ftp tests indeed showed that even IPoIB was 20% faster on OpenSolaris than on Solaris 10. So I had to tweak the Oracle installer and could not use any of the Oracle GUIs, but I got it working.

2. select /*+ full(c) full(o) */ * from soe.product_descriptions c,soe.product_information o where c.product_id=o.product_id and c.product_id>300000;

was run simultaneously 10 times on each instance (this is incidentally a database generated by Swingbench). The DOP varies from 10 to 40, but it does not make much difference as far as our challenge is concerned.

I then plotted the ‘PX remote messages recv’d’ and ‘PX remote messages sent’.

Parallel_execution_message_size=8192

I could get the following peak throughput (5s interval) by multiplying parallel_execution_message_size by the PX remote messages figures:

PX remote messages recv'd (Mb/s)   PX remote messages sent (Mb/s)
792,82                             1077,22
1035,78                            756,21
697,01                             885,18
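For reference, the conversion from the PX message counters to Mb/s is straightforward. Below is a minimal Python sketch of it; the snapshot values are made up for illustration, only the formula (message delta x parallel_execution_message_size x 8 bits, divided by the interval) is the point.

# A minimal sketch (not the exact script used above): converts two snapshots of the
# 'PX remote messages recv'd' / 'PX remote messages sent' statistics into Mb/s,
# assuming parallel_execution_message_size = 8192 bytes and a 5 second interval.

PX_MSG_SIZE = 8192      # bytes, parallel_execution_message_size
INTERVAL = 5            # seconds between the two snapshots

def px_mbps(msgs_start, msgs_end, msg_size=PX_MSG_SIZE, interval=INTERVAL):
    """Return the average Mb/s corresponding to a PX message counter delta."""
    delta = msgs_end - msgs_start
    return delta * msg_size * 8 / (interval * 1_000_000.0)

# Hypothetical counter snapshots:
recvd = px_mbps(1_200_000, 1_260_500)   # 'PX remote messages recv'd'
sent  = px_mbps(2_000_000, 2_082_200)   # 'PX remote messages sent'
print(f"recv'd: {recvd:.0f} Mb/s, sent: {sent:.0f} Mb/s")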

I am not taking into account the GES and the GCS messages, nor did I count the cache fusion blocks; both were small anyway. The weird thing came when I tried to measure the corresponding HCA input/output packet rates with netstat:

Input Packets/s   Output Packets/s
69                73
39                35
58                6

Almost no traffic on the interconnect (the HCA MTU is 2044 bytes)!

Let’s check that the interconnect is running over the intended HCA:

netstat -ni

Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs CollisQueue
lo0 8232 127.0.0.0 127.0.0.1 32364353 0 32364353 0 0
ibd0 2044 10.0.0.0 10.0.0.1 27588368 0 26782527 0 0
nge1 1500 10.x.x.x 10.x.x.x 640710585 0 363462595 0 0

SQL> oradebug setmypid
SQL> oradebug ipc
SQL> oradebug tracefile_name

[…]
SSKGXPT 0x64362c8 flags SSKGXPT_READPENDING socket no 8 IP 10.0.0.1
[…]

Just to be sure:

SQL> select indx,picked_ksxpia,ip_ksxpia from x$ksxpia;
INDX PICKED_KSXPIA IP_KSXPIA
————————————————————-
0 CI 10.0.0.1

Question: what could possibly justify the discrepancy between the netstat output and the values collected from v$px_sesstat?

Hint: it is not a bug.

 

August 16, 2007

Workload characterization for the uninformed capacity planner

Filed under: HP-UX,Models and Methods,Oracle,Solaris,Storage — christianbilien @ 7:32 pm

Doug Burns initiated an interesting thread a while ago about user or application workloads, their meanings and the difficulties associated with their determination. But workload characterization is both essential and probably the hardest and most error-prone part of the whole forecasting process. Models that fail to validate (i.e. are not usable) most of the time fall into one of these categories:

  • The choice of characteristics and parameters is not relevant enough to describe the workloads and their variations
  • The analysis and reduction of performance data was incorrect
  • Data collection errors, misinterpretations, etc.

Unless you already know the business environment and the applications, or some previous workload characterization is already in place, you are facing a blank page. You can always try to do a smart workload partition along functional lines, but this effort is unfortunately often unrealistic and doomed to failure because of time constraints. So what can be done?

I find cluster analysis a good compromise between time to deliver and the business relevance of the resulting workloads. Caveat: this method ignores any data cache (storage array, Oracle and file system caches, etc.) and locks/latches or any other waits unrelated to resource waits.

A simple example will explain how it works:

Let's assume that we have a server with a single CPU and a single I/O path to a disk array. We'll represent each transaction running on our server by a pair of attributes: the service time each of these transactions requires from the two physical resources. In other words, each transaction will require in absolute terms a given number of seconds of presence on the disk array and another number of seconds on the CPU. We'll call a required service time a "demand on a service center" to avoid confusion. The sum of those two values would represent the response time on an otherwise empty system, assuming no interaction occurs with any other external factor. As soon as you start running concurrent transactions, you introduce on one hand waits on locks, latches, etc. and on the other hand queues on the resources: the sum of the demands is no longer the response time. Any transaction may of course visit each resource several times: the sum of the times spent using each service center will simply equal the demand.

Let us consider that we are able to collect the demands each single transaction j requires from our two resource centers. We'll name D_{j1} the CPU demand and D_{j2} the disk demand of transaction j. Transaction j can now be represented by a two-component workload: w_{j} = (D_{j1}, D_{j2}). Let's now start the collection: we'll collect over time every w_{j} that goes through the system. Below is a real 300-point collection on a Windows server. I cheated a little bit because there are four CPUs on this machine, but we'll just say a single queue represents the four CPUs.

[Figure: scatter plot of the 300 collected (CPU demand, disk demand) points]

The problem is now obvious: there is no natural grouping of transactions with similar requirements. Another attempt can be made using natural (Napierian) logarithms to distort the scales:

[Figure: the same collection plotted on logarithmic scales]

This is not good enough either to identify meaningful workloads.

The Minimum Spanning Tree (MST) method can be used to perform successive fusions of data until the wanted number of representative workloads is obtained. It begins by considering each workload point to be a cluster. Next, the two clusters with the minimum distance are fused into a single cluster. The process iterates until the final number of desired clusters is reached.

  • Distance: let's assume two workloads represented by w_{i} = (D_{i1}, D_{i2}, ..., D_{iK}) and w_{j} = (D_{j1}, D_{j2}, ..., D_{jK}). I moved from just two attributes per workload to K attributes, which correspond to service times at K service centers. The Euclidean distance between the two workloads is d = \sqrt{\sum_{n=1}^{K}(D_{in}-D_{jn})^{2}}.
  • Each cluster is represented at each iteration by its centroid whose parameter values are the means of the parameter values of all points in the cluster.

    Below is a 20-point reduction of the 300 initial points. In real life, thousands of points are used to smooth out outliers and average the transactions.

    [Figure: the 300 points reduced to 20 representative workload clusters]
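For completeness, here is a minimal Python sketch of the fusion procedure described above. It is not the tool used to produce the figures: it naively fuses the two clusters whose centroids are closest until the target number of representative workloads is reached, and the (CPU demand, disk demand) points are made up.

# Minimal MST-style fusion sketch: each workload is a tuple of demands in seconds.
from math import sqrt

def distance(w1, w2):
    return sqrt(sum((a - b) ** 2 for a, b in zip(w1, w2)))

def centroid(cluster):
    k = len(cluster[0])
    return tuple(sum(w[i] for w in cluster) / len(cluster) for i in range(k))

def mst_reduce(workloads, target):
    """Fuse the two closest clusters until only 'target' clusters remain."""
    clusters = [[w] for w in workloads]
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)       # fuse the two closest clusters
    return [centroid(c) for c in clusters]   # representative workloads

# Hypothetical (CPU demand, disk demand) pairs:
points = [(0.02, 0.15), (0.03, 0.14), (0.5, 0.1), (0.55, 0.12), (1.2, 2.1), (1.1, 2.3)]
print(mst_reduce(points, 3))   # 3 representative workloads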

July 2, 2007

Asynchronous checkpoints (db file parallel write waits) and the physics of distance

Filed under: HP-UX,Oracle,Solaris,Storage — christianbilien @ 5:15 pm

The first post ("Log file write time and the physics of distance") devoted to the physics of distance was targeting log file writes and "log file sync" waits. It assumed that:

  • The percentage of bandwidth occupied by all the applications sharing the pipe was negligible
  • No other I/O subsystem waits were occurring.
  • The application streams writes, i.e. it is able to issue an I/O as soon as the channel is open.

This set of assumptions is legitimate if indeed an application is "waiting" (i.e. not consuming CPU) on log file writes but not on any other I/O related event, and if the fraction of available bandwidth is large enough for a frame not to be delayed by other applications sharing the same pipe, such as an array replication.

Another common Oracle event is the checkpoint completion wait (db file parallel write). I'll try to explore in this post how the replication distance factor influences checkpoint durations. Streams of small transactions make the calling program synchronous with the log file writes, but checkpoint writes are much less critical by nature because they are asynchronous from the user program perspective: they only degrade response time when "db file parallel write" waits start to appear. The word "asynchronous" could be a source of confusion, but it is not here: the checkpoint I/Os are doubly asynchronous, because the I/Os are also asynchronous at the DBWR level.

1. Synchronous writes: relationship of I/O/s to throughput and percent bandwidth

We did some maths in figure 3 of "Log file write time and the physics of distance" aimed at calculating the time to complete a log write. Let's do the same with larger writes over a 50 km distance on a 2Gb/s FC link. We'll also add a couple of columns: the number of I/Os per second and the fraction of used bandwidth. 2Gb/s translates into 200MB/s because Fibre Channel uses 8b/10b encoding: each byte travels as 10 bits on the wire.

 

Figure 1: throughput and percent bandwidth as a function of the I/O size (synchronous writes)

I/O size (KB)   Time to load (ms)   Round trip latency (ms)   Overhead (ms)   Time to complete an I/O (ms)   IO/s   Throughput (MB/s)   Percent bandwidth
2               0,054               0,5                       0,6             1,154                          867    1,7                 0,8%
16              0,432               0,5                       0,6             1,532                          653    10,2                5,1%
32              0,864               0,5                       0,6             1,964                          509    15,9                8,0%
64              1,728               0,5                       0,6             2,828                          354    22,1                11,1%
128             3,456               0,5                       0,6             4,556                          219    27,4                13,7%
256             6,912               0,5                       0,6             8,012                          125    31,2                15,6%
512             13,824              0,5                       0,6             14,924                         67     33,5                16,8%
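The figures above can be reproduced with a few lines of Python. The assumptions are the ones of this post and the previous one: a load time of 0.027 ms per KB (the slope measured in "Log file write time and the physics of distance"), a 0.6 ms fixed overhead, a 0.5 ms round trip for 50 km and a 200MB/s nominal link.

# A short sketch reproducing the synchronous-write figures above.
SLOPE_MS_PER_KB = 0.027
OVERHEAD_MS = 0.6
ROUND_TRIP_MS = 2 * 50 * 0.005      # 2 x 50 km x 5 us/km
LINK_MBPS = 200.0                   # 2 Gb/s with 8b/10b encoding

def sync_write(io_kb):
    io_time_ms = SLOPE_MS_PER_KB * io_kb + ROUND_TRIP_MS + OVERHEAD_MS
    iops = 1000.0 / io_time_ms
    throughput = iops * io_kb / 1024.0          # MB/s
    return io_time_ms, iops, throughput, throughput / LINK_MBPS

for size in (2, 16, 32, 64, 128, 256, 512):
    t, iops, mbs, pct = sync_write(size)
    print(f"{size:4d} KB  {t:6.3f} ms  {iops:4.0f} IO/s  {mbs:5.1f} MB/s  {pct:5.1%}")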

So what change should we expect to the above results if we change from synchronous writes to asynchronous writes?

2. Asynchronous writes

Instead of firing one write at a time and waiting for completion before issuing the next one, we’ll stream writes one after the other, leaving no “gap” between consecutive writes.

Three new elements will influence the expected maximum number of I/O streams in the pipe:

  • Channel buffer-to-buffer credits
  • Number of outstanding I/Os (if any) the controller can support. This is 32, for example, for an HP EVA
  • Number of outstanding I/Os (if any) the system, or a SCSI target, can support. On HP-UX, the default number of I/Os that a single SCSI target will queue up for execution is for example 8; the maximum is 255.

Over 50 km, and knowing that the speed of light in fiber is about 5 microseconds per kilometer, the relationship between the I/O size and the packet length in the pipe is shown in figure 2:

Figure 2: relationship between the I/O size and the packet length in the Fibre Channel pipe

I/O size (kB)   Time to load (µs)   Packet length (km)
2               10,24               2
32              163,84              33
64              327,68              66
128             655,36              131
256             1310,72             262
512             2621,44             524

2KB writes require a capacity of 25 outstanding I/Os to fill the 50 km pipe, but only one I/O can be active at a time for streams of 128KB packets. Again, this statement only holds true if the "space" between frames is negligible.

Assuming a zero gap between 2KB frames, an observation post would see an I/O pass through every 10µs, which corresponds to 100,000 I/O/s. The replication link is then no longer the bottleneck: other limiting factors, such as the storage arrays and the hosts at both ends, will take precedence. However, only a single 128KB packet can be in the pipe at a given time: the next one has to wait for the previous one to complete. Sounds familiar, doesn't it? When the packet size exceeds the window size, replication distance gives no benefit to asynchronous I/O writes, because asynchronous writes behave synchronously.
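Here is a back-of-the-envelope sketch of the asynchronous case, using the same assumptions as figure 2 (200MB/s usable bandwidth, 5 µs/km, 50 km). It computes, for a given I/O size, the packet length in the fiber, the number of outstanding I/Os needed to fill the pipe (rounded up) and the maximum zero-gap streaming rate.

from math import ceil

LINK_MB_PER_S = 200.0       # 2 Gb/s FC, 8b/10b encoding
US_PER_KM = 5.0
DISTANCE_KM = 50

def pipe_fill(io_kb):
    load_us = io_kb * 1024 / LINK_MB_PER_S              # time to put the I/O on the wire (us)
    length_km = load_us / US_PER_KM                     # "length" of the packet in the fiber
    in_flight = max(1, ceil(DISTANCE_KM / length_km))   # outstanding I/Os needed to fill the pipe
    ios_per_s = 1_000_000 / load_us                     # zero-gap streaming rate
    return load_us, length_km, in_flight, ios_per_s

for size in (2, 32, 64, 128, 256, 512):
    load, km, n, rate = pipe_fill(size)
    print(f"{size:4d} KB: load {load:8.2f} us, {km:6.1f} km, "
          f"{n:3d} outstanding I/Os to fill the pipe, {rate:8.0f} IO/s max")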

 

June 26, 2007

Log file write time and the physics of distance

Filed under: HP-UX,Oracle,Solaris,Storage — christianbilien @ 7:46 pm

I already wrote a couple of notes about the replication options available when a production system spans different storage arrays (see "Spotlight on Oracle replication options within a SAN (1/2)" and "Spotlight on Oracle replication options within a SAN (2/2)").

These posts came from a real life experience, where both storage arrays were “intuitively” close enough to each other to ignore the distance factor. But what if the distance is increased? The trade-off seems obvious: the greater the distance, the lower the maximum performance. But what is the REAL distance factor? Not so bad in theory.

I'm still interested in the first place in synchronous writes, namely log file writes and the associated "log file sync" waits. I want to know how distance influences the log file write time when mirroring with a volume manager (HP-UX LVM, Symantec VxVM, Solaris Volume Manager or ASM). EMC SRDF and HP's Continuous Access (XP or EVA) synchronous writes could also be considered, but their protocol seems to need 2 round trips per host I/O. I'll leave this alone pending some more investigation.

The remote cache must in both cases acknowledge the I/O to the local site to allow the LGWR’s I/O to complete.

1. Load time and the zero distance I/O completion time.

Load time:

The speed of light in fiber is about 5 microseconds per kilometer, which means 200km costs 1ms one way. The load time is the time for a packet to completely pass any given point in a SAN. A wider pipe allows a packet to be delivered faster than a narrow pipe.

The load time can also be thought of as the length of the packet in kilometers: the greater the bandwidth, the smaller the packet length, and the smaller the packet load time. At 2Gb/s, a 2KB packet (the typical log write size) is about 2 km long, but it would be about 2600 km long on a slow 1.5Mb/s link.

Zero distance I/O completion time

The zero distance I/O completion time is made of two components:

  • A fixed overhead, commonly around 0.5 ms (the tests made in "Spotlight on Oracle replication options within a SAN (2/2)" and reproduced below in fig. 1 corroborate the fact that the I/O time on a local device only increases by 10% when the packet size more than doubles). This represents storage array processor time and any delay on the host ports for the smallest packet.
  • The load time, a linear function of the packet size.

At the end of the day, the zero distance I/O completion time is:

Slope x Packet size + Overhead

Here are the measurements I reported in the "Spotlight on Oracle replication options" post:

Figure 1: Measured I/O time as a function of the write size for log file writes

Write size (k)   I/O time (ms)
2                0,66
5                0,74

A basic calculation gives:

Slope = (0,74 - 0,66) / (5 - 2) ≈ 0,027 ms per KB
Overhead ≈ 0,6 ms
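The same fit, as a quick Python sketch built from the two measured points only; it approximately reproduces the figure 2 values below.

points = [(2, 0.66), (5, 0.74)]          # (write size in KB, measured I/O time in ms)
(s1, t1), (s2, t2) = points
slope = (t2 - t1) / (s2 - s1)            # ~0.027 ms per KB
overhead = t1 - slope * s1               # ~0.6 ms

for size in (2, 16, 32, 64, 128):        # compare with figure 2 below
    print(f"{size:3d} KB -> {slope * size + overhead:.2f} ms at zero distance")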

Figure 2: Effect of the frame size on zero distance I/O completion time

Frame size (k)   Zero distance I/O completion time (ms)
2                0,65
16               1,03
32               1,46
64               2,33
128              4,06

 

A small frame such as a log write will heavily depend upon the overhead, while the slope (which is inversely proportional to the throughput) dominates for large frames.

2. Synchronous I/O time

The transfer round trip (latency) is the last component of the time to complete a single I/O write over distance. It is equal to

2xDistance (km) x 5µsec/km
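A minimal sketch of the figure 3 values below: the round trip grows with distance while the per-write component (time to load + overhead, taken from the figure) stays constant.

TIME_TO_LOAD_MS = 0.654      # per-write component used in figure 3 for a 2 KB write
OVERHEAD_MS = 0.6

def log_write_ms(distance_km):
    round_trip_ms = 2 * distance_km * 0.005      # 5 us per km, one way
    return round_trip_ms + TIME_TO_LOAD_MS + OVERHEAD_MS

for km in range(10, 160, 10):
    print(f"{km:3d} km -> {log_write_ms(km):.3f} ms")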

Figure 3: Time to complete a 2K synchronous write (in ms)

Distance (km)   Round trip latency (ms)   Time to load (ms)   Overhead (ms)   Time to complete the log write (ms)
10              0,1                       0,654               0,6             1,354
20              0,2                       0,654               0,6             1,454
30              0,3                       0,654               0,6             1,554
40              0,4                       0,654               0,6             1,654
50              0,5                       0,654               0,6             1,754
60              0,6                       0,654               0,6             1,854
70              0,7                       0,654               0,6             1,954
80              0,8                       0,654               0,6             2,054
90              0,9                       0,654               0,6             2,154
100             1                         0,654               0,6             2,254
110             1,1                       0,654               0,6             2,354
120             1,2                       0,654               0,6             2,454
130             1,3                       0,654               0,6             2,554
140             1,4                       0,654               0,6             2,654
150             1,5                       0,654               0,6             2,754

This is quite interesting, as the log writes are only about twice as slow when you multiply the distance by 15.

June 19, 2007

Spotlight on Oracle replication options within a SAN (2/2)

Filed under: Oracle,Solaris,Storage — christianbilien @ 7:57 pm

This post is a follow up to “Spotlight on Oracle replication options within a SAN (1/2)”. This first post was about the available replication options.

I will address in this post a specific performance aspect I am very concerned about for one of my customers. This is an organization where many performance challenges come down to the commit wait time: the applications trade at the millisecond level, which translates into database log file syncs expressed in hundreds of microseconds. It is a basic DRP requirement that applications must be synchronously replicated over a 2.5 km (1.5 miles) Fibre Channel network between a local and a remote EMC DMX 1000 storage array. The multipathing software is PowerPath; the DMX 1000 volumes may be mirrored from the local array to the remote one by either VxVM, ASM or SRDF.

Two options may be considered:

  • Host based (Veritas VxVM, Solaris Disk Suite or ASM) replication
  • Synchronous SRDF replication

All options may not always be available, as RAC installations spanning the two sites will require host based replication. On the other hand, simple replication with no clustering may use either SRDF or a volume manager replication.

I made some unitary tests aimed at qualifying the SRDF protocol vs. a volume manager replication. Let us just recall that an SRDF mirrored I/O will go into the local storage array cache, and will be acknowledged to the calling program only when the remote cache has been updated. A VM mirror is no different in principle: the PowerPath policy dictates that both storage arrays must acknowledge the I/O before the calling program considers it completed.

Test conditions:

  • This is a unitary test. It is not designed to reflect an otherwise loaded or saturated environment. The conclusion will however shed some light on what’s happening under stress.
  • This test is specifically designed to show what’s happening when performing intense log file writes. The log file write size is usually 2k, but I saw it going up to 5k.
  • The test is a simple dd if=/dev/zero of=<target special file> bs=<block size> count=<count>. Reading from /dev/zero ensures that no read waits occur. (A timing sketch follows the list.)
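Here is a hedged Python sketch of such a test harness; the device path, block size and count are placeholders, and the three derived values are the columns reported in the tables below (I/O/s, average I/O time, MB/s).

import subprocess, time

def dd_test(target, bs_kb=2, count=10000):
    # Time a raw-device write test and derive I/O/s, ms per I/O and MB/s.
    cmd = ["dd", "if=/dev/zero", f"of={target}", f"bs={bs_kb}k", f"count={count}"]
    start = time.time()
    subprocess.run(cmd, check=True, capture_output=True)
    elapsed = time.time() - start
    iops = count / elapsed
    return iops, 1000.0 / iops, iops * bs_kb / 1024.0   # I/O/s, ms per I/O, MB/s

# Example (hypothetical raw device path):
# print(dd_test("/dev/rdsk/c2t0d1s2", bs_kb=2, count=10000))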

Baseline: local raw device on a DMX 1000
(Throughput = throughput of one PowerPath link x 2)

Block size (k)   I/O/s   I/O time (ms)   MB/s
2                1516    0,66            3,0
5                1350    0,74            6,6

Test 1: distant raw device on a DMX
(Throughput = throughput of one PowerPath link x 2)

Block size (k)   I/O/s   I/O time (ms)   MB/s
2                1370    0,73            2,7
5                1281    0,78            6,3

The distance degradation is less than 10%. This is the I/O time and throughput I expect when I mirror the array volumes by VxVM or ASM.

Test 2: local raw device on a DMX, SRDF mirrored
(Throughput = throughput of one PowerPath link x 2)

Block size (k)   I/O/s   I/O time (ms)   MB/s
2                566     1,77            1,1
5                562     1,78            2,7

This is where it gets interesting: SRDF will double the I/O time and halve the throughput.

 

Conclusion: when you need log file write performance in order to minimize log file sync wait times, use a volume manager (including ASM) rather than SRDF. I believe this kind of result can also be expected under the EVA or XP Continuous Access. The SRDF mirrored I/Os are even bound to be more impacted by an increasing write load on the storage arrays, as mirroring is usually performed via dedicated ports, which bear the load of all of the writes sent to the storage array. This bottleneck does not exist for the VM replication.

 

 

June 14, 2007

Spotlight on Oracle replication options within a SAN (1/2)

Filed under: HP-UX,Oracle,Solaris,Storage — christianbilien @ 7:40 pm

Some interesting issues face the many sites wishing to implement database replication between two distant sites. One of the major decisions to be taken is HOW the replication will be performed; in other words, what are the options and their pros and cons? I'll start with generalities and then present some unitary tests performed in a Solaris/ASM/VxVM/EMC DMX environment.

1. The initial consideration is synchronous vs. asynchronous replication.

Synchronous

  • Synchronous means that the I/O has to be posted on the remote site for the transaction to be validated. Array based replications, such as HP's Continuous Access or EMC's SRDF, will post the I/O from the local array cache to the remote one, then wait for the acknowledgment to come back before acknowledging the I/O to the calling program. The main component of the overall response time is the time it takes to write from the local cache to the remote cache and for the acknowledgment to come back. This latency is of course not felt by read accesses, but write time is heavily impacted (see the tests at the bottom of this post). The applications heavily waiting on "log file sync" events are the most sensitive to the synchronous write mechanism. I am preparing a post about the distance factor, i.e. how distance impacts response times.
  • Another aspect of synchronous replication is the bottleneck the replication will go through. Assuming a couple of 2Gb/s replication ports, the replication bandwidth will be 4Gb/s. It will need to accommodate the whole storage array write throughput, thereby potentially increasing the latency because processors will be busier, I/Os will wait on array cache flushes and on other latches, etc.

Asynchronous

To preserve consistency, asynchronous replication must implement some sequence-stamping that ensures that write operations at the remote node occur in the correct order. Data loss may thus occur with EMC SRDF/A (A stands for asynchronous) or HP's CA asynchronous, but no data corruption should be experienced.

2. Host based vs. array based replication

Data Guard and volume managers (including the ASM) can be used to mirror the data base volumes from one array to the other one.

Data Guard

Data Guard works over TCP/IP.

Pro:

  • IP links are common, relatively cheap and easy to set up.

Cons:

  • Synchronous replication over IP means QOS (Quality Of Service) procedures to avoid other services clogging the links.
  • The commits must wait for the writes in the remote log file. The remote database is asynchronously loaded from the remote log files. The more DML intensive the primary database is, the wider the potential gap.

Volume management

Volume management is the only available option for some geographical clusters. RAC over Sun Cluster, RAC over ASM without 3rd party clusters, and MC/ServiceGuard with the Cluster File System do not offer any other alternative (take a look at RAC geographical clusters and 3rd party clusters (HP-UX) for a discussion of RAC on geo clusters).

ASM is also a volume manager in this context, as it is used for mirroring from one storage array to the other.

Pro:

  • Fast (see the unitary tests). Volume managers also work best in aggregate: all of the storage array replicated writes go through a set of dedicated ports, which ends up bottlenecking on some array processors while others are mostly idle, whereas VM writes are spread over all the array processors. So both scalability and unitary write speed are in favor of volume management mirroring.

Cons:

  • Harder to manage and to maintain. Say that you want to configure an ASM setup with a lot of disk groups. Assuming ASM_POWER_LIMIT is set to 0 to prevent the automatic rebuild of a mirrored disk group (because the rebuild would otherwise occur locally), you'll have to add each newly created disk group to the rebuild script. Worse, you may forget it and realize one disk group is not mirrored the day the primary storage array fails. The most classic way to fail a cluster switchover is to forget to reference newly created file systems or tablespaces.
  • Usually works over Fiber Channel, although FC-IP can be used to extend the link distance.
  • No asynchronous replication except for the Veritas Volume Replicator which is to my knowledge the only VM able to perform async writes on a remote array.

Array based replication

Pro:

  • Usually easier to manage. The maintenance and switchover tasks may also be offloaded onto the storage team. Host based replication management puts the ball either in the DBA camp (if using ASM) or in the sysadmins' camp (for other VMs).
  • Asynchronous replication
  • Vendors offer remote monitoring
  • Snapshots can be made on the distant sites for development, report or other purposes.

Cons:

  • Performance as seen above.
  • Same limitations with the Fiber Channel.

 

May 25, 2007

Oracle ISM and DISM: more than a no paging scheme (2/2)… but be careful with Solaris 8

Filed under: Oracle,Solaris — christianbilien @ 9:39 pm

This post is the DISM follow-up to the ISM-only post Oracle ISM and DISM: more than a no paging scheme (1/2).

DISM (Dynamic Intimate Shared Memory) is the pageable variant of ISM. DISM was made available on Solaris 8. The DISM segment is attached to a process through the shmat system call. SHM_DYNAMIC is a new flag that tells shmat to create Dynamic ISM rather than the SHM_SHARE_MMU flag used for ISM.

DISM is like ISM except that it isn't automatically locked: the application, not the kernel, does the locking, using mlock. Kernel virtual-to-physical memory address translation structures are shared among processes that attach to the DISM segment. This is one of the DISM benefits: saving kernel memory and CPU time. As with ISM, shmget creates the segment. The shmget size specified is the maximum size of the segment, which can be larger than physical memory. Enough disk swap should be made available to cover the maximum possible DISM size.

Per the Oracle 10gR2 installation guide on Solaris platforms:

Oracle Database automatically selects ISM or DISM based on the following criteria:

  • Oracle Database uses DISM if it is available on the system, and if the value of the SGA_MAX_SIZE initialization parameter is larger than the size required for all SGA components combined. This enables Oracle Database to lock only the amount of physical memory that is used.
  • Oracle Database uses ISM if the entire shared memory segment is in use at startup or if the value of the SGA_MAX_SIZE parameter is equal to or smaller than the size required for all SGA components combined. 

I ran a few logical I/O intensive tests aimed at highlighting some possible performance loss when moving from ISM to DISM (as pages are not permanently locked in memory, swap management has to be invoked), but I couldn't find any meaningful difference. Most of the benefits I described in the Oracle ISM and DISM: more than a no paging scheme (1/2) post still apply, except for the non-support of large pages in Solaris 8 (see below).

Since DISM requires the application to lock memory, and since memory locking can only be carried out by applications with superuser privileges, the $ORACLE_HOME/bin/oradism daemon runs as root using setuid (early 9i releases had a different mechanism, using RBAC instead of setuid).

Solaris 8 problems:

Dynamic Intimate Shared Memory (DISM) was introduced in the 1/01 release of Solaris 8 (Update 3). DISM was supported by Oracle9i for SGA resizing.

On a 10gR2 database running on Solaris 10, it can be seen that large pages are used by DISM:

pmap -sx 19609| more

19609: oracleSID11 (LOCAL=NO)

Address Kbytes RSS Anon Locked Pgsz Mode Mapped File
0000000380000000 16384 16384 4M rwxs- [ dism shmid=0x70000071 ]
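The same check can be scripted. Below is a hedged Python sketch (not an official tool) that parses pmap -sx output for a given Oracle process and reports the ISM/DISM segments together with their page size.

import subprocess, sys

def shm_segments(pid):
    # Yield (address, kbytes, page size, segment description) for ism/dism mappings.
    out = subprocess.run(["pmap", "-sx", str(pid)], check=True,
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "ism shmid" in line or "dism shmid" in line:
            fields = line.split()
            address, kbytes = fields[0], fields[1]
            pgsz = next((f for f in fields if f.endswith(("K", "M"))), "?")
            yield address, kbytes, pgsz, line.split("[", 1)[-1].strip(" ]")

if __name__ == "__main__":
    for addr, kb, pgsz, seg in shm_segments(int(sys.argv[1])):
        print(f"{addr}  {kb} KB  pagesize {pgsz}  {seg}")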

Per the following Sun Solve note http://sunsolve.sun.com/search/document.do?assetkey=1-9-72952-1&searchclause=dism%2420large%2420page

“In this first release, large MMU pages were not supported. For Solaris 8 systems with 8GB of memory or less, it is reasonable to expect a performance degradation of up to 10% compared to ISM, due to the lack of large page support in DISM […] Sun recommends avoiding DISM on Solaris 8 either where SGAs are greater than 8 Gbytes in size, or on systems with a typical CPU utilization of 70% or more. In general, where performance is critical, DISM should be avoided on Solaris 8. As we will see, Solaris 9 Update 2 (the 12/02 release) is the appropriate choice for using DISM with systems of this type.”

http://www.sun.com/blueprints/0104/817-5209.pdf from Sun advocates the use of DISM on Solaris 8 primarily for machine maintenance, such as removing a memory board, but it fails to mention that large MMU pages are not supported.

May 14, 2007

Oracle ISM and DISM: more than a no paging scheme (1/2)

Filed under: Oracle,Solaris — christianbilien @ 12:54 pm

This post only deals with ISM. I'll write a second one about Dynamic ISM (DISM).

A long standing problem on any platform has been the probability that part of the Oracle memory segment gets swapped out, turning what is normally a fast memory access into a horrid bottleneck. Oracle 9i on Solaris made use of an interesting feature named Intimate Shared Memory (ISM), which in fact does a lot more than what one may initially think.

The very first benefit of ISM (not DISM for the time being) is that the shared memory is locked by the kernel when the segment is created: the memory cannot be paged out. A small price to pay for the locking mechanism is that sufficient available unlocked memory must exist for the allocation to succeed.

Because the SHM_SHARE_MMU flag is set in the shmat system call to set up the shared segment as ISM, there are lesser-known benefits, which may be of higher importance than the no-paging scheme on CPU-bound systems.

 

Shared kernel virtual-to-physical translation

The virtual-to-physical mapping is one of the most consuming tasks any modern operating system has to perform. The hardware Translation Lookaside Buffer (TLB) is a physical cache in front of the slower in-memory tables. The Translation Storage Buffer (TSB) is a further in-memory translation cache. As even in Solaris 10 the standard System V algorithm is still to have a private virtual address space for each process, aliasing occurs (several virtual addresses exist that map to the same physical address).

ISM allows the sharing of kernel virtual-to-physical memory between processes that attach to the shared memory, saving considerable translation slots in the hardware TLB. This can be monitored on Solaris 10 by trapstat:

# trapstat -T

cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim

———-+——————————-+——————————-+—-

512 u 8k| 1761 0.1 2841 0.2 | 2594 0.1 2648 0.2 | 0.5

512 u 64k| 0 0.0 0 0.0 | 8 0.0 0 0.0 | 0.0

512 u 512k| 0 0.0 0 0.0 | 0 0.0 0 0.0 | 0.0

512 u 4m| 20 0.0 1 0.0 | 4 0.0 0 0.0 | 0.0

512 u 32m| 0 0.0 0 0.0 | 11 0.0 0 0.0 | 0.0

512 u 256m| 0 0.0 0 0.0 | 0 0.0 0 0.0 | 0.0

trapstat shows both instruction and data misses in both the TLB and the TSB.

Solaris 8 does not have trapstat, so the trick is to use cpustat:

On a non-idle Oracle system using ISM as seen below,

mpstat 5 5

CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl

0 0 0 282 728 547 1842 283 329 62 10 3257 40 8 25 27

1 0 0 122 227 2 1954 284 327 55 9 3639 39 6 29 26

2 0 0 257 1578 1399 1887 288 330 58 9 3287 35 11 27 27

3 1 0 313 1758 1501 1933 285 328 70 12 3437 36 8 29 27

cpustat -c pic0=Cycle_cnt,pic1=DTLB_miss 1

time cpu event pic0 pic1

1.010 3 tick 192523799 29658

1.010 2 tick 270995815 28499

1.010 0 tick 225156772 29621

1.010 1 tick 234603152 29034

psrinfo -v

Status of processor 3 as of: 05/14/07 12:48:53
Processor has been on-line since
03/11/07 10:35:22.
The sparcv9 processor operates at 1062 MHz,
and has a sparcv9 floating point processor.

cpustat shows that on processor 3, we have 29658 dTLB misses in this one-second sample. UltraSPARC III will use somewhere between 50 cycles (most favourable case: the translation is found in the TSB) and 300 cycles (worst case: a memory load has to be performed to compute the translation) to handle a dTLB miss. It will take 1.5 million cycles per second in the best scenario and 8.9 million in the worst to handle the misses. At 1062 MHz, the time spent handling dTLB misses is only between 0.14% and 0.84%!
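The arithmetic, spelled out as a tiny Python sketch (29658 misses per second sampled on processor 3, 50 to 300 cycles per miss assumed for UltraSPARC III, 1062 MHz clock):

DTLB_MISSES_PER_S = 29658        # pic1 sample for processor 3
CPU_HZ = 1062e6                  # 1062 MHz

for cycles_per_miss in (50, 300):
    cycles = DTLB_MISSES_PER_S * cycles_per_miss
    print(f"{cycles_per_miss:3d} cycles/miss -> {cycles / 1e6:.1f} Mcycles/s "
          f"= {cycles / CPU_HZ:.2%} of the CPU")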

Large pages.

From Solaris 2.6 through Solaris 8, large pages are only available through the use of the ISM (using SHM_SHARE_MMU).

Solaris 8

pagesize
8192

Solaris 10: default pagesize

pagesize
8192

Supported page sizes:

pagesize -a
8192

65536

524288

4194304

ISM page size on Solaris 10 (look at the Pgsz column). It looks like Oracle is using the largest page size available:

 

pmap -sx 25921

25921: oracleSID1 (LOCAL=NO)

Address Kbytes RSS Anon Locked Pgsz Mode Mapped File

00000001064D2000 24 24 24 8K rwx– [ heap ]

00000001064D8000 32 8 – rwx– [ heap ]

0000000380000000 1048576 1048576 1048576 4M rwxsR [ ism shmid=0x6f000078 ]

AMD 64/x64. The AMD Opteron processor supports both 4Kbyte and 2Mbyte page sizes:

pagesize -a

4096

2097152

x86. The implementation of Solaris on x86 processors provides support for 4Kbyte pages only.

 

This post will be followed up by a discussion about DISM, the differences with ISM and a word of caution about using DISM on Solaris 8:
Oracle ISM and DISM: more than a no paging scheme…but be careful with Solaris 8 (2/2)

April 17, 2007

RAC geographical clusters and 3rd party clusters (Sun Solaris) (1/3)

Filed under: Oracle,RAC,Solaris — christianbilien @ 9:06 pm

As a word of introduction, a geographical RAC cluster is a RAC in which at least one node is physically located at a remote site, and DB access remains available should one of the sites go down.

I found that many customers wishing to implement a RAC geo cluster get confused by vendors when it comes to the RAC relationships (or should I say dependencies) with third party clusters. I also have the impression that some Oracle sales reps tend to add to this confusion by encouraging troubled prospects one way or another, depending on their particular interest with a 3rd party hardware/cluster provider.

Let's first say that I am here just addressing the RAC options. Assuming some other applications need clustering services, a third party cluster will be necessary (although some provisions, still in their infancy, exist within the CRS to "clusterize" non-RAC services). I'll also deliberately not discuss NAS storage, as I never had the opportunity to work on or even consider a RAC/NAS option (Pillar, NetApp, and a few others are trying to get into this market).

This first post is about RAC geo clusters on Solaris. RAC geo clusters on HP-UX will be covered here.

The Solaris compatibility matrix is located at https://metalink.oracle.com/metalink/plsql/f?p=140:1:2790593111784622179

I consider two cluster areas to be strongly impacted by the "third party cluster or not" choice: storage and membership strategy. Some may also argue about private interconnect protection against failure, but since IPMP may be used for the RAC-only option, and although some technical differences exist, I think that this is a matter of much less importance than storage and membership.

Storage:

  • 10gR1 was very special as it did not have any Oracle protection for the vote and ocr volumes. This lack of functionality had a big impact on geo clusters as some third party storage clustering was required for vote and OCR mirroring.
  • 10gR2: the options may not be the same for OCR/vote, database files, binaries and archivelog files. Although putting archivelog files on a clustered file system saves NFS mounts, binaries and archivelogs may usually be located on their own "local" file system, which may be on the array but only seen from one node. The real issues are on one hand the DB files, and on the other hand the OCR and voting disk, which are peculiar because they must be seen when the CRS starts, BEFORE the ASM or any Oracle-dependent process can be started.
  • RAC + Sun Cluster (SCS): the storage can either be a Solaris volume manager with raw devices, or QFS; GFS is not supported. ASM may be used but offers little in my opinion compared to a volume manager. ASM used for mirroring suffers from the mirror reconstruction that has to be performed when one of the sites is lost, and from the lack of any feature similar to copying only the modified blocks (the way storage array mirroring does).
  • RAC + Veritas Cluster Services (VCS): the Veritas cluster file system (the cluster version of VxFS), running over the Cluster Volume Manager (the cluster version of VxVM), is certainly a good solution for those averse to raw devices/ASM. All of the Oracle files, including OCR and vote, can be put on the CFS. This is because the CFS can be brought up before the CRS starts.
  • RAC without any third party cluster: ASM has to be used for storage mirroring. This is easier to manage and cheaper, although mirrored disk group reconstruction is a concern when volumes are high. I also like avoiding the coexistence of two cluster stacks (RAC on top of SCS or VCS).

Membership, split brain and amnesia

A number of membership issues are addressed differently by SCS/VCS and the CRS/CSS. It is beyond the scope of this post to explain fencing, split brain and amnesia. There are really two worlds here: on one hand, Oracle has a generic clusterware membership system across platforms, which avoids system and storage dependencies; on the other hand, VCS and SCS take advantage of SCSI persistent reservation ioctls. Veritas and Sun both advocate that Oracle's node eviction strategy may create situations in which a node would be evicted from the cluster but not yet forced to reboot. Other instances may then start instance recovery while the failed instance still writes to the shared storage. Oracle says that database corruption is prevented by using the voting disk, the network and the control file to determine when a remote node is down. This is done in different, parallel, independent ways. I am not going to enter the war on one side or another; let's just recall the basic strategies:

  • CSS: this process uses both the interconnects and the voting disks to monitor remote nodes. A node must be able to access strictly more than half of the voting disks at any time (this is the reason for the odd number of voting disks), which prevents split brain. The CSS misscount is 30s, which is the network heartbeat time allowance for not responding before eviction.
  • Both VCS and SCS use SCSI3 persistent reservation via ioctl, and I/O fencing to prevent corruption. Each node registers a key (it is the same for all the node paths). Once node membership is established, the registration keys of all the nodes that do not form part of the cluster are removed by the surviving nodes of the cluster. This blocks write access to the shared storage from evicted nodes.

One last bit: although not a mainstream technology (and it won't become one now that RDS over Infiniband is an option on Linux and soon on Solaris), I believe SCS is needed to allow RSM over SCI/SunFire Link to be used. The specs show quite an impressive latency of a few microseconds.

