Christian Bilien’s Oracle performance and tuning blog

November 17, 2007

A short, but (hopefully) interesting walk in the HP-UX magic garden

Filed under: HP-UX,Oracle — christianbilien @ 4:57 pm

While many in the Oracle blogosphere seemed to be part of a huge party in San Francisco, I was spending some time researching an LGWR scheduling issue on HP-UX for a customer. While working on it, I thought that a short walk in the magic garden might be of interest to some readers (Unix geeks will remember from the ’90s “The Magic Garden Explained: The Internals of UNIX System V Release 4”. It is amazingly still sold by Amazon, though only second-hand copies are available).

I already blogged on HP-UX scheduling: hpux_sched_noage and HP-UX load balancing on SMPs.

I work with q4, a debugger I use extensively when I teach HP-UX performance or HP-UX internals at HP. This first post on HP-UX internals will take us on a walk into the process and thread structures, where I’ll highlight the areas of interest to scheduling.

Let’s first copy the kernel to another file before preparing it for debugging:

$cp /stand/vmunix /stand/vmunix.q4
$q4pxdb /stand/vmunix.q4

q4 can then be invoked:

$q4 /dev/mem /stand/vmunix.q4

(or use ied to enable command recall and editing):

$ied -h $HOME/.q4_history q4 /dev/mem /stand/vmunix.q4

@(#) q4 $Revision: 11.X B.11.23l Wed Jun 23 18:05:11 PDT 2004$ 0
see help topic q4live regarding use of /dev/mem
Reading kernel symbols …
Reading debug information…
Reading data types …
Initialized ia64 address translator …
Initializing stack tracer for EM…
[…]

Then let’s load the whole proc structure list from the proc table. Unlike in older versions, nproc is not the size of a hash table but rather the maximum size of a linked list.

q4> load struct proc from proc_list max nproc next p_factp

Here are some fields of interest to the scheduling policy:

p_stat  SINUSE => proc entry in use
p_pri  0x298 => default priority
p_nice  0x14 => nice value
p_ticks   0xf8113a => Number of clock ticks charged to this process
p_schedpolicy  0x2 => Scheduling policy (real time rtsched or rtprio, etc.)

p_pri and p_nice are seen in ps, top, glance, etc.

p_firstthreadp  0xe0000001345d10c0  and p_lastthreadp  0xe0000001345d10c0 

are the pointers to the first and last threads of the process. They are identical here because the process is single-threaded.

We’ll just keep one process to watch:

q4> keep p_pid == 5252

and we can now load this process thread list:

q4> load struct kthread from kthread_list max nkthread next kt_factp

Here are some of the fields I know of related to scheduling or needed for accessing the next thread:

kt_link  0 => forward run/sleep queue link
kt_rlink  0 => backward run queue link
kt_procp  0xe00000012ffd2000 => a pointer to its proc  structure
kt_nextp  0 => next thread in the same process (no next thread here)
kt_prevp  0 => previous thread in the same process (no previous thread here)
kt_wchan  0xe0000001347d64a8 => sleep wait channel pointer
kt_stat  TSSLEEP => thread status. Not much to do!
kt_cpu  0 => CPU time accumulated by the running thread
kt_spu  0 => the SPU number of the run queue the thread is currently on
kt_spu_wanted  0 => the SPU it would like to be on (a context switch will happen if kt_spu <> kt_spu_wanted)
kt_schedpolicy  0x2 => scheduling policy for the thread
kt_ticksleft   0xa => number of ticks left before a voluntary timeslice will be requested

kt_usrpri  0x2b2
 and  kt_pri  0x29a

are equal while the thread is running in the timeshare scheduling mode. While the thread sleeps on priority kt_pri, kt_usrpri contains the recalculated priority. On wakeup, the value of kt_usrpri is copied to kt_pri.
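The kt_pri / kt_usrpri dance can be sketched as a toy model (an illustration only: the real recalculation formula is internal to HP-UX, and the class below is purely hypothetical):

```python
# Toy model of the kt_pri / kt_usrpri behavior described above.
# The actual priority recalculation formula is not public; this only
# models *when* each field changes, not *how* the value is computed.
class KThread:
    def __init__(self, pri):
        self.kt_pri = pri        # priority the thread currently sleeps/runs at
        self.kt_usrpri = pri     # recalculated priority while asleep

    def recalc_while_sleeping(self, new_pri):
        # While the thread sleeps on kt_pri, only kt_usrpri is updated
        self.kt_usrpri = new_pri

    def wakeup(self):
        # On wakeup, the recalculated priority becomes effective
        self.kt_pri = self.kt_usrpri

# The values observed in the q4 session above
t = KThread(0x29a)
t.recalc_while_sleeping(0x2b2)
t.wakeup()
print(hex(t.kt_pri))   # 0x2b2
```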


August 16, 2007

Workload characterization for the uninformed capacity planner

Filed under: HP-UX,Models and Methods,Oracle,Solaris,Storage — christianbilien @ 7:32 pm

Doug Burns initiated an interesting thread a while ago about user and application workloads, their meaning and the difficulties associated with their determination. But workload characterization is both essential and probably the hardest and most error-prone part of the whole forecasting process. Models that fail to validate (i.e. are not usable) most of the time fall into one of these categories:

  • The choice of characteristics and parameters is not relevant enough to describe the workloads and their variations
  • The analysis and reduction of performance data was incorrect
  • Data collection errors, misinterpretations, etc.

Unless you already know the business environment and the applications, or some previous workload characterization is already in place, you are facing a blank page. You can always try to do the smart workload partition along functional lines, but this effort is unfortunately often preposterous and doomed to failure because of time constraints. So what can be done?

I find clustering analysis a good compromise between time to deliver and fidelity to the business transactions. Caveat: this method ignores any data cache (storage array, Oracle and file system caches, etc.) and locks/latches or any other waits unrelated to resource usage.

A simple example will explain how it works:

Let’s assume that we have a server with a single CPU and a single I/O path to a disk array. We’ll represent each transaction running on our server by a couple of attributes: the service time each of these transactions requires from the two physical resources. In other words, each transaction will require in absolute terms a given number of seconds of presence on the disk array and another number of seconds on the CPU. We’ll call a required service time a “demand on a service center” to avoid confusion. The sum of those two values would represent the response time on an otherwise empty system, assuming no interaction occurs with any other external factor. As soon as you start running concurrent transactions, you introduce on one hand waits on locks, latches, etc. and on the other hand queues on the resources: the sum of the demands is no longer the response time. Any transaction may of course visit each resource several times: the sum of the times spent using each service center will simply equal the demand.

Let us consider that we are able to collect the demands each single transaction j requires from our two resource centers. We’ll name
{D}_{j1} the CPU demand and {D}_{j2} the disk demand of transaction j. Transaction j can now be represented by a two-component workload: {w}_{j}=({D}_{j1},{D}_{j2}). Let’s now start the collection. We’ll collect over time every {w}_{j} that goes on the system. Below is a real 300-point collection on a Windows server. I cheated a little because there are four CPUs on this machine, but we’ll just say a single queue represents the four CPUs.

[Figure T1: the 300 collected (CPU demand, disk demand) points]

The problem is now obvious: there is no natural grouping of transactions with similar requirements. Another attempt can be made using natural logarithms to distort the scales:

[Figure t2: the same collection plotted on logarithmic scales]

This is not good enough either to identify meaningful workloads.

The Minimum Spanning Tree (MST) method can be used to perform successive fusions of data until the wanted number of representative workloads is obtained. It begins by considering each workload point to be a cluster. Next, the two clusters with the minimum distance are fused to form one cluster. The process iterates until the final number of desired clusters is reached.

  • Distance: let’s assume two workloads represented by {w}_{i}=({D}_{i1},{D}_{i2},...,{D}_{iK}) and {w}_{j}=({D}_{j1},{D}_{j2},...,{D}_{jK}). I moved from just two attributes per workload to K attributes, which correspond to service times at K service centers. The Euclidean distance between the two workloads is d=\sqrt[]{\sum_{n=1}^{K}{({D}_{in}-{D}_{jn})}^{2}}.
  • Each cluster is represented at each iteration by its centroid whose parameter values are the means of the parameter values of all points in the cluster.

    Below is a 20-point reduction of the 300 initial points. In real life, thousands of points are used to avoid outliers and average the transactions.

    [Figure t3: the 20 representative workloads]
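The successive fusions described above can be sketched in a few lines (a naive agglomerative implementation using centroid distances; the toy demand values are made up, and a real collection would use thousands of points):

```python
import math

def centroid(cluster):
    # Mean of each demand component over the cluster's points
    k = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(k))

def distance(a, b):
    # Euclidean distance between two workloads (demand vectors)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def reduce_workloads(points, n_clusters):
    """Fuse the two closest clusters (by centroid distance) until
    n_clusters representative workloads remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)
    return [centroid(c) for c in clusters]

# Toy collection: (CPU demand, disk demand) pairs in seconds
w = [(0.1, 0.2), (0.12, 0.22), (1.0, 0.1), (1.1, 0.15), (0.5, 2.0)]
print(reduce_workloads(w, 3))
```

Each returned centroid is a representative workload whose components are the mean demands of the transactions it absorbed.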

July 2, 2007

Asynchronous checkpoints (db file parallel write waits) and the physics of distance

Filed under: HP-UX,Oracle,Solaris,Storage — christianbilien @ 5:15 pm

The first post (“Log file write time and the physics of distance”) devoted to the physics of distance targeted log file writes and “log file sync” waits. It assumed that:

  • The percentage of bandwidth occupied by all the applications which share the pipe was negligible
  • No other I/O subsystem waits were occurring.
  • The application streams writes, i.e. it is able to issue an I/O as soon as the channel is open.

This set of assumptions is legitimate if indeed an application is “waiting” (i.e. not consuming CPU) on log file writes but not on any other I/O related events, and if the fraction of available bandwidth is large enough for a frame not to be delayed by other applications which share the same pipe, such as an array replication.

Another common Oracle event is the checkpoint completion wait (db file parallel write). I’ll try to explore in this post how the replication distance factor influences checkpoint durations. Streams of small transactions make the calling program synchronously dependent on the logfile write, but checkpoint writes are much less critical by nature because they are asynchronous from the user program perspective. They only hurt response time when “db file parallel write” waits start to appear. The word “asynchronous” could be a source of confusion, but it is not here: the checkpoint I/Os are doubly asynchronous, because the I/Os are also asynchronous at the DBWR level.

1. Synchronous writes: relationship of I/O/s to throughput and percent bandwidth

We did some maths in figure 3 of “Log file write time and the physics of distance” aimed at calculating the time to complete a log write. Let’s do the same with larger writes over a 50 km distance on a 2 Gb/s FC link. We’ll also add a couple of columns: the number of I/Os per second and the fraction of used bandwidth. 2 Gb/s = 200 MB/s because FC’s 8b/10b encoding carries each byte as 10 bits.

 

Figure 1: throughput and percent bandwidth as a function of the I/O size (synchronous writes)

| I/O size (kB) | Time to load (ms) | Round trip latency (ms) | Overhead (ms) | Time to complete an I/O (ms) | I/O/s | Throughput (MB/s) | Percent bandwidth |
|---|---|---|---|---|---|---|---|
| 2 | 0,054 | 0,5 | 0,6 | 1,154 | 867 | 1,7 | 0,8% |
| 16 | 0,432 | 0,5 | 0,6 | 1,532 | 653 | 10,2 | 5,1% |
| 32 | 0,864 | 0,5 | 0,6 | 1,964 | 509 | 15,9 | 8,0% |
| 64 | 1,728 | 0,5 | 0,6 | 2,828 | 354 | 22,1 | 11,1% |
| 128 | 3,456 | 0,5 | 0,6 | 4,556 | 219 | 27,4 | 13,7% |
| 256 | 6,912 | 0,5 | 0,6 | 8,012 | 125 | 31,2 | 15,6% |
| 512 | 13,824 | 0,5 | 0,6 | 14,924 | 67 | 33,5 | 16,8% |
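The figure 1 numbers can be reproduced with a short sketch (Python for illustration only; the 0,027 ms/KB slope and 0,6 ms overhead are the measured constants from the “Log file write time and the physics of distance” post, and the 0,5 ms round trip corresponds to 50 km):

```python
# Reproduce figure 1: synchronous writes over 50 km on a 2 Gb/s FC link.
# Slope (ms per KB) and overhead (ms) are the measured values from the
# earlier "physics of distance" post; this is a model, not a measurement.
SLOPE_MS_PER_KB = 0.027
OVERHEAD_MS = 0.6
ROUND_TRIP_MS = 2 * 50 * 0.005   # 2 x 50 km x 5 us/km
LINK_MB_S = 200                  # 2 Gb/s with 8b/10b encoding

def sync_write(io_kb):
    load = SLOPE_MS_PER_KB * io_kb
    total = load + ROUND_TRIP_MS + OVERHEAD_MS      # ms per I/O
    ios = 1000 / total                              # I/Os per second
    mb_s = ios * io_kb / 1024                       # throughput
    return total, ios, mb_s, mb_s / LINK_MB_S       # + fraction of bandwidth

for size in (2, 16, 32, 64, 128, 256, 512):
    total, ios, mb_s, frac = sync_write(size)
    print(f"{size:4d} KB: {total:7.3f} ms, {ios:4.0f} I/O/s, "
          f"{mb_s:5.1f} MB/s, {100 * frac:4.1f}%")
```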

So what change should we expect to the above results if we change from synchronous writes to asynchronous writes?

2. Asynchronous writes

Instead of firing one write at a time and waiting for completion before issuing the next one, we’ll stream writes one after the other, leaving no “gap” between consecutive writes.

Three new elements will influence the expected maximum number of I/O streams in the pipe:

  • Channel buffer-to-buffer credits
  • Number of outstanding I/Os (if any) the controller can support: 32, for example, on an HP EVA
  • Number of outstanding I/Os (if any) the system, or a SCSI target, can support. On HP-UX, the default number of I/Os that a single SCSI target will queue up for execution is 8; the maximum is 255.

Over 50 km, and knowing that the speed of light in fiber is about 5 microseconds per kilometer, the relationship between the I/O size and the packet length in the pipe is shown in figure 2:

Figure 2: relationship between the I/O size and the packet length in the fibre channel pipe

| I/O size (kB) | Time to load (µs) | Packet length (km) |
|---|---|---|
| 2 | 10,24 | 2 |
| 32 | 163,84 | 33 |
| 64 | 327,68 | 66 |
| 128 | 655,36 | 131 |
| 256 | 1310,72 | 262 |
| 512 | 2621,44 | 524 |

Filling the 50 km pipe with 2KB writes requires a capacity of 25 outstanding I/Os, but only one I/O can be active for 128KB packet streams. Again, this statement only holds true if the “space” between frames is negligible.

Assuming a zero gap between 2KB frames, an observation post would see an I/O pass by every 10 µs, which corresponds to 100,000 I/O/s. Replication is then no longer the bottleneck, as other limiting factors such as the storage array and the computers at both ends will take precedence. However, a single 128KB packet will be in the pipe at any given time: the next has to wait for the previous one to complete. Sounds familiar, doesn’t it? When the packet length exceeds the link distance, replication won’t give any benefit to asynchronous writes, because asynchronous writes behave synchronously.
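The pipe-filling arithmetic can be sketched as follows (assumptions: 8b/10b encoding, 5 µs/km in fiber, zero inter-frame gap, and no controller or SCSI queue-depth limit):

```python
import math

# Figure 2 revisited: packet length in the fiber, and the number of
# outstanding I/Os needed to keep a 50 km, 2 Gb/s pipe full.
LINK_KB_PER_US = 0.2     # 2 Gb/s ~ 200 MB/s after 8b/10b encoding
KM_PER_US = 0.2          # speed of light in fiber: 5 us/km

def packet_length_km(io_kb):
    load_us = io_kb * 1.024 / LINK_KB_PER_US   # time to load, in us
    return load_us * KM_PER_US

def streams_to_fill(io_kb, distance_km=50):
    # At least one I/O is always in flight; more are needed only when
    # the packet is shorter than the link.
    return max(1, math.ceil(distance_km / packet_length_km(io_kb)))

print(packet_length_km(2))        # ~2 km
print(streams_to_fill(2))         # 25 outstanding 2 KB I/Os
print(streams_to_fill(128))       # a single 128 KB I/O fills the pipe
```

In practice the buffer-to-buffer credits and the controller and SCSI queue depths listed above cap the number of streams well before the arithmetic does.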

 

June 26, 2007

Log file write time and the physics of distance

Filed under: HP-UX,Oracle,Solaris,Storage — christianbilien @ 7:46 pm

I already wrote a couple of notes about the replication options available when a production system is spread over different storage arrays (see “Spotlight on Oracle replication options within a SAN (1/2)” and “Spotlight on Oracle replication options within a SAN (2/2)”).

These posts came from a real life experience, where both storage arrays were “intuitively” close enough to each other to ignore the distance factor. But what if the distance is increased? The trade-off seems obvious: the greater the distance, the lower the maximum performance. But what is the REAL distance factor? Not so bad in theory.

My primary interest is still synchronous writes, namely log file writes and the associated “log file sync” waits. I want to know how distance influences the log file write time in a volume manager (HP-UX LVM, Symantec VxVM, Solaris VM or ASM) mirroring configuration. EMC SRDF and HP’s Continuous Access (XP or EVA) synchronous writes could also be considered, but their protocol seems to need two round trips per host I/O. I’ll leave this alone pending some more investigation.

The remote cache must in both cases acknowledge the I/O to the local site to allow the LGWR’s I/O to complete.

1. Load time and the zero distance I/O completion time.

Load time:

The speed of light in fiber is about 5 microseconds per kilometer, which means 200km costs 1ms one way. The load time is the time for a packet to completely pass any given point in a SAN. A wider pipe allows a packet to be delivered faster than a narrow pipe.

The load time can also be thought of as the length of the packet in kilometers: the greater the bandwidth, the smaller the packet length, and the smaller the packet load time. At 2 Gb/s, a 2KB packet (the typical log write size) is about 2 km long, but it would be 2600 km long on a 1.5 Mb/s slow link.

Zero distance I/O completion time

The zero distance I/O completion time is made of two components:

  • A fixed overhead, commonly around 0.5 ms (the tests made in “Spotlight on Oracle replication options within a SAN (2/2)” and reproduced below in fig. 1 corroborate the fact that the I/O time on a local device only increases by 10% when the packet size more than doubles). This represents storage array processor time and any delay on the host ports for the smallest packet.
  • The load time, a linear function of the packet size.

At the end of the day, the zero distance I/O completion time is:

Slope x Packet size + overhead

Here is one of the measurements I reported in the “Spotlight on Oracle replication post” :

Figure 1 : Measured I/O time as a function of the write size for log file writes

| Write size (kB) | I/O time (ms) |
|---|---|
| 2 | 0,66 |
| 5 | 0,74 |

A basic calculation gives:

Slope = (0,74-0,66)/(5-2) = 0,027 ms/KB
Overhead = 0,66 - 2 x 0,027 ≈ 0,6 ms
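As a sketch, the fit above and the figure 2 values below it can be reproduced in a few lines (Python for illustration; the 0,027 and 0,6 defaults are the post’s rounded constants):

```python
# Fit the zero-distance model (I/O time = slope * size + overhead) from
# the two measured points of figure 1, then apply the post's rounded
# constants (0,027 ms/KB and 0,6 ms) to reproduce figure 2.
measured = {2: 0.66, 5: 0.74}          # write size (KB) -> I/O time (ms)

fitted_slope = (measured[5] - measured[2]) / (5 - 2)   # ~0.027 ms per KB
fitted_overhead = measured[2] - fitted_slope * 2       # ~0.6 ms

def zero_distance_ms(frame_kb, slope=0.027, overhead=0.6):
    # Zero distance I/O completion time for a given frame size
    return slope * frame_kb + overhead

for kb in (2, 16, 32, 64, 128):
    print(kb, round(zero_distance_ms(kb), 2))
```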

Figure 2: Effect of the frame size on the zero distance I/O completion time

| Frame size (kB) | Zero distance I/O completion time (ms) |
|---|---|
| 2 | 0,65 |
| 16 | 1,03 |
| 32 | 1,46 |
| 64 | 2,33 |
| 128 | 4,06 |

A small frame such as a log write heavily depends upon the overhead, while the load time term (proportional to the frame size) is predominant for large frames.

2. Synchronous I/O time

The transfer round trip (latency) is the last component of the time to complete a single I/O write over distance. It is equal to

2 x Distance (km) x 5 µs/km

Figure 3: Time to complete a 2K synchronous write (in ms)

| Distance (km) | Round trip latency (ms) | Time to load (ms) | Overhead (ms) | Time to complete the log write (ms) |
|---|---|---|---|---|
| 10 | 0,1 | 0,654 | 0,6 | 1,354 |
| 20 | 0,2 | 0,654 | 0,6 | 1,454 |
| 30 | 0,3 | 0,654 | 0,6 | 1,554 |
| 40 | 0,4 | 0,654 | 0,6 | 1,654 |
| 50 | 0,5 | 0,654 | 0,6 | 1,754 |
| 60 | 0,6 | 0,654 | 0,6 | 1,854 |
| 70 | 0,7 | 0,654 | 0,6 | 1,954 |
| 80 | 0,8 | 0,654 | 0,6 | 2,054 |
| 90 | 0,9 | 0,654 | 0,6 | 2,154 |
| 100 | 1 | 0,654 | 0,6 | 2,254 |
| 110 | 1,1 | 0,654 | 0,6 | 2,354 |
| 120 | 1,2 | 0,654 | 0,6 | 2,454 |
| 130 | 1,3 | 0,654 | 0,6 | 2,554 |
| 140 | 1,4 | 0,654 | 0,6 | 2,654 |
| 150 | 1,5 | 0,654 | 0,6 | 2,754 |

This is quite interesting, as the log writes are only about twice as slow when you multiply the distance by 15.
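Figure 3 can be condensed into a one-line function (a minimal sketch using the post’s constants for a 2 KB log write):

```python
# Figure 3 as a function: time to complete a 2 KB synchronous log write,
# using the post's figure 3 constants (load time and overhead in ms).
LOAD_MS = 0.654      # time to load a 2 KB write
OVERHEAD_MS = 0.6    # fixed storage array / host port overhead
US_PER_KM = 5        # speed of light in fiber

def log_write_ms(distance_km):
    round_trip = 2 * distance_km * US_PER_KM / 1000   # latency, in ms
    return round_trip + LOAD_MS + OVERHEAD_MS

print(log_write_ms(10))    # ~1.354 ms
print(log_write_ms(150))   # ~2.754 ms: only ~2x slower at 15x the distance
```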

June 14, 2007

Spotlight on Oracle replication options within a SAN (1/2)

Filed under: HP-UX,Oracle,Solaris,Storage — christianbilien @ 7:40 pm

Some interesting issues face the many sites wishing to implement replication for databases between two distant sites. One of the major decisions to be taken is HOW the replication will be performed; in other words, what are the options and their pros and cons? I’ll start with generalities and then present some unitary tests performed in a Solaris/ASM/VxVM/EMC DMX environment.

1. The initial consideration is synchronous vs. asynchronous replication.

Synchronous

  • Synchronous means that the I/O has to be posted on the remote site for the transaction to be validated. Array based replications, such as HP’s Continuous Access or EMC’s SRDF, will post the I/O from the local array cache to the remote one, then wait for the ack to come back before acknowledging the I/O to the calling program. The main component in the overall response time is the time it takes to write from the local cache to the remote cache and for the acknowledgment to come back. This latency is of course not felt by read accesses, but write time is heavily impacted (see the tests at the bottom of this post). Applications heavily waiting on “log file sync” events are the most sensitive to the synchronous write mechanism. I am preparing a post about the distance factor, i.e. how distance impacts response times.
  • Another aspect of synchronous replication is the bottleneck the replication will go through. Assuming a couple of 2 Gb/s replication ports, the replication bandwidth will be 4 Gb/s. It will need to accommodate the whole storage array write throughput, thereby potentially increasing the latency because processors will be busier, I/Os will wait on array cache flushes and on other latches, etc.

Asynchronous

To preserve consistency, asynchronous replication must implement some sequence-stamping that ensures that write operations at the remote node occur in the correct order. Data loss may thus occur with EMC SRDF/A (A stands for asynchronous) or HP’s CA asynchronous, but no data corruption should be experienced.

2. Host based vs. array based replication

Data Guard and volume managers (including the ASM) can be used to mirror the data base volumes from one array to the other one.

Data Guard

Data Guard works over TCP/IP.

Pro:

  • IP links are common, relatively cheap and easy to set up.

Cons:

  • Synchronous replication over IP means QOS (Quality Of Service) procedures to avoid other services clogging the links.
  • The commits must wait for the writes in the remote log file. The remote data base is asynchronously loaded from the remote log files. The more DML intensive the primary data base is, the wider the potential gap.

Volume management

Volume management is the only available option for some geographical clusters. RAC over Sun Cluster, RAC over ASM without 3rd party clusters, and MC/ServiceGuard with the Cluster File System do not offer any other alternative (take a look at “RAC geographical clusters and 3rd party clusters (HP-UX)” for a discussion of RAC on geo clusters).

ASM is also a volume manager when it is used for mirroring from one storage array to the other.

Pro:

  • Fast (see the unitary tests). VM mirroring also works best in aggregate: all of the storage array replicated writes go through a set of dedicated ports, which end up bottlenecking on some array processors while others are mostly idle, whereas VM writes are spread over all the array processors. So both scalability and unitary write speed are in favor of volume management mirroring.

Cons:

  • Harder to manage and to maintain. Say you want to configure an ASM with a lot of raid groups. Assuming power_limit is set to 0 to prevent the automatic rebuild of a mirrored raid group (because the rebuild would otherwise occur locally), you’ll have to add each newly created raid group to the rebuild script. Worse, you may forget it and realize one raid group is not mirrored the day the primary storage array fails. The most classic way to fail a cluster switchover is to forget to reference newly created file systems or tablespaces.
  • Usually works over Fiber Channel, although FC-IP can be used to extend the link distance.
  • No asynchronous replication except for the Veritas Volume Replicator which is to my knowledge the only VM able to perform async writes on a remote array.

Array based replication

Pro:

  • Usually easier to manage. The maintenance and switchover tasks may also be offloaded onto the storage team. Host based replication management puts the ball either in the DBA camp (if using ASM) or in the sys admins’ (for other VMs).
  • Asynchronous replication
  • Vendors offer remote monitoring
  • Snapshots can be made on the distant sites for development, report or other purposes.

Cons:

  • Performance as seen above.
  • Same limitations with the Fiber Channel.

 

April 19, 2007

RAC geographical clusters and 3rd party clusters (HP-UX) (2/3)

Filed under: HP-UX,Oracle,RAC — christianbilien @ 12:38 pm

This is the second post in a series of 3 related to geographical clusters (the first post focused on Solaris RAC; the last will be about cluster features such as fencing, quorum, cluster lock, etc.). It would be beneficial to read the first geo cluster post, which aimed at Solaris, as many issues common to HP-UX and Solaris will not be rehashed.

Introduction

Solaris: to make my point clear, I think that the two major differences between the group of generic clusters (Sun Cluster Services / Veritas Cluster Services) and the Oracle CRS clusterware are storage and membership. Their natures are however not the same: the storage chapter is essentially related to available options, ASM being the sole mirroring option in CRS-only clusters. This may be a major drawback for large databases. Membership is a different story: to put it simply, Sun says that SCSI3-PR (persistent reservation) closes the door to database corruptions in case of node eviction.

HP-UX: The story is not much different on HP-UX, although I would add one major feature Service Guard has that does not exist on Solaris systems: the concept of a tie-breaking node.

HP-UX RAC storage options

As with Solaris, we have Service Guard, the generic cluster software from the hardware provider, Veritas Cluster Services and the Oracle clusterware alone.

10gR1: The lack of mirroring for the OCR and voting disks makes a third party cluster almost compulsory (“almost” is a word of caution because SAN mirroring technologies, based for example on virtualized LUNs, might be used).

10gR2:

  • RAC+ Service Guard (with Service Guard Extension for RAC):
  1. SLVM, the shared volume option of the legacy HP-UX LVM, had been the only option since the days of Oracle Parallel Server. It only allows raw access to data files; hence archivelogs must be on “local” file systems (non-shared volumes). A number of limitations exist, such as having to bring down all nodes but one for Volume Group maintenance.
  2. Since HP-UX 11iv2 and Service Guard 11.16, and after announcing for 2 years a port of the acclaimed TruCluster Advanced File System, HP finally closed an agreement with Veritas for the integration of the Veritas Cluster File System (CFS) and Cluster Volume Manager (CVM) into Service Guard. This is essentially supported by some extra SG commands (cfsdgadm and cfsmntadm), but does not provide much more significant functionality on top of the Veritas Storage Foundation for HA storage options. All of the Oracle files, including OCR and vote, can be put on the CFS. This is because the CFS can be brought up before the CRS starts.
  • RAC + Veritas Cluster Services (VCS): The clusterware is Veritas based, but as you can guess, CFS and CVM are used in this option.
  • RAC without any third party cluster: As on Solaris, the only storage option with the Oracle clusterware alone is ASM, which conveys for large databases the same mirroring problems as found on Solaris.

Membership, split brain and amnesia

Split brain and tie breaking

Just like Sun Cluster Services and the Oracle clusterware, Service Guard has a tiebreaker cluster lock to avoid a split into two subclusters of equal size, a situation bound to arise when network connectivity is completely lost between some of the nodes. This tiebreaker is storage based (either a physical disk for SCS, an LVM volume for SG or a file for the CRS). However, geo clusters made of two physical sites make this tiebreaker configuration impossible: on which site do you put the extra volume, disk or file? A failure of the site chosen for 2 of the 3 volumes will cause the loss of the entire cluster. A third site HAS to be used. Assuming Fibre Channel is not extended to the 3rd site, an iSCSI or NFS configuration (on a supported NFS provider such as NetApp) may be the most convenient option if an IP link exists.

With the Quorum Server, Service Guard can provide a tie-breaking service: this can be done either as a stand-alone Quorum Server or as a Quorum Service package running on a cluster which is external to the cluster for which the quorum service is being provided. Although the 3rd site is still needed, a stand-alone server reachable via IP will resolve the split brain. The CRS is nonetheless still a cause of concern: the voting disk on the 3rd site must be seen by the surviving nodes. The Quorum Server may be a cluster service which belongs to a second SG cluster, but its CFS storage is not accessible from the first cluster. At the end of the day, although the Quorum Server is a nice SG feature, it cannot solve the problem of the CRS voting disks, which must still be accessed through iSCSI or NFS.

I/O Fencing

The same paragraph I wrote for Solaris can be copied and pasted:

Both VCS and SG use SCSI3 persistent reservation via ioctl, and I/O fencing to prevent corruption. Each node registers a key (it is the same for all the node paths). Once node membership is established, the registration keys of all the nodes that do not form part of the cluster are removed by the surviving nodes of the cluster. This blocks write access to the shared storage from evicted nodes.

April 2, 2007

No age scheduling for Oracle on HP-UX: HPUX_SCHED_NOAGE (2/2)

Filed under: HP-UX,Oracle — christianbilien @ 8:31 pm

The first post on this topic presented the normal HP-UX timeshare scheduling, under which all processes run by default. Its main objective was to give the rules for context switching (threads releasing CPUs and getting a CPU back) before comparing it to the no age timeshare scheduling.

The decaying priority scheme has not been altered since the very early HP-UX workstations, at a time when I assume databases were a remote if not non-existent consideration to the HP-UX designers.

As a thread gets weaker because of a decreasing priority, its probability of being switched out will increase. If this process holds a critical resource (an Oracle latch, not even mentioning the log writer), this context switch will cause resource contention should other threads wait for the same resource.

A context switch (you can see them in the pswch/s column of sar -w) is a rather expensive task: the system scheduler copies all the switched-out process registers into a private area known as the UAREA. The reverse operation needs to be performed for the incoming thread: its UAREA will be copied into the CPU registers. Another problem will of course occur when CPU migrations also happen on top of thread switches: modified cache lines need to be invalidated and read lines reloaded on the new CPU. I once read that a context switch costs around 10 microseconds (I did not verify it myself), which is far from negligible. The HP-UX internals manual mentions 10,000 CPU cycles, which would indeed translate into 10 microseconds on a 1 GHz CPU.

So when does a context switch occur? HP-UX considers two cases: forced and voluntary context switches. Within the Oracle context, voluntary context switches are most of the time due to disk I/Os. A forced context switch will occur at the end of a timeslice when a thread with an equal or higher priority is runnable, or when a process returns from a system call or trap and a higher-priority thread is runnable.

Oradebug helps diagnose the LGWR forced and voluntary switches:

SQL> oradebug setospid 2022
Oracle pid: 6, Unix process pid: 2022, image: oracle@mymachine12 (LGWR)
SQL> oradebug unlimit
Statement processed.
SQL> oradebug procstat
Statement processed.
SQL> oradebug tracefile_name
/oracle/product/10.2.0/admin/MYDB/bdump/mydb_lgwr_2022.trc
SQL> !more /oracle/product/10.2.0/admin/MYDB/bdump/mydb_lgwr_2022.trc

Voluntary context switches = 272
Involuntary context switches = 167
….


The sched noage policy was added in HP-UX 11i: it is part of the 178-255 range (the same range as the normal time share scheduler user priorities), but a thread running under this scheduler will not experience any priority decay: its priority will remain constant over time although the thread may be timesliced. Some expected benefits:

  • Less context switch overhead (lower CPU utilization)
  • CPU-hungry transactions may complete faster on a CPU-busy system (the less hungry transactions may run slower, but as they do not spend much time on the CPUs, it may not be noticeable).
  • May help resolve an LGWR problem: the log writer inherently experiences a lot of voluntary context switches, which helps it keep a high priority. However, that may not be enough on CPU-busy systems.

As the root user, give the RTSCHED and RTPRIO privileges to the dba group: setprivgrp dba RTSCHED RTPRIO. Create the /etc/privgroup file, if it does not exist, and add the following line to it: dba RTSCHED RTPRIO. Finally, update the spfile/init with hpux_sched_noage set to a value between 178 and 255. See rtsched(1).

March 31, 2007

No age scheduling for Oracle on HP-UX: HPUX_SCHED_NOAGE (1/2)

Filed under: HP-UX,Oracle — christianbilien @ 9:00 pm

The HPUX_SCHED_NOAGE initialization parameter has been around for some time, but in most of the situations where I found it would be useful, sys admins and DBAs alike were a bit scared (to say the least) to use it. However, I cannot see a real threat in using this parameter: some degradation may occur if it is set inappropriately, but not the outages and side effects associated with real time. Here are a couple of notes to explain this parameter’s usefulness. This first post will present time share generalities; the second one will be more specific about the parameter itself.

HP-UX scheduling policies can be classified using different criteria. One of them is priority range. Basically, processes can belong to one of 3 priority ranges: rtprio and rtsched are real-time; HP-UX time share may be either real time or not. I am only interested here in the latter: HPUX_SCHED_NOAGE only interacts with the time share mechanism, not with the POSIX (rtsched) or rtprio (the oldest HP-UX real time system) schedulers.

Time-share scheduling

This is the default HP-UX scheduler: its priority range goes from 128 (highest) to 255 (lowest), divided between system (128-177) and user (178-255) priorities. I’ll start by explaining what happens to processes when HPUX_SCHED_NOAGE is not set, in order to build the case for using it.

1. Normal time-share scheduling (HPUX_SCHED_NOAGE unset)

Each CPU has a run queue, sometimes also called the dispatch queue (each queue length can be seen with sar -qM). Let’s start with a uniprocessor: processes using normal time-share scheduling experience priority changes over time.

CPU time is accumulated for the running thread (interestingly, this can lead to a “phantom” thread: a transaction using less than 10 ms (one tick) of CPU time has a low probability of any priority decay). Every 40 ms (4 ticks), the priority of the running thread is lowered. The priority degradation is a function of the CPU time accumulated during those 40 ms (I do not know the exact formula, but it is linear). Finally, a thread which has been running for TIMESLICE (default 10) x 10 ms will be forced out (a forced context switch, as seen in Glance). TIMESLICE is a tunable HP-UX kernel parameter, so with the defaults a forced context switch occurs after a 100 ms quantum. The thread waiting in the run queue with the highest priority is then dispatched. Threads waiting (for a semaphore, an I/O, or any other reason) regain their priority exponentially (I do not know that formula either).
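The arithmetic above can be sketched as a toy model. This is only an illustration: the 10 ms tick, the 40 ms decay interval and the default TIMESLICE come from the text, but the decay constant `k` and the `decayed_priority` helper are my own assumptions, since the actual HP-UX decay formula is not public:

```python
TICK_MS = 10          # one clock tick
DECAY_INTERVAL = 4    # priority is lowered every 4 ticks (40 ms)
TIMESLICE = 10        # tunable kernel parameter, in 10 ms units

# A thread is forced out after TIMESLICE x 10 ms of CPU time
quantum_ms = TIMESLICE * TICK_MS
print(quantum_ms)  # 100 ms with the defaults

def decayed_priority(prio, cpu_ticks, k=1):
    """Toy linear decay: numerically raise (i.e. worsen) the priority by k
    for every full 40 ms of CPU consumed; 255 is the floor (lowest)."""
    return min(255, prio + k * (cpu_ticks // DECAY_INTERVAL))

# A thread starting at 178 that burns a full 100 ms quantum
print(decayed_priority(178, TIMESLICE))
```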

Some known problems:

  • The scheduler assumes that all threads are of equal importance: this can be corrected by using nice(1) to slow down or accelerate the priority change while the process has no thread running.
  • CPU-hungry transactions tend to get bogged down at low priorities: the more CPU they need, the less they get.



What happens with SMPs? I covered the load balancing algorithm in https://christianbilien.wordpress.com/2007/03/04/hp-ux-processor-load-balancing-on-smps and https://christianbilien.wordpress.com/2007/03/05/hp-ux-processor-load-balancing-on-smps-22.

Please read the next post which will be more specific about HPUX_SCHED_NOAGE.

March 25, 2007

HP-UX vpar memory: granule size matters

Filed under: HP-UX — christianbilien @ 6:55 pm

Memory is normally assigned to vPars in units called granules (although the vparcreate/vparmodify commands specify memory in multiples of 1 MB, the vPar monitor rounds up to the next multiple of the granule size). As the granule size is specified when the vPar database is created and cannot be changed without recreating the virtual partition database, care must be taken to choose an appropriate granule size when the first vPar is created. Since this is a fairly complex subject, I thought the rules deserved a note.

PA-RISC

Each vPar requires one ILM granule below 2 GB to load its kernel, and vpmon itself uses one granule below 2 GB. Therefore (2 GB ÷ granule size) - 1 = maximum number of vPars. For example, 7 is the maximum number of vPars for an ILM granule size of 256 MB (2 GB ÷ 256 MB - 1 = 7).
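The rule is easy to check numerically (a small sketch; sizes are in MB, and the `max_vpars` helper is just my shorthand for the formula above):

```python
def max_vpars(granule_mb, low_mem_mb=2048):
    # Each vPar needs one ILM granule below 2 GB for its kernel,
    # and vpmon itself consumes one granule below 2 GB.
    return low_mem_mb // granule_mb - 1

print(max_vpars(256))   # 256 MB granules: at most 7 vPars
print(max_vpars(128))   # 128 MB granules: at most 15 vPars
```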

Integrity (Itanium)

There is a platform-dependent maximum to the number of CLM granules per cell and of ILM granules per nPar. These values can be displayed using the vparenv command. Remember that memory size ÷ granule size <= maximum number of granules.

 

Example:

# vparenv

vparenv: The next boot mode setting is “vPars”.
vparenv: The ILM granule size setting is 128.
vparenv: The CLM granule size setting is 128.
vparenv: Note: Any changes in the above settings will become effective only after the next system reboot.
vparenv: Note: The maximum possible CLM granules per cell is 512.
vparenv: Note: The maximum possible ILM granules for this system is 1024

Given the values above, the total amount of CLM per cell must be less than 64 GB (512 x 128 MB) and the total amount of ILM in the nPar must be less than 128 GB (1024 x 128 MB).
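The same check can be scripted (a sketch using the granule size and granule counts reported by the vparenv output above):

```python
def max_memory_mb(granule_mb, max_granules):
    # Memory size / granule size must not exceed the platform's granule limit,
    # so the ceiling on memory is simply granule size times the granule count.
    return granule_mb * max_granules

# Values from the vparenv output above (128 MB granules)
print(max_memory_mb(128, 512) // 1024)    # CLM ceiling per cell, in GB
print(max_memory_mb(128, 1024) // 1024)   # ILM ceiling per nPar, in GB
```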

Matching firmware and vpar granule size (Integrity only)

On Integrity systems the memory is divided into granules by the firmware, and it is critical that the firmware value for the granule size matches the size in the vPars database. You can examine and modify the firmware setting using the vparenv command (on PA-RISC systems the memory is divided into granules by the monitor, and there is no firmware setting). You can ensure the firmware is updated with the same size as the database by specifying the y option: vparcreate -g ilm:Mbytes:y -g clm:Mbytes:y. I am not sure what use could be made of diverging granule sizes.

Memory partitioning strategy: avoiding design traps on high end HP-UX systems. CLM and ILM (2/2)

Filed under: HP-UX — christianbilien @ 6:12 pm

 

CLM and ILM

As seen in the first post on this topic, since HP-UX 11i v2, and only when cells are dual-core capable (PA-RISC or Itanium 2), it is possible to identify memory on a cell or across an nPar as non-interleaved. This is called Cell-Local Memory, or CLM. CLM can be configured as a quantity or percentage of an individual cell’s memory, or as a quantity or percentage of the memory across the entire nPar. Interleaved memory (ILM) is memory taken from the cells of the system and mixed together in a round-robin fashion. With processors on various cells accessing interleaved memory, the average access time will be uniform. In 11i v1 all memory is designated as ILM.

The designation of memory as ILM vs. CLM is done at the nPar level (parcreate or parmodify). You can then allocate it to one or more of your vPars (vparcreate or vparmodify).

Cell-local memory (CLM) can still be accessed by any processor, but processors on the same cell will have the lowest access latency; access by processors in other cells will have higher latencies. It is always better to use ILM than to access CLM configured in another cell. Note that CLM can also handle the case where there is an uneven amount of memory in the cells: the delta would be configured as CLM.

Psets

CLM and Processor Sets (Psets) can be used together to avoid the inconsistencies of ccNUMA almost entirely. In this context, a locality domain (ldom) is defined as the CPUs and memory required to run a thread. A Pset is a logical grouping of CPUs, a CPU partition so to speak. Oracle processes bound to a given Pset get run time only on the CPUs assigned to that Pset. ccNUMA effects are eliminated because the data and the CPUs are on the same cell or ldom.
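As a hedged sketch of the idea (psrset syntax from memory, so check psrset(1M) on your system; the CPU numbers, the Pset id and the PID placeholder are all hypothetical):

```shell
# Create a processor set from CPUs 2 and 3, ideally CPUs located on one cell;
# psrset prints the id of the new Pset
psrset -c 2 3

# Bind an Oracle process to the new Pset (here assuming it got id 1)
psrset -b 1 <oracle_pid>

# Combined with CLM configured on the same cell, the threads and the memory
# they touch stay local, which is what removes the ccNUMA variability
```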
