Christian Bilien’s Oracle performance and tuning blog

June 14, 2007

Spotlight on Oracle replication options within a SAN (1/2)

Filed under: HP-UX,Oracle,Solaris,Storage — christianbilien @ 7:40 pm

Some interesting issues face the many sites wishing to replicate data bases between two distant sites. One of the major decisions is HOW the replication will be performed: what are the options, and what are their pros and cons? I’ll start with generalities and then present some unitary tests performed in a Solaris/ASM/VxVM/EMC DMX environment.

1. The initial consideration is synchronous vs. asynchronous replication.


  • Synchronous means that the I/O has to be posted on the remote site for the transaction to be validated. Array based replications, such as HP’s Continuous Access or EMC’s SRDF, will post the I/O from the local array cache to the remote one, then wait for the ack to come back before acknowledging the I/O to the calling program. The main component of the overall response time is the time it takes to write from the local cache to the remote cache and for the acknowledgment to come back. This latency is of course not felt by read accesses, but write time is heavily impacted (see the tests at the bottom of this post). Applications that wait heavily on “log file sync” events are the most sensitive to the synchronous write mechanism. I am preparing a post about the distance factor, i.e. how distance impacts response times.
  • Another aspect of synchronous replication is the bottleneck the replication will go through. Assuming a couple of 2GB/s replication ports, the replication bandwidth will be 4GB/s. It will need to accommodate the whole storage array write throughput, thereby potentially increasing the latency because processors will be busier, I/O will wait on array cache flushes and on other latches, etc.
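The two bullets above can be turned into a back-of-the-envelope model. This is only a sketch under stated assumptions: the 5 microseconds per kilometre figure is the usual approximation for light in fibre, the cache service times are hypothetical inputs, and the number of link round trips per write is left as a parameter because it depends on the protocol and the array.

```python
# Rough model of one synchronous replicated write; not a vendor formula.
# Assumption: light travels through fibre at about 200,000 km/s,
# i.e. roughly 5 microseconds per kilometre, one way.
FIBRE_US_PER_KM = 5.0


def sync_write_latency_us(local_cache_us, remote_cache_us, distance_km, round_trips=1):
    """Estimated service time of one synchronous write, in microseconds.

    local_cache_us and remote_cache_us are hypothetical cache write times;
    round_trips is how many times the link is crossed back and forth.
    """
    link_us = 2 * distance_km * FIBRE_US_PER_KM * round_trips
    return local_cache_us + link_us + remote_cache_us


def replication_bandwidth(ports, per_port):
    """Aggregate bandwidth of the replication ports (same unit as per_port)."""
    return ports * per_port


# Hypothetical figures: 200 us in each cache, sites 50 km apart.
print(sync_write_latency_us(200, 200, 50))   # 900.0
print(replication_bandwidth(2, 2))           # 4, as in the two-port example
```

Reads never pay the link term, which is why only the write path (and “log file sync” in particular) is affected.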


To preserve consistency, asynchronous replication must implement some sequence-stamping that ensures that write operations at the remote node occur in the correct order. Data loss may thus occur with EMC SRDF/A (A stands for asynchronous) or HP’s CA asynchronous, but no data corruption should be experienced.
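The sequence-stamping idea can be illustrated with a toy receiver that holds back any write arriving ahead of its turn. This is a minimal sketch, not how any particular array implements it; the class and attribute names are made up for the illustration.

```python
class RemoteApplier:
    """Applies replicated writes strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = 1   # next sequence number allowed to be applied
        self.pending = {}   # writes that arrived early, keyed by sequence number
        self.volume = {}    # block address -> data (stand-in for the remote LUN)

    def receive(self, seq, block, data):
        """Buffer the write, then apply every write that is now in order."""
        self.pending[seq] = (block, data)
        while self.next_seq in self.pending:
            blk, payload = self.pending.pop(self.next_seq)
            self.volume[blk] = payload
            self.next_seq += 1


remote = RemoteApplier()
remote.receive(2, 10, "new")   # arrives first, but is held back
remote.receive(1, 10, "old")   # unblocks both writes, applied as 1 then 2
print(remote.volume)           # {10: 'new'}
```

If the link fails before write 1 arrives, write 2 is never applied: the remote copy is older than the primary (data loss) but still consistent.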

2. Host based vs. array based replication

Data Guard and volume managers (including the ASM) can be used to mirror the data base volumes from one array to the other one.

Data Guard

Data Guard works over TCP/IP.


  • IP links are common, relatively cheap and easy to set up.


  • Synchronous replication over IP means QOS (Quality Of Service) procedures to avoid other services clogging the links.
  • The commits must wait for the writes in the remote log file. The remote data base is asynchronously loaded from the remote log files. The more DML intensive the primary data base is, the wider the potential gap.

Volume management

Volume management is the only available option for some geographical clusters. RAC over Sun Cluster, RAC over ASM without 3rd party clusters, and MC/ServiceGuard with the Cluster File System do not offer any other alternative (take a look at RAC geographical clusters and 3rd party clusters (HP-UX) for a discussion of RAC on geo clusters).

ASM is also a volume manager, as it is used for mirroring from one storage array to the other.


  • Fast (see the unitary tests). Volume managers also perform best in aggregate: all of the storage array’s replicated writes go through a set of dedicated ports, which ends up bottlenecking on some array processors while others are mostly idle, whereas VM writes are spread over all the array processors. So both scalability and unitary write speed favor volume management mirroring.


  • Harder to manage and to maintain. Say that you want to configure an ASM with a lot of raid groups. Assuming the power_limit set to 0 prevents the automatic rebuild of the mirrored raid group because the rebuild would otherwise occur locally, you’ll have to add the newly created raid group into the rebuild script. Worse, you may forget it and realize one raid group is not mirrored the day the primary storage array fails. The most classic way to fail a cluster switchover is to forget to reference newly created file systems or tablespaces.
  • Usually works over Fiber Channel, although FC-IP can be used to extend the link distance.
  • No asynchronous replication except for the Veritas Volume Replicator which is to my knowledge the only VM able to perform async writes on a remote array.

Array based replication


  • Usually easier to manage. The maintenance and switchover tasks may also be offloaded to the storage team. Host based replication management puts the ball either in the DBA camp (if using ASM) or in the sys admins’ camp (for other VMs).
  • Asynchronous replication
  • Vendors offer remote monitoring
  • Snapshots can be made on the distant sites for development, report or other purposes.


  • Performance as seen above.
  • Same limitations with the Fiber Channel.



May 25, 2007

Oracle ISM and DISM: more than a no paging scheme (2/2)… but be careful with Solaris 8

Filed under: Oracle,Solaris — christianbilien @ 9:39 pm

This post is the DISM follow-up to the ISM-only post, Oracle ISM and DISM: more than a no paging scheme (1/2).

DISM (Dynamic Intimate Shared Memory) is the pageable variant of ISM. DISM was made available on Solaris 8. The DISM segment is attached to a process through the shmat system call. SHM_DYNAMIC is a new flag that tells shmat to create Dynamic ISM rather than the SHM_SHARE_MMU flag used for ISM.

DISM is like ISM except that it isn’t automatically locked. The application, not the kernel, does the locking, which is done by using mlock. Kernel virtual-to-physical memory address translation structures are shared among processes that attach to the DISM segment. This is one of the DISM benefits: saving kernel memory and CPU time. As with ISM, shmget creates the segment. The shmget size specified is the maximum size of the segment. The size of the segment can be larger than physical memory. Enough disk swap should be made available to cover the maximum possible DISM size.

Per the Oracle 10gR2 installation guide on Solaris platforms:

Oracle Database automatically selects ISM or DISM based on the following criteria:

  • Oracle Database uses DISM if it is available on the system, and if the value of the SGA_MAX_SIZE initialization parameter is larger than the size required for all SGA components combined. This enables Oracle Database to lock only the amount of physical memory that is used.
  • Oracle Database uses ISM if the entire shared memory segment is in use at startup or if the value of the SGA_MAX_SIZE parameter is equal to or smaller than the size required for all SGA components combined. 
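The two criteria quoted above boil down to a single comparison. The function below is only a restatement of the installation guide’s wording, with hypothetical argument names; it is not Oracle’s actual code.

```python
def select_shared_memory(dism_available, sga_max_size, sga_components_total):
    """Return "DISM" or "ISM" following the two quoted criteria.

    sga_components_total is the size required for all SGA components
    combined; both sizes must use the same unit.
    """
    if dism_available and sga_max_size > sga_components_total:
        return "DISM"
    return "ISM"


# Sizes in MB, purely illustrative.
print(select_shared_memory(True, 4096, 3072))   # DISM: room left to grow the SGA
print(select_shared_memory(True, 3072, 3072))   # ISM: SGA_MAX_SIZE equals the components
```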

I ran a few logical I/O intensive tests aimed at highlighting some possible performance loss when moving from ISM to DISM (as pages are not permanently locked in memory, swap management has to be invoked), but I couldn’t find any meaningful difference. Most of the benefits I described in the Oracle ISM and DISM: more than a no paging scheme (1/2) post still apply, except for the non-support of large pages in Solaris 8 (see below).

Since DISM requires the application to lock memory, and since memory locking can only be carried out by applications with superuser privileges, the $ORACLE_HOME/bin/oradism daemon runs as root using setuid (early 9i releases had a different mechanism, using RBAC instead of setuid).

Solaris 8 problems:

Dynamic Intimate Shared Memory (DISM) was introduced in the 1/01 release of Solaris 8 (Update 3). DISM was supported by Oracle9i for SGA resizing.

On a 10gR2 database running on Solaris 10, it can be seen that large pages are used by DISM:

pmap -sx 19609| more

19609: oracleSID11 (LOCAL=NO)

Address Kbytes RSS Anon Locked Pgsz Mode Mapped File
0000000380000000 16384 16384 4M rwxs- [ dism shmid=0x70000071 ]

Per the following Sun Solve note:

“In this first release, large MMU pages were not supported. For Solaris 8 systems with 8GB of memory or less, it is reasonable to expect a performance degradation of up to 10% compared to ISM, due to the lack of large page support in DISM […] Sun recommends avoiding DISM on Solaris 8 either where SGAs are greater than 8 Gbytes in size, or on systems with a typical CPU utilization of 70% or more. In general, where performance is critical, DISM should be avoided on Solaris 8. As we will see, Solaris 9 Update 2 (the 12/02 release) is the appropriate choice for using DISM with systems of this type.”

Another document from Sun advocates the use of DISM on Solaris 8 primarily for machine maintenance, such as removing a memory board, but it fails to mention that large MMU pages are not supported.

May 14, 2007

Oracle ISM and DISM: more than a no paging scheme (1/2)

Filed under: Oracle,Solaris — christianbilien @ 12:54 pm

This post only deals with ISM. I’ll write a second one about Dynamic ISM (DISM).

A long standing problem on any platform has been the probability that part of the Oracle memory segment gets swapped out, and that what is a relatively fast memory access turns into a horrid bottleneck. Oracle 9i on Solaris made use of an interesting feature named Intimate Shared Memory (ISM), which in fact does a lot more than one may initially think.

The very first benefit of ISM (not DISM for the time being) is that the shared memory is locked by the kernel when the segment is created: the memory cannot be paged out. A small price to pay for the locking mechanism is that sufficient available unlocked memory must exist for the allocation to succeed.

Because the SHM_SHARE_MMU flag is set in the shmat system call to set up the shared segment as ISM, there are lesser-known benefits, which may be of higher importance than the no paging scheme on CPU-bound systems.


Shared kernel virtual-to-physical translation

The virtual to physical mapping is one of the most expensive tasks any modern operating system has to perform. The hardware Translation Lookaside Buffer (TLB) is a physical cache for the slower in-memory tables. The Translation Storage Buffer (TSB) is a further, in-memory translation cache. As even in Solaris 10 the standard System V algorithm is still to have a private virtual address space for each process, aliasing occurs (several virtual addresses map to the same physical address).

ISM allows the sharing of kernel virtual-to-physical memory between processes that attach to the shared memory, saving considerable translation slots in the hardware TLB. This can be monitored on Solaris 10 by trapstat:

# trapstat -T

cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
512 u   8k|      1761  0.1      2841  0.2 |      2594  0.1      2648  0.2 | 0.5
512 u  64k|         0  0.0         0  0.0 |         8  0.0         0  0.0 | 0.0
512 u 512k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
512 u   4m|        20  0.0         1  0.0 |         4  0.0         0  0.0 | 0.0
512 u  32m|         0  0.0         0  0.0 |        11  0.0         0  0.0 | 0.0
512 u 256m|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0

trapstat shows both instruction and data misses in both the TLB and the TSB.

Solaris 8 does not have trapstat, so the trick is to use cpustat:

On a non-idle Oracle system using ISM as seen below,

mpstat 5 5

CPU minf mjf xcal intr ithr  csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0  282  728  547 1842  283  329   62  10  3257  40   8 25  27
  1    0   0  122  227    2 1954  284  327   55   9  3639  39   6 29  26
  2    0   0  257 1578 1399 1887  288  330   58   9  3287  35  11 27  27
  3    1   0  313 1758 1501 1933  285  328   70  12  3437  36   8 29  27

cpustat -c pic0=Cycle_cnt,pic1=DTLB_miss 1

 time cpu event      pic0  pic1
1.010   3  tick 192523799 29658
1.010   2  tick 270995815 28499
1.010   0  tick 225156772 29621
1.010   1  tick 234603152 29034

psrinfo -v

Status of processor 3 as of: 05/14/07 12:48:53
Processor has been on-line since
03/11/07 10:35:22.
The sparcv9 processor operates at 1062 MHz,
and has a sparcv9 floating point processor.

cpustat shows that on processor 3, we have 29658 dTLB misses in this one-second sample. UltraSparc III will use somewhere between 50 cycles (most favourable case: no TLB entry miss) and 300 cycles (worst case: a memory load has to be performed to compute the translation) to handle dTLB accesses. Handling the misses will take 1.5 million cycles per second in the best scenario and 8.9 million in the worst. At 1062 MHz, the time spent handling dTLB misses is only between 0.14% and 0.84%!
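The arithmetic behind these percentages is simple enough to script. The sketch below reuses the figures quoted in the text (29658 misses, 50 to 300 cycles per miss, a 1062 MHz clock) and treats the 1.010 s cpustat sample as one second; the function name is mine.

```python
def dtlb_overhead_pct(misses_per_sec, cycles_per_miss, clock_hz):
    """Share of one CPU's cycles spent handling dTLB misses, in percent."""
    return 100.0 * misses_per_sec * cycles_per_miss / clock_hz


CLOCK_HZ = 1062e6   # from psrinfo -v
MISSES = 29658      # pic1 (DTLB_miss) on processor 3

best = dtlb_overhead_pct(MISSES, 50, CLOCK_HZ)
worst = dtlb_overhead_pct(MISSES, 300, CLOCK_HZ)
print(round(best, 2), round(worst, 2))   # 0.14 0.84
```

Feeding in the pic1 value of another processor from the same cpustat sample gives its own range.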

Large pages.

From Solaris 2.6 through Solaris 8, large pages are only available through the use of the ISM (using SHM_SHARE_MMU).

Solaris 8


Solaris 10 :default pagesize


Supported page sizes:

pagesize -a




ISM page size on Solaris 10 (look at the pgsz column). It looks like Oracle is using the largest page available


pmap -sx 25921

25921: oracleSID1 (LOCAL=NO)

Address Kbytes RSS Anon Locked Pgsz Mode Mapped File
00000001064D2000 24 24 24 8K rwx-- [ heap ]
00000001064D8000 32 8 - rwx-- [ heap ]
0000000380000000 1048576 1048576 1048576 4M rwxsR [ ism shmid=0x6f000078 ]

AMD 64/x64. The AMD Opteron processor supports both 4Kbyte and 2Mbyte page sizes:

pagesize -a



x86. The implementation of Solaris on x86 processors provides support for 4Kbyte pages only.


This post will be followed up by a discussion about DISM, the differences with ISM and a word of caution about using DISM on Solaris 8:
Oracle ISM and DISM: more than a no paging scheme…but be careful with Solaris 8 (2/2)

May 1, 2007

Two useful hidden parameters: _smm_max_size and _pga_max_size.

Filed under: Oracle — christianbilien @ 8:07 pm

You may have heard of the great “Burleson” vs “Lewis” controversy about the spfile/init hidden pga parameters. You can have a glimpse at the battlefield on Jonathan Lewis’s side and, to be fair, on the opposite side. If you have the courage to fight your way through the intricacies of the arguments, and also take into account that 1) this war only seems to apply to 9i and 2) the above url (at least the one from Don Burleson) may have been rewritten since Jonathan Lewis made his comments, you may be left with a sense of misunderstanding, as I was when I went through it.

1) Is it worth fiddling with those undocumented parameters ?

The answer is yes (at least for me). Not that I am a great fan of setting those parameters for fun in production, it is just that I encountered thrice in a year a good reason for setting them.

2) How does it work ?

I tried to understand what they meant from various Metalink notes and web searches, and I then verified the values, hoping not to miss something.

I’ll avoid a discussion of parallel operations, as the settings are more complex and depend upon the degree of parallelism. I’ll spend some time investigating this, but for now I’ll stick with what I tested.

1. Context

The advent of automatic pga management (if enabled!) in Oracle 9i was meant to be a relief from the *_area_size parameters, which dictated how large a sort area could grow before the temp tablespace would be used. Basically, the sort area sizes were acting as a threshold: your sort was performed in memory if the required sort memory was smaller than the threshold, and it went to disk if larger. The trouble with this strategy was that the sort area had to be small enough to accommodate many processes sorting at the same time, while a large sort running alone on the instance could only use up to the sort area size before spilling to disk. Pooling the sort areas under the pga umbrella removed these shortcomings. However, the Oracle designers had to cope with the possibility of a process hogging the sort memory, leaving no space for others. This is why some limitations on the sort memory available to a workarea and to a single process were put in place, using a couple of hidden parameters:

_smm_max_size: Maximum workarea size for one process

_pga_max_size: Maximum PGA size for a single process

The sorts will go to disk if either of those two thresholds is crossed.

2. Default values

9i (and probably 10gR1, which I did not test):

_pga_max_size: default value is 200MB.

_smm_max_size: default value is the lesser of 5% of pga_aggregate_target and 50% of _pga_max_size. A ceiling of 100MB also applies. The ceiling is hit when pga_aggregate_target exceeds 2GB (5% of 2GB = 100MB), or when _pga_max_size is set to a higher value than the default AND pga_aggregate_target is lower than 2GB.


10gR2: pga_aggregate_target now drives _smm_max_size in most cases:

pga_aggregate_target <=500MB, _smm_max_size = 20%* pga_aggregate_target

pga_aggregate_target between 500MB and 1000MB, _smm_max_size = 100MB

pga_aggregate_target >1000MB, _smm_max_size = 10%* pga_aggregate_target

and _smm_max_size in turn now drives _pga_max_size: _pga_max_size = 2 * _smm_max_size

A pga_aggregate_target larger than 1000MB will now allow much higher default thresholds in 10gR2: pga_aggregate_target set to 5GB will allow an _smm_max_size of 500MB (was 100MB before) and _pga_max_size of 1000MB (was 200MB).
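The default rules listed above can be captured in two small functions, which makes it easy to check what a given pga_aggregate_target should yield before querying the instance. This is a sketch of the rules as described in this post only: sizes are in MB, the function names are mine, and the handling of the exact 500MB and 1000MB boundaries is an assumption.

```python
def defaults_9i(pga_aggregate_target_mb, pga_max_size_mb=200):
    """(_smm_max_size, _pga_max_size) in MB under the 9i rules above."""
    smm = min(pga_aggregate_target_mb / 20,   # 5% of pga_aggregate_target
              pga_max_size_mb / 2,            # 50% of _pga_max_size
              100)                            # 100MB ceiling
    return smm, pga_max_size_mb


def defaults_10gr2(pga_aggregate_target_mb):
    """(_smm_max_size, _pga_max_size) in MB under the 10gR2 rules above."""
    if pga_aggregate_target_mb <= 500:
        smm = pga_aggregate_target_mb / 5     # 20%
    elif pga_aggregate_target_mb <= 1000:
        smm = 100
    else:
        smm = pga_aggregate_target_mb / 10    # 10%
    return smm, 2 * smm


print(defaults_9i(5000))      # (100, 200)
print(defaults_10gr2(5000))   # (500.0, 1000.0)
```

The 5GB case reproduces the figures quoted above, taking 5GB as 5000MB the way the text does.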

You can get the hidden parameter values by querying x$ksppcv and x$ksppi as follows:

select a.ksppinm name, b.ksppstvl value from sys.x$ksppi a, sys.x$ksppcv b where a.indx = b.indx and a.ksppinm = '_smm_max_size';

select a.ksppinm name, b.ksppstvl value from sys.x$ksppi a, sys.x$ksppcv b where a.indx = b.indx and a.ksppinm = '_pga_max_size';

April 19, 2007

RAC geographical clusters and 3rd party clusters (HP-UX) (2/3)

Filed under: HP-UX,Oracle,RAC — christianbilien @ 12:38 pm

This is the second post in a series of 3 related to geographical clusters (the first post was focusing on Solaris RAC, the last will be about cluster features such as fencing, quorum, cluster lock, etc.). It would be beneficial to read the first geo cluster post which aimed at Solaris as many issues, common to HP-UX and Solaris, will not be rehashed.


Solaris: to make my point straight, I think that the two major differences between the group of generic clusters Sun Cluster Services/ Veritas Cluster Services and the Oracle CRS clusterware are storage and membership. Their nature is however not the same: the storage chapter is essentially related to available options, with ASM being used as the sole option for mirroring in CRS only clusters. This may be a major drawback for large databases. Membership is a different story: to put it simply, Sun says that SCSI3-PR (persistent reservation) closes the door to data base corruptions in case of node eviction.

HP-UX: The story is not much different on HP-UX, although I would add one more major feature that Service Guard has and that does not exist on Solaris systems: the concept of a tie-breaking node.

HP-UX RAC storage options

As with Solaris, we have Service Guard, the generic cluster software from the hardware provider, Veritas Cluster Services and the Oracle clusterware alone.

10gR1 : The lack of mirroring for the OCR and voting disks makes a third party cluster almost compulsory (the “almost” is a word of caution because SAN mirroring technologies, based for example on virtualized luns might be used).


  • RAC+ Service Guard (with Service Guard Extension for RAC):
  1. SLVM, the shared volume option of the legacy HP-UX LVM, had been the only option since the days of Oracle Parallel Server. It only allows raw access to data files; hence archivelogs must be on “local” file systems (non shared volumes). A number of limitations exist, such as having to bring down all nodes but one for Volume Group maintenance.
  2. Since HP-UX 11iv2 and Service Guard 11.16, and after announcing for 2 years a port of the acclaimed TruCluster Advanced File System, HP finally closed an agreement with Veritas for the integration of the Veritas Cluster File System (CFS) and Cluster Volume Manager (CVM) into Service Guard. This is essentially supported by some extra SG commands (cfsdgadm and cfsmntadm), but does not provide much more functionality in the storage options on top of the Veritas Storage Foundation for HA. All of the Oracle files, including OCR and vote, can be put on the CFS. This is because the CFS can be brought up before the CRS starts.
  • RAC + Veritas Cluster Services (VCS): The clusterware is Veritas based, but as you can guess, CFS and CVM are used in this option.
  • RAC without any third party cluster: As on Solaris, the only storage option with the Oracle clusterware alone is ASM, which brings for large data bases the same mirroring problems as found on Solaris.

Membership, split brain and amnesia

Split brain and tie breaking

Just like Sun Cluster Services and the Oracle clusterware, Service Guard has a tiebreaker cluster lock to avoid a split into two subclusters of equal size, a situation bound to arise when network connectivity is completely lost between some of the nodes. This tiebreaker is storage based (either a physical disk for SCS, an LVM volume for SG or a file for the CRS). However, geo clusters made of two physical sites make this tiebreaker configuration impossible: on which site do you put the extra volume, disk or files? A failure of the site chosen for 2 of the 3 volumes will cause the loss of the entire cluster. A third site HAS to be used. Assuming Fiber Channel is not extended to the 3rd site, an iSCSI or NFS configuration (on a supported NFS provider such as NetApp) may be the most convenient option if an IP link exists.

With the Quorum Server, Service Guard can provide a tie breaking service: this can be done either as a stand-alone Quorum Server or as a Quorum Service package running on a cluster which is external to the cluster for which Quorum Service is being provided. Although the 3rd site is still needed, a stand-alone server reachable via IP will resolve the split brain. The CRS is nonetheless still a cause of concern: the voting disk on the 3rd site must be seen by the surviving nodes. The Quorum Server may be a cluster service which belongs to a second SG cluster, but its CFS storage is not accessible from the first cluster. At the end of the day, although the Quorum Server is a nice SG feature, it cannot solve the problem of the CRS voting disks, which must still be accessed through iSCSI or NFS.

I/O Fencing

The same paragraph I wrote for Solaris can be copied and pasted:

Both VCS and SG use SCSI3 persistent reservation via ioctl, and I/O fencing to prevent corruption. Each node registers a key (it is the same for all the node paths). Once node membership is established, the registration keys of all the nodes that do not form part of the cluster are removed by the surviving nodes of the cluster. This blocks write access to the shared storage from evicted nodes.

April 17, 2007

RAC geographical clusters and 3rd party clusters (Sun Solaris) (1/3)

Filed under: Oracle,RAC,Solaris — christianbilien @ 9:06 pm

As a word of introduction, a geographical RAC cluster is a RAC where at least one node is physically located in a remote location, and DB access is still available should one of the sites go down.

I found that many customers wishing to implement a RAC geo cluster get confused by vendors when it comes to the RAC relationships (or should I say dependencies) with third party clusters. I also have the impression that some Oracle sales reps tend to add to this confusion by steering troubled prospects one way or another, depending on their particular interest with a hardware/cluster 3rd party provider.

Let’s first say that I am here just addressing the RAC options. Assuming some other applications need clustering services, a third party cluster will be necessary (although some provisions, still in infancy, exist within the CRS to “clusterize” non-RAC services). I’ll also deliberately not discuss NAS storage as I never had the opportunity to work or even consider a RAC/NAS option (Pillar, NetApp, and a few others are trying to get into this market).

This first post is about RAC geo clusters on Solaris. RAC geo clusters on HP-UX are covered in the second post of this series.

The Solaris compatibility matrix is located at

I consider two cluster areas to be strongly impacted by the “third party cluster or not” choice: storage and membership strategy. Some may also argue about private interconnect protection against failure, but since IPMP may be used for the RAC-only option, and although some technical differences exist, I think that this is a matter of much less importance than storage and membership.


  • 10gR1 was very special as it did not have any Oracle protection for the vote and ocr volumes. This lack of functionality had a big impact on geo clusters as some third party storage clustering was required for vote and OCR mirroring.
  • 10gR2: The options may not be the same for OCR/vote, data base files, binaries and archivelog files. Although putting archivelog files on a clustered file system saves NFS mounts, binaries and archivelogs can usually be located on their own “local” file system, which may be on the array but only seen from one node. The real issues are on one hand the DB files, and on the other hand the OCR and voting disk, which are peculiar because they must be seen when the CRS starts, BEFORE the ASM or any Oracle dependent process can be started.
  • RAC + Sun Cluster (SCS): The storage can either be a Solaris volume manager and raw devices or QFS; GFS is not supported. ASM may be used but offers little in my opinion compared to a volume manager. ASM used for mirroring suffers from the mirror reconstruction that has to be performed when one of the sites is lost, and from the lack of any feature similar to copying modified blocks only (the way storage mirroring does).
  • RAC + Veritas Cluster Services (VCS): the Veritas cluster file system (the VxFS cluster version), running over the Cluster Volume Manager (the VxVM cluster version), is certainly a good solution for those averse to raw devices/ASM. All of the Oracle files, including OCR and vote, can be put on the CFS. This is because the CFS can be brought up before the CRS starts.
  • RAC without any third party cluster: ASM has to be used for storage mirroring. This is easier to manage and cheaper, although mirrored disk group reconstruction is a concern when volumes are large. I also like to avoid the coexistence of two clusters (RAC on top of SCS or VCS).

Membership, split brain and amnesia

A number of membership issues are addressed differently by SCS/VCS and the CRS/CSS. It is beyond the scope of this post to explain fencing, split brain and amnesia. There are really two worlds here: on one hand, Oracle has a generic clusterware membership system across platforms, which avoids system and storage dependency; on the other hand, VCS and SCS take advantage of SCSI persistent reservation ioctls. Veritas and Sun both argue that Oracle’s node eviction strategy may create situations in which a node is evicted from the cluster but not yet forced to reboot. Other instances may then start instance recovery while the failed instance still writes to the shared storage. Oracle says that database corruption is prevented by using the voting disk, the network, and the control file to determine when a remote node is down. This is done in different, parallel, independent ways. I am not going to enter the war on one side or another; let’s just recall the basic strategies:

  • CSS: this process uses both the interconnects and the voting disks to monitor remote nodes. A node must be able to access strictly more than half of the voting disks at any time (this is the reason for the odd number of voting disks), which prevents split brain. The CSS misscount is 30s, which is the time a node is allowed to miss network heartbeats before eviction.
  • Both VCS and SCS use SCSI3 persistent reservation via ioctl, and I/O fencing to prevent corruption. Each node registers a key (it is the same for all the node paths). Once node membership is established, the registration keys of all the nodes that do not form part of the cluster are removed by the surviving nodes of the cluster. This blocks write access to the shared storage from evicted nodes.
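The “strictly more than half” rule in the CSS bullet is worth spelling out, because it explains both the odd number of voting disks and why two sites are not enough. The functions below are an illustration of the rule as stated in this post, not of the actual CSS implementation.

```python
def node_survives(accessible_votes, total_votes):
    """A node stays in the cluster only if it sees a strict majority of voting disks."""
    return accessible_votes > total_votes / 2


def tolerated_disk_failures(total_votes):
    """How many voting disks can be lost before no node can keep a majority."""
    return (total_votes - 1) // 2


print(node_survives(2, 3))          # True
print(node_survives(1, 2))          # False: half is not a majority
print(tolerated_disk_failures(2))   # 0, no better than a single disk
print(tolerated_disk_failures(3))   # 1
```

With three voting disks spread over two sites, one site necessarily holds two of them, and losing that site leaves the survivors with one disk out of three.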

One last bit: although not a mainstream technology (and it won’t improve now that RDS over Infiniband is an option on Linux and soon on Solaris), I believe SCS is needed to allow RSM over SCI/ SunFire Link to be used. The specs show quite an impressive latency of a few micro seconds.

April 2, 2007

No age scheduling for Oracle on HP-UX: HPUX_SCHED_NOAGE (2/2)

Filed under: HP-UX,Oracle — christianbilien @ 8:31 pm

The first post on this topic was presenting the normal HP-UX timeshare scheduling, under which all processes will run. Its main objective was to give the rules for context switching (threads releasing CPUs and getting a CPU back), before comparing it to the timeshare no age scheduling.

The decaying priority scheme has not been altered since the very early HP-UX workstations, at a time when I assume data bases were a remote if not non existent consideration to the HP-UX designers.

As a thread gets weaker because of a decreasing priority, its probability of being switched out will increase. If this process holds a critical resource (an Oracle latch, not even mentioning the log writer), this context switch will cause resource contention should other threads wait for the same resource.

A context switch (you can see them in the pswch/s column of sar -w) is a rather expensive task: the system scheduler copies all the registers of the switched-out thread into a private area known as the UAREA. The reverse operation needs to be performed for the incoming thread: its UAREA is copied into the CPU registers. Another problem of course occurs when CPU switches happen on top of thread switches: modified cache lines need to be invalidated and read lines reloaded on the new CPU. I once read that a context switch would cost around 10 microseconds (I did not verify it myself), which is far from negligible. The HP-UX internals manual mentions 10,000 CPU cycles, which would indeed translate into 10 microseconds with a 1 GHz CPU.
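To put the cycle count in perspective, it can be combined with a switch rate such as the one reported by sar -w. A small sketch: the 10,000 cycles and 1 GHz figures come from the paragraph above, while the 5,000 switches per second is a hypothetical load.

```python
def context_switch_us(cycles, clock_hz):
    """Duration of one context switch in microseconds."""
    return cycles * 1e6 / clock_hz


def switching_cpu_pct(switches_per_sec, cycles, clock_hz):
    """Share of one CPU spent on context switching, in percent."""
    return 100.0 * switches_per_sec * cycles / clock_hz


print(context_switch_us(10_000, 1e9))          # 10.0
print(switching_cpu_pct(5_000, 10_000, 1e9))   # 5.0
```

This only counts the register save and restore; the cache effects of a CPU migration mentioned above come on top.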

So when does a context switch occur? HP-UX considers two cases: forced and voluntary context switches. Within the Oracle context, voluntary context switches are most of the time disk I/Os. A forced context switch will occur at the end of a timeslice when a thread with an equal or higher priority is runnable, or when a process returns from a system call or trap and a higher-priority thread is runnable.

Oradebug helps diagnose the LGWR forced and voluntary switches:

SQL> oradebug setospid 2022
Oracle pid: 6, Unix process pid: 2022, image: oracle@mymachine12 (LGWR)
SQL> oradebug unlimit
Statement processed.
SQL> oradebug procstat
Statement processed.
SQL> oradebug tracefile_name
SQL> !more /oracle/product/10.2.0/admin/MYDB/bdump/mydb_lgwr_2022.trc

Voluntary context switches = 272
Involuntary context switches = 167

The sched no age policy was added in HP-UX 11i: it is part of the 178-255 range (the same range as the normal time share scheduler user priorities), but a thread running under this scheduler will not experience any priority decay: its priority will remain constant over time, although the thread may be timesliced. Some expected benefits:

  • Less context-switch overhead (lower cpu utilization)
  • The cpu hungry transactions may complete faster in a CPU busy system (the less hungry transactions may run slower, but as they do not spend much time on the CPUs, it may not be noticeable).
  • May help resolve an LGWR problem: the log writer inherently experiences a lot of voluntary context switches, which helps keeping a high priority. However, it may not be enough on CPU-busy systems.

As the root user, give the RTSCHED and RTPRIO privileges to the dba group: setprivgrp dba RTSCHED RTPRIO. Create the /etc/privgroup file, if it does not exist, and add the following line to it: dba RTSCHED RTPRIO. Finally, update the spfile/init with hpux_sched_noage set to a value between 178 and 255. See rtsched(1).

March 31, 2007

No age scheduling for Oracle on HP-UX: HPUX_SCHED_NOAGE (1/2)

Filed under: HP-UX,Oracle — christianbilien @ 9:00 pm

The HPUX_SCHED_NOAGE initialization parameter has been around for some times, but in most of the situation where I found out it would be useful, sys admins and DBAs alike were a bit scared (to say the least) to use it. However, I cannot see a real threat to using this parameter: some degradation may occur if set inappropriately, but not outages and side effects associated with real time. Here are a couple of notes to explain this parameters usefulness. This first post will present time share generalities, the second one will be more specific about this parameter.

HP-UX scheduling policies can be classified using different criteria. One of them is the priority range. Basically, processes can belong to one of 3 priority ranges: rtprio and rtsched are real-time, while HP-UX time share may be either real-time or not. I am only interested here in the last one: HPUX_SCHED_NOAGE only interacts with the time share mechanism, not with the POSIX (rtsched) or rtprio (the oldest HP-UX real-time system) schedulers.

Time-share scheduling

This is the default HP-UX scheduler: its priority range is from 128 (highest) to 255 (lowest). It is itself divided between system (128-177) and user priorities (178-255). I’ll start by explaining what happens to processes when HPUX_SCHED_NOAGE is not set, in order to build up the case for using it.

1. Normal time-share scheduling (HPUX_SCHED_NOAGE unset)

Each CPU has a run queue, also sometimes called the dispatch queue (the length of each queue can be seen with sar -qM). Let’s start with a single processor: processes using the normal time-share scheduling will experience priority changes over time.

CPU time is accumulated for the running thread (interestingly, this could lead to “phantom” threads which use less than 10ms (a tick) of CPU time per transaction, and for which the probability of priority decay would be low). Every 40ms (4 ticks), the priority of the running thread is lowered. The priority degradation is a function of the CPU time accumulated during those 40ms (I do not know the exact formula, but it is linear). Finally, a thread which has been running for TIMESLICE (default 10) x 10ms will be forced out (this is a forced context switch, seen in Glance). TIMESLICE is a changeable HP-UX kernel parameter. Thus, by default, a forced context switch occurs after a 100ms quantum. The thread with the highest priority among those waiting in the run queue will then be dispatched. Threads waiting (for a semaphore, an I/O, or any other reason) will regain their priority exponentially (I do not know the formula either).
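The tick arithmetic above can be spelled out with a small Python sketch. It only encodes the figures quoted in the text (10ms tick, priority recalculated every 4 ticks, quantum of TIMESLICE ticks); the function names are hypothetical, and the decay formula itself is deliberately left out since it is not known here.

```python
TICK_MS = 10            # one clock tick, as quoted in the text
DECAY_PERIOD_TICKS = 4  # running thread's priority is lowered every 4 ticks (40ms)


def quantum_ms(timeslice: int = 10) -> int:
    """Length of the scheduling quantum in ms for a given TIMESLICE value."""
    return timeslice * TICK_MS


def decay_passes_per_quantum(timeslice: int = 10) -> int:
    """Number of complete 40ms decay periods that fit in one quantum."""
    return timeslice // DECAY_PERIOD_TICKS


if __name__ == "__main__":
    print(quantum_ms())                # 100
    print(decay_passes_per_quantum())  # 2
```

With the default TIMESLICE of 10, a thread that keeps the CPU for its whole quantum would see its priority lowered twice (at 40ms and 80ms) before being forced out at 100ms.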

Some known problems:

  • The scheduler assumes that all threads are of equal importance: this can be corrected by using nice(1) to slow down or accelerate the priority change when the process has no thread running.
  • CPU-hungry transactions tend to get bogged down at low priorities: the more CPU they need, the less they get.


What happens with SMPs? I covered the load balancing algorithm in and

Please read the next post which will be more specific about HPUX_SCHED_NOAGE.

March 28, 2007

Log buffer issues in 10g (2/2)

Filed under: Oracle — christianbilien @ 7:29 pm


The first post on this topic was a quick recap of log buffer generalities. The second will address what’s new in 10gR2.

Part 2: Destaging the log buffer

First bullet: the log buffer is NOT tunable anymore in 10gR2 (whatever the value of the LOG_BUFFER initialization parameter)

From Bug 4592994:

In 10G R2, Oracle combines fixed SGA area and redo buffer [log buffer] together. If there is a free space after Oracle puts the combined buffers into a granule, that space is added to the redo buffer. Thus you see redo buffer has more space as expected. This is an expected behavior.

In 10.2 the log buffer is rounded up to use the rest of the granule. The granule size can be found from the hidden parameter “_ksmg_granule_size”. The log buffer size and granule size can be read from v$sgainfo:

SQL> select * from v$sgainfo;


NAME                                  BYTES RES
-------------------------------- ---------- ---
Fixed SGA Size                      2033832 No
Redo Buffers                       14737408 No
Buffer Cache Size                3405774848 Yes
Shared Pool Size                  520093696 Yes
Large Pool Size                    16777216 Yes
Java Pool Size                     33554432 Yes
Streams Pool Size                         0 Yes
Granule Size                       16777216 No
Maximum SGA Size                 3992977408 No
Startup overhead in Shared Pool   234881024 No
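
As a quick check of the "rounded up to the rest of the granule" behaviour, the v$sgainfo figures above can be added up. This is only a sketch in Python using the three values from the listing; the variable and function names are hypothetical.

```python
# Values copied from the v$sgainfo output above (bytes)
FIXED_SGA = 2_033_832
REDO_BUFFERS = 14_737_408
GRANULE = 16_777_216


def granules_needed(fixed_sga: int, redo_buffers: int, granule: int) -> int:
    """Number of whole granules needed to hold fixed SGA plus redo buffers."""
    combined = fixed_sga + redo_buffers
    return -(-combined // granule)  # ceiling division


if __name__ == "__main__":
    combined = FIXED_SGA + REDO_BUFFERS
    print(combined)                                           # 16771240
    print(granules_needed(FIXED_SGA, REDO_BUFFERS, GRANULE))  # 1
    print(GRANULE - combined)                                 # 5976
```

The fixed SGA and the redo buffers together fill a single 16MB granule almost exactly, leaving only 5,976 bytes that this view does not account for.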


The log buffer can also be read from:

SQL> select TOTAL_BUFS_KCRFA from x$kcrfstrand;

or from the lgwr log file (not the same instance):

ksfqxc:ctx=0x79455cf0 flags=0x30000000 dev=0x0 lbufsiz=1048576 bp=0x6497928

As a matter of fact, 10 or 16MB are huge values for a log buffer. Given the LGWR write rules (see the first post), LGWR will only begin performing background writes when the 1MB threshold is reached if no user transaction commits in the meantime. “log file parallel write” waits are bound to appear.

This circumstance often appears in PL/SQL which contains COMMIT WORK: PL/SQL actually does commit, but does not wait for it. In other words, a COMMIT WAIT within PL/SQL is actually a COMMIT NOWAIT, and the wait is only performed when the PL/SQL block or procedure exits to the calling program. This is where _LOG_IO_SIZE shows its usefulness: _LOG_IO_SIZE is a divisor of the log buffer size, and will trigger the LGWR to write whenever the size of pending redo entries exceeds the log buffer size divided by _LOG_IO_SIZE. The LGWR write threshold defaults in 9i to 1MB or 1/3 of the log buffer, whichever is less. kcrfswth from “oradebug call kcrfw_dump_interesting_data” is the threshold: it is expressed in OS blocks, thus the current _LOG_IO_SIZE can be derived from kcrfswth.

LEBSZ from x$kccle is the operating system block size (512 bytes here).

SQL> oradebug setospid 3868;

Oracle pid: 11, Unix process pid: 3868, image: oracle@mymachine (LGWR)

SQL> oradebug unlimit

Statement processed.

SQL> oradebug call kcrfw_dump_interesting_data

SQL> oradebug tracefile_name


SQL> !grep kcrfswth /oracle/10.2.0/admin/MYSID/bdump/eptdb1_lgwr_3868.trc

kcrfswth = 2048

SQL> select max(LEBSZ) from x$kccle;




SQL> select value from v$parameter where name = 'log_buffer';




SQL> select 14289920/512/2048 from dual;




The threshold is thus about 1/13.6 of the log buffer size (2048 blocks of 512 bytes, i.e. 1MB here).
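
The derivation can be replayed outside SQL*Plus. The Python sketch below only redoes the arithmetic with the figures quoted above (kcrfswth = 2048, a 512-byte OS block, log_buffer = 14289920); the function names are hypothetical.

```python
def threshold_bytes(kcrfswth_blocks: int, os_block_size: int) -> int:
    """LGWR write threshold in bytes: kcrfswth is expressed in OS blocks."""
    return kcrfswth_blocks * os_block_size


def implied_divisor(log_buffer_bytes: int, kcrfswth_blocks: int,
                    os_block_size: int) -> float:
    """Ratio of the log buffer size to the write threshold."""
    return log_buffer_bytes / threshold_bytes(kcrfswth_blocks, os_block_size)


if __name__ == "__main__":
    print(threshold_bytes(2048, 512))                        # 1048576
    print(round(implied_divisor(14289920, 2048, 512), 2))    # 13.63
```

This is the same computation as the "select 14289920/512/2048 from dual" above, split into two steps so that the 1MB threshold is visible on its own.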

Log buffer issues in 10g (1/2)

Filed under: Oracle — christianbilien @ 7:19 pm

Traditional log buffer tuning has changed in 10gR2, which causes some problems for highly intensive databases. The first post is a quick recap of log buffer generalities; the second will focus on 10gR2 changes and suggest some log buffer management tips.

Part 1: log buffer generalities

An undersized log buffer is bound to cause sessions to fight for space in the log buffer. Those waits are seen under the “log buffer space” event (configuration wait class in 10g). Oversized log buffers are the cause of less known issues: infrequent commits database wide allow log entries to pile up in the log buffer, causing high volumes of LGWR I/Os at commit time, longer “log file sync” waits experienced by foreground processes and longer “log file parallel write” waits by the LGWR. Remember that four cases (and not 3 as often written: there is one additional known case that is RAC specific) trigger a LGWR write:

  • A transaction commits or rolls back
  • Every 3 seconds
  • When the log buffer is 1/3 full or the total of redo entries is 1MB (default _LOG_IO_SIZE), whichever case occurs first.
  • A less documented case occurs on RAC infrastructure when a dirty block has to be read by another instance: redo entries associated with a block must be flushed prior to the transfer. This event is called write ahead logging.
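
The four triggers listed above can be summarized as a small decision function. This is only an illustrative Python sketch of the rules as stated in this post, not Oracle's actual implementation; all names are hypothetical, and the 1MB / one-third default is the one quoted in the third bullet.

```python
ONE_MB = 1024 * 1024


def default_write_threshold(log_buffer_bytes: int) -> int:
    """1MB or one third of the log buffer, whichever is reached first."""
    return min(ONE_MB, log_buffer_bytes // 3)


def lgwr_should_write(pending_redo_bytes: int,
                      log_buffer_bytes: int,
                      seconds_since_last_write: float,
                      commit_or_rollback: bool = False,
                      rac_block_transfer: bool = False) -> bool:
    """Return True if any of the four triggers listed above applies."""
    if commit_or_rollback:                 # a transaction commits or rolls back
        return True
    if seconds_since_last_write >= 3:      # the 3-second timeout
        return True
    if pending_redo_bytes >= default_write_threshold(log_buffer_bytes):
        return True                        # 1/3 full or 1MB of redo pending
    return rac_block_transfer              # RAC: redo flushed before block transfer


if __name__ == "__main__":
    # With a 14MB buffer, the 1MB rule is hit long before the 1/3 rule.
    print(default_write_threshold(14_737_408))        # 1048576
    print(lgwr_should_write(500_000, 14_737_408, 1))  # False
```

With a small log buffer the one-third rule dominates; once the buffer grows beyond 3MB, the 1MB rule is the only one that matters for background writes.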

To wrap up the case for log buffer tuning, a couple of latches must be reckoned with: the redo writing and redo copy latches, which may cause SLEEPS (sessions waiting) when the LGWR is overactive.

