Christian Bilien’s Oracle performance and tuning blog

April 19, 2007

RAC geographical clusters and 3rd party clusters (HP-UX) (2/3)

Filed under: HP-UX,Oracle,RAC — christianbilien @ 12:38 pm

This is the second post in a series of three on geographical clusters (the first post focused on Solaris RAC, the last will be about cluster features such as fencing, quorum, cluster lock, etc.). It is worth reading the first geo cluster post, which covered Solaris, as many issues common to HP-UX and Solaris will not be rehashed here.

Introduction

Solaris: to put it plainly, I think the two major differences between the generic clusters (Sun Cluster Services / Veritas Cluster Services) and the Oracle CRS clusterware are storage and membership. They are not of the same nature: the storage chapter is essentially a matter of available options, with ASM being the sole mirroring option in CRS-only clusters, which may be a major drawback for large databases. Membership is a different story: to put it simply, Sun says that SCSI3-PR (persistent reservation) closes the door to database corruption in case of node eviction.

HP-UX: The story is not much different on HP-UX, although I would add one more major feature Service Guard has that does not exist on Solaris systems: the concept of a tie-breaking node.

HP-UX RAC storage options

As with Solaris, we have Service Guard (the generic cluster software from the hardware provider), Veritas Cluster Services, and the Oracle clusterware alone.

10gR1: The lack of mirroring for the OCR and voting disks makes a third party cluster almost compulsory (the “almost” is a word of caution because SAN mirroring technologies, based for example on virtualized LUNs, might be used).

10gR2:

  • RAC+ Service Guard (with Service Guard Extension for RAC):
  1. SLVM, the shared volume option of the legacy HP-UX LVM, had been the only option since the days of Oracle Parallel Server. It only allows raw access to data files; hence archivelogs must be on “local” file systems (non-shared volumes). A number of limitations exist, such as having to bring down all nodes but one for Volume Group maintenance.
  2. Since HP-UX 11iv2 and Service Guard 11.16, and after announcing for 2 years a port of the acclaimed TruCluster Advanced File System, HP finally closed an agreement with Veritas for the integration of the Veritas Cluster File System (CFS) and Cluster Volume Manager (CVM) into Service Guard. This is essentially supported by a few extra SG commands (cfsdgadm, cfsmntadm and the like — a usage sketch follows this list), but does not provide much more significant functionality on top of the Veritas Storage Foundation for HA storage options. All of the Oracle files, including OCR and vote, can be put on the CFS. This is because the CFS can be brought up before the CRS starts.
  • RAC + Veritas Cluster Services (VCS): The clusterware is Veritas based, but as you can guess, CFS and CVM are used in this option.
  • RAC without any third party cluster: As on Solaris, the only storage option with the Oracle clusterware alone is ASM, which for large databases conveys the same mirroring problems as found on Solaris.
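
To make the CFS option a little more concrete, here is roughly how the SG wrappers mentioned in item 2 are driven. This is only a sketch: the disk group, volume and mount point names are made up, and the exact options vary with the Serviceguard/CFS release.

# Add the CVM disk group to the cluster (shared-write activation on all nodes)
cfsdgadm add oradatadg all=sw
# Declare a cluster file system on a volume of that disk group, then mount it on every node
cfsmntadm add oradatadg oradatavol /oradata all=rw
cfsmount /oradata

OCR, voting disks, data files and archivelogs can then all live under /oradata, which is brought up by the CFS multi-node packages before the CRS starts.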

Membership, split brain and amnesia

Split brain and tie breaking

Just like Sun Cluster Services and the Oracle clusterware, Service Guard has a tiebreaking cluster lock to avoid a split into two subclusters of equal size, a situation bound to arise when network connectivity is completely lost between some of the nodes. This tiebreaker is storage based (a physical disk for SCS, an LVM volume for SG, a file for the CRS). However, geo clusters made of two physical sites make this tiebreaker configuration impossible: on which site do you put the extra volume, disk or files? A failure of the site chosen for 2 of the 3 volumes will cause the loss of the entire cluster. A third site HAS to be used. Assuming Fibre Channel is not extended to the 3rd site, an iSCSI or NFS configuration (on a supported NFS provider such as NetApp) may be the most convenient option if an IP link exists.
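
As a sketch of what the 3rd site voting disk looks like from the CRS side (the path is made up, and the exact crsctl options depend on the 10gR2 patch level):

# List the voting disks currently in use
crsctl query css votedisk
# Add a third voting disk hosted on the 3rd site (NFS or iSCSI backed);
# on 10gR2 this is typically done with the CRS stack down, hence -force
crsctl add css votedisk /votedisk_3rdsite/vote03 -force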

With the Quorum Server, Service Guard can provide a tie breaking service: this can be done either as a stand-alone Quorum Server or as a Quorum Service package running on a cluster external to the cluster for which quorum service is being provided. Although the 3rd site is still needed, a stand-alone server reachable over IP will resolve the split brain. The CRS is nonetheless still a cause for concern: the voting disk on the 3rd site must be seen by the surviving nodes. The Quorum Server may be a cluster service belonging to a second SG cluster, but its CFS storage is not accessible from the first cluster. At the end of the day, although the Quorum Server is a nice SG feature, it does not solve the CRS voting disk problem: the disk must still be accessed through iSCSI or NFS.
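
For reference, the quorum server is declared in the Serviceguard cluster ASCII configuration file along these lines (the host name and timing values are illustrative):

# Extract of the cluster configuration file applied with cmapplyconf
QS_HOST                 qs-3rdsite
QS_POLLING_INTERVAL     300000000    # microseconds between quorum server health checks
QS_TIMEOUT_EXTENSION    2000000      # optional extra allowance, in microseconds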

I/O Fencing

The same paragraph I wrote for Solaris can be copied and pasted:

Both VCS and SG use SCSI-3 persistent reservations (via ioctls) and I/O fencing to prevent corruption. Each node registers a key (the same key for all of the node’s paths). Once node membership is established, the registration keys of all the nodes that are not part of the cluster are removed by the surviving nodes, which blocks write access to the shared storage from the evicted nodes.

April 17, 2007

RAC geographical clusters and 3rd party clusters (Sun Solaris) (1/3)

Filed under: Oracle,RAC,Solaris — christianbilien @ 9:06 pm

As a word of introduction, a geographical RAC cluster is a RAC in which at least one node is physically located at a remote site, and DB access remains available should one of the sites go down.

I found that many customers wishing to implement a RAC geo cluster get confused by vendors when it comes to the RAC relationships (or should I say dependencies) with third party clusters. I also have the impression that some Oracle sales reps tend to contribute to this confusion by encouraging troubled prospects one way or another, depending on their particular interests with a hardware/cluster 3rd party provider.

Let’s first say that I am only addressing the RAC options here. Should some other applications need clustering services, a third party cluster will be necessary (although some provisions, still in their infancy, exist within the CRS to “clusterize” non-RAC services). I’ll also deliberately not discuss NAS storage, as I never had the opportunity to work with or even consider a RAC/NAS option (Pillar, NetApp, and a few others are trying to get into this market).

This first post is about RAC geo clusters on Solaris. RAC geo clusters on HP-UX will be covered here.

The Solaris compatibility matrix is located at https://metalink.oracle.com/metalink/plsql/f?p=140:1:2790593111784622179

I consider two cluster areas to be strongly impacted by the “third party cluster or not” choice: storage and membership strategy. Some may also argue about private interconnect protection against failure, but since IPMP may be used for the RAC-only option, and although some technical differences exist, I think this is a matter of much less importance than storage and membership.

Storage:

  • 10gR1 was very special as it did not have any Oracle protection for the voting and OCR volumes. This lack of functionality had a big impact on geo clusters as some third party storage clustering was required for voting disk and OCR mirroring.
  • 10gR2: The options may not be the same for the OCR/voting disks, database files, binaries and archivelog files. Although putting archivelog files on a clustered file system saves NFS mounts, binaries and archivelogs may usually be located on their own “local” file system, which may reside on the array but be seen from only one node. The real issues are on one hand the DB files, and on the other hand the OCR and voting disks, which are peculiar because they must be seen when the CRS starts, BEFORE ASM or any Oracle-dependent process can be started.
  • RAC + Sun Cluster (SCS): The storage can either be a Solaris volume manager with raw devices, or QFS; GFS is not supported. ASM may be used but offers little in my opinion compared to a volume manager. ASM used for mirroring suffers from the mirror reconstruction that has to be performed when one of the sites is lost, and from the lack of any feature that copies modified blocks only (the way storage array mirroring does).
  • RAC + Veritas Cluster Services (VCS): the Veritas cluster file system (the cluster version of VxFS), running over the Cluster Volume Manager (the cluster version of VxVM), is certainly a good solution for those averse to raw devices/ASM. All of the Oracle files, including OCR and vote, can be put on the CFS. This is because the CFS can be brought up before the CRS starts.
  • RAC without any third party cluster: ASM has to be used for storage mirroring (a minimal disk group sketch follows this list). This is easier to manage and cheaper, although mirrored disk group reconstruction is a concern when volumes are large. I also like to avoid the coexistence of two clusters (RAC on top of SCS or VCS).
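
The disk group sketch mentioned above: a two-site ASM mirror is simply a normal redundancy disk group with one failure group per site (the name and device paths are illustrative):

SQL> CREATE DISKGROUP data NORMAL REDUNDANCY
  2  FAILGROUP site_a DISK '/dev/rdsk/c2t0d0s4', '/dev/rdsk/c2t1d0s4'
  3  FAILGROUP site_b DISK '/dev/rdsk/c4t0d0s4', '/dev/rdsk/c4t1d0s4';

When a site is lost and later comes back, 10g ASM has to rebuild the affected mirror extents in full, which is the reconstruction cost discussed above.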

Membership, split brain and amnesia

A number of membership issues are addressed differently by SCS/VCS and the CRS/CSS. It is beyond the scope of this post to explain fencing, split brain and amnesia. There are really two worlds here: on one hand, Oracle has a generic clusterware membership system across platforms, which avoids system and storage dependencies; on the other hand, VCS and SCS take advantage of SCSI persistent reservation ioctls. Veritas and Sun both advocate that Oracle’s node eviction strategy may create situations in which a node would be evicted from the cluster but not yet forced to reboot. Other instances may then start recovering the failed instance while it still writes to the shared storage. Oracle says that database corruption is prevented by using the voting disk, the network and the control file to determine when a remote node is down, in different, parallel, independent ways. I am not going to enter the war on one side or the other; let’s just recall the basic strategies:

  • CSS: this process uses both the interconnects and the voting disks to monitor remote nodes. A node must be able to access strictly more than half of the voting disks at any time (this is the reason for the odd number of voting disks), which prevents split brain. The CSS misscount is 30s, which is the network heartbeat time allowance for not responding before eviction (see the crsctl example after this list).
  • Both VCS and SCS use SCSI-3 persistent reservations (via ioctls) and I/O fencing to prevent corruption. Each node registers a key (the same key for all of the node’s paths). Once node membership is established, the registration keys of all the nodes that are not part of the cluster are removed by the surviving nodes, which blocks write access to the shared storage from the evicted nodes.
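
The crsctl example referenced above: on 10gR2 the misscount can be displayed (and, under Oracle support guidance only, changed) as follows.

# Display the CSS misscount, in seconds
crsctl get css misscount
# Changing it is possible, but only do so when Oracle support tells you to
crsctl set css misscount 45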

One last bit: although not a mainstream technology (and its adoption won’t improve now that RDS over InfiniBand is an option on Linux and soon on Solaris), I believe SCS is needed to allow RSM over SCI / Sun Fire Link to be used. The specs show quite an impressive latency of a few microseconds.

April 2, 2007

No age scheduling for Oracle on HP-UX: HPUX_SCHED_NOAGE (2/2)

Filed under: HP-UX,Oracle — christianbilien @ 8:31 pm

The first post on this topic presented the normal HP-UX timeshare scheduling, under which all processes run by default. Its main objective was to give the rules for context switching (threads releasing CPUs and getting a CPU back) before comparing it to the timeshare no age scheduling.

The decaying priority scheme has not been altered since the very early HP-UX workstations, at a time when I assume databases were a remote, if not nonexistent, consideration for the HP-UX designers.

As a thread gets weaker because of a decreasing priority, its probability of being switched out will increase. If this process holds a critical resource (an Oracle latch, not even mentioning the log writer), this context switch will cause resource contention should other threads wait for the same resource.

A context switch (you can see them in the pswch/s column of sar -w) is a rather expensive task: the system scheduler copies all of the switched-out process registers into a private area known as the UAREA. The reverse operation needs to be performed for the incoming thread: its UAREA is copied into the CPU registers. Another problem of course occurs when CPU switches happen on top of thread switches: modified cache lines need to be invalidated and read lines reloaded on the new CPU. I once read that a context switch costs around 10 microseconds (I did not verify it myself), which is far from negligible. The HP-UX internals manual mentions 10,000 CPU cycles, which would indeed translate into 10 microseconds on a 1GHz CPU.
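
Context switch rates are easy to sample, for instance:

sar -w 5 10     # 10 samples, 5 seconds apart; pswch/s is the context switch rate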

So when does a context switch occur? HP-UX considers two cases: forced and voluntary context switches. In an Oracle context, voluntary context switches are most of the time due to disk I/Os. A forced context switch will occur at the end of a timeslice when a thread with an equal or higher priority is runnable, or when a process returns from a system call or trap and a higher-priority thread is runnable.

Oradebug helps diagnose the LGWR forced and voluntary context switches:

SQL> oradebug setospid 2022
Oracle pid: 6, Unix process pid: 2022, image: oracle@mymachine12 (LGWR)
SQL> oradebug unlimit
Statement processed.
SQL> oradebug procstat
Statement processed.
SQL> oradebug tracefile_name
/oracle/product/10.2.0/admin/MYDB/bdump/mydb_lgwr_2022.trc
SQL> !more /oracle/product/10.2.0/admin/MYDB/bdump/mydb_lgwr_2022.trc

Voluntary context switches = 272
Involuntary context switches = 167
….


The sched no age policy was added in HP-UX 11i: it is part of the 178-255 range (the same range as the normal timeshare scheduler user priorities), but a thread running under this scheduler will not experience any priority decay: its priority remains constant over time, although the thread may still be timesliced. Some expected benefits:

  • Less context switch overhead (lower CPU utilization).
  • The CPU-hungry transactions may complete faster on a CPU-busy system (the less hungry transactions may run slower, but as they do not spend much time on the CPUs, it may not be noticeable).
  • It may help resolve an LGWR problem: the log writer inherently experiences a lot of voluntary context switches, which helps it keep a high priority. However, this may not be enough on CPU-busy systems.

As the root user, give the RTSCHED and RTPRIO privileges to the dba group (setprivgrp dba RTSCHED RTPRIO). Create the /etc/privgroup file if it does not exist, and add the following line to it: dba RTSCHED RTPRIO. Finally, set hpux_sched_noage in the spfile/init to a value between 178 and 255. See rtsched(1). A sketch of the whole sequence follows.
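
Putting it together (178 being the strongest priority of the allowed range):

# As root, grant the scheduling privileges to the dba group on the running system
setprivgrp dba RTSCHED RTPRIO
# Make them persistent across reboots
echo "dba RTSCHED RTPRIO" >> /etc/privgroup
# Then set the parameter in the init.ora/spfile, e.g.
#   hpux_sched_noage = 178
# and restart the instance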
