As a word of introduction a geographical RAC cluster is a RAC where at least one node is physically located in a remote location, and DB access is still available should one of the sites go down.
I found that many customers wishing to implement a RAC geo cluster get confused by vendors when it comes to the RAC relationships (or should I say dependencies) with third party clusters. I also have the impression that some Oracle sales rep tend to participate to this confusion by encouraging troubled prospects in one way or in another, depending of their particular interest with a hardware/cluster 3rd party provider.
Let’s first say that I am here just addressing the RAC options. Assuming some other applications need clustering services, a third party cluster will be necessary (although some provisions, still in infancy, exist within the CRS to “clusterize” non-RAC services). I’ll also deliberately not discuss NAS storage as I never had the opportunity to work or even consider a RAC/NAS option (Pillar, NetApp, and a few others are trying to get into this market).
This first post is about RAC geo clusters on Solaris. RAC geo clusters on HP-UX will be covered here.
The Solaris compatibility matrix is located at https://metalink.oracle.com/metalink/plsql/f?p=140:1:2790593111784622179
I consider two cluster areas to be strongly impacted by the “third party cluster or not” choice: storage and membership strategy. Some may also argue about private interconnect protection against failure, but since IPMP may be used for the RAC-only option, and although some technical differences exist, I think that this is a matter of much less importance that storage and membership.
- 10gR1 was very special as it did not have any Oracle protection for the vote and ocr volumes. This lack of functionality had a big impact on geo clusters as some third party storage clustering was required for vote and OCR mirroring.
- 10gR2: The options may not be the same for OCR/vote, data base files, binary and archivelog files. Although archivelog files on a clustered file system saves NFS mounts, binary and archivelogs may usually be located on their own “local” file system which may on the array, but only seen from one node. The real issues are on one hand the DB files, on the other hand the OCR and voting disk which are peculiar because they must be seen when the CRS starts, BEFORE the ASM or any Oracle dependent process can be started.
- RAC+ Sun Cluster (SCS): The storage can either be a Solaris volume manager and raw devices or QFS, GFS is not supported. ASM may be used but offers little in my opinion compared to a volume manager. ASM used for mirroring suffers from the mirroring reconstruction that has to be performed when one of site is lost and the lack of any feature similar to a copy of modified blocks only (the way storage mirroring does).
- RAC + Veritas Cluster Services (VCS): the Veritas cluster file system (the VxFs cluster version), running over the Cluster Volume Manager (the VxVm cluster version) is certainly a good solutions for those adverse to raw device/ASM. All of the Oracle files, including OCR and vote can be put on the CFS. This is because the CFS can be brought up before the CRS starts.
- RAC without any third party cluster: ASM has to be used for storage mirroring. This is easier to manage and cheaper, although mirrored disk group reconstruction is a concern when volumes are high. I also like not to avoid the coexistence of two clusters (RAC on top of SCS or VCS).
Membership, split brain and amnesia
A number of membership issues are addressed differently by SCS/VCS and the CRS/CSS. It is beyond the scope of this post to explain fencing, split brain and amnesia. There are really two worlds here: on one hand, Oracle has a generic clusterware membership system across platforms, which avoids system and storage dependency, on the other hand VCS and SCS take advantage of SCSI persistent reservation ioctls. Veritas and Sun both advocate that Oracle’s node eviction strategy may create situations in which a node would be evicted from the cluster, but not forced to the boot yet. Other instances may then start recovering instances while the failed instance stills write to the shared storage. Oracle says that database corruption is prevented by using the the voting disk, network, and the control file to determine when a remote node is down. This is done in different, parallel, independent ways. I am not going to enter the war on one side or another, let’s just recall the basic strategies:
- CSS: this process uses both the interconnects and the voting disks to monitor remote node. A node must be able to access strictly more than half of the voting disks at any time (this is the reason for the odd number of voting disks), which prevents split brain. The css miscount is 30s, which is the network heartbeat time allowance for not responding before eviction.
- Both VCS and SCS use SCSI3 persistent reservation via ioctl, and I/O fencing to prevent corruption. Each node registers a key (it is the same for all the node paths). Once node membership is established, the registration keys of all the nodes that do not form part of the cluster are removed by the surviving nodes of the cluster. This blocks write access to the shared storage from evicted nodes.
One last bit: although not a mainstream technology (and it won’t improve now that RDS over Infiniband is an option on Linux and soon on Solaris), I believe SCS is needed to allow RSM over SCI/ SunFire Link to be used. The specs show quite an impressive latency of a few micro seconds.