Christian Bilien’s Oracle performance and tuning blog

January 19, 2008

Oracle’s clusterware real time priority oddity

Filed under: Oracle,RAC — christianbilien @ 6:16 pm


The CSS processes use both the interconnect and the voting disks to monitor remote nodes. A node must be able to access strictly more than half of the voting disks at any time (this is the reason for the odd number of voting disks), which prevents split brain. Recall that a split brain occurs when several cluster “islands” are created without being aware of each other. Oracle uses a modified STONITH (Shoot The Other Node In The Head) algorithm: instead of being able to shut down the other nodes directly, a node can merely instruct the other nodes to commit suicide.

This subtlety has far-reaching consequences: the clusterware software on each node MUST be able to coordinate its own actions in any case, without relying upon the other nodes. There is an obvious potential problem when the clusterware processes cannot get the CPU in a timely manner, especially as a lot of the cssd code runs in user mode. This can be overcome by raising the cssd priority, something which was addressed in the 10.2.0.2 release by allowing the css priority to be set in the OCR registry:

crsctl set css priority 4

Meaning of priority 4
On Solaris, /etc/init.d/init.cssd shows the priority boost mechanism that corresponds to the values you can pass to crsctl set css priority:

PRIORITY_BOOST_DISABLED=0
PRIORITY_BOOST_LOW=1
PRIORITY_BOOST_MID=2
PRIORITY_BOOST_HIGH=3
PRIORITY_BOOST_REALTIME=4
PRIORITY_BOOST_RENICE_LOW=-5
PRIORITY_BOOST_RENICE_MID=-13
PRIORITY_BOOST_RENICE_HIGH=-20
PRIORITY_BOOST_RENICE_REALTIME=0
PRIORITY_BOOST_ENABLED=1
PRIORITY_BOOST_DEFAULT=$PRIORITY_BOOST_HIGH

A bit further down the file:

RTGPID='/bin/priocntl -s -c RT -i pgid'

Further down:

  if [ $PRIORITY_BOOST_ENABLED = '1' ]; then
    NODENAME=`$CRSCTL get nodename`     # check to see the error codes
    case $? in
    0)
      # since we got the node name, now try to get the actual
      # boost value for the node
      PRIORITY_BOOST_VALUE=`$CRSCTL get css priority node $NODENAME` => retrieves the PRIORITY_BOOST_VALUE

Still further down:

    case $PRIORITY_BOOST_VALUE in
      $PRIORITY_BOOST_LOW)
        # low priority boost
        $RENICE $PRIORITY_BOOST_RENICE_LOW -p $$
        ;;
      $PRIORITY_BOOST_MID)
        # medium level boost
        $RENICE $PRIORITY_BOOST_RENICE_MID -p $$
        ;;
      $PRIORITY_BOOST_HIGH)
        # highest level normal boost
        $RENICE $PRIORITY_BOOST_RENICE_HIGH -p $$
        ;;
      $PRIORITY_BOOST_REALTIME)
        # realtime boost only should be used on platforms that support this
        $RTGPID $$ => realtime
        ;;

So setting a priority of 4 should set the current shell and its children to RT using priocntl.

… but the cssd daemons do not run under RT

I noticed this oddity in the 10.2.0.2 release on Solaris 8 following a RAC node reboot under (very) heavy CPU and memory load. The trace file shows:

[ CSSD]2008-01-03 18:49:04.418 [11] >WARNING: clssnmDiskPMT: sltscvtimewait timeout (282535)
[ CSSD]2008-01-03 18:49:04.428 [11] >TRACE: clssnmDiskPMT: stale disk (282815 ms) (0//dev/vx/rdsk/racdg/vote_vol)
[ CSSD]2008-01-03 18:49:04.428 [11] >ERROR: clssnmDiskPMT: 1 of 1 voting disk unavailable (0/0/1)

This is a timeout after roughly 282 seconds (282,535 ms) of polling on the voting disk. No I/O errors were reported in /var/adm/messages or by the storage array.

The priority is unset by default:

$crsctl get css priority
Configuration parameter priority is not defined.
 
$crsctl set css priority 4
Configuration parameter priority is now set to 4.

This is written into the ocr registry (from ocrdump):

[…]
[SYSTEM.css.priority]
UB4 (10) : 4
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}
[…]

Look at the priorities:

$/crs/10.2.0/bin # ps -efl | grep cssd | grep -v grep | grep -v fatal
 0 S     root  1505  1257   0  40 20        ?    297        ?   Dec 27 ?           0:00 /bin/sh /etc/init.d/init.cssd oproc
 0 S     root  1509  1257   0  40 20        ?    298        ?   Dec 27 ?           0:00 /bin/sh /etc/init.d/init.cssd oclso
 0 S     root  1510  1257   0  40 15        ?    298        ?   Dec 27 ?           0:00 /bin/sh /etc/init.d/init.cssd daemo
 0 S   oracle  1712  1711   0  40 15        ?   9972        ?   Dec 27 ?          71:52 /crs/10.2.0/bin/ocssd.bin

The NI column shows the nice value, which has been lowered from 20 to 15 for the daemons, but the priority of 40 is still a timeshare (TS) priority, making init.cssd and ocssd.bin less prone to preemption but still exposed to high loads.

oprocd is the only css process to run in real time, whatever the priority setting. Its purpose is to detect system hangs:

ps -efl| grep oprocd
0 S     root 27566 27435   0   0 RT        ?    304       ?   Nov 17 console     9:49 /crs/10.2.0/bin/oprocd.bin run -t 1

The css priorities remained unchanged after stopping and restarting the CRS. Part of the above behavior is described in Metalink note 4127644.8, where it is filed as a bug. I did not test crsctl set css priority on 10.2.0.3; the priorities are identical to what they are on 10.2.0.2 when unset.

December 24, 2007

11g ASM preferred reads for RAC extended clusters

Filed under: Oracle,RAC — christianbilien @ 10:14 am

I expressed this summer my delight over the 11g fast resync option for extended (geographical) RAC clusters, in which the mirrored failure groups have to be on different sites, for example one failure group on array 1 at site 1 and its mirrored copy on array 2 at site 2. The major 10g limitations are twofold in this area:

  • The loss of an array means the mandatory reconstruction of the whole data base.
  • Failgroups have to be reconstructed serially (no more than one rebalance activity per node, per ASM group).

This is really painful as any scheduled shutdown of one array while the other one is still active is analogous to an outage. Imagine what it takes to rebuild 100TB.

Normal redundancy (mirrored) reconstruction is much slower than either the SAN or the storage layout allows (there is room for “research” here – the best I have seen so far is 150GB/hour, that is about 43 MB/s over 64 15,000 rpm disks on a high-end DMX storage array; the FC ports were 2x 4Gb/s).

The second problem with RAC 10g extended clusters is linked to the way the ASM storage allocator creates primary and mirrored extents. In a disk group made of two failgroups, there is no notion of a “primary failgroup” and a “secondary failgroup”: each failgroup contains an equal number of primary and mirrored copies. Let’s take the example of a diskgroup made of two failgroups, each on a different storage array. For each of the storage arrays, here is the number of primary and mirrored copies for a given group (it is made of just one datafile here):

We’ll use the X$KFFXP table. The columns of interest here are:

GROUP_KFFXP : disk group number
LXN_KFFXP: 0: primary extent/ 1: mirrored extent
DISK_KFFXP: disk where the extent has been created
NUMBER_KFFXP : ASM file number

select disk_kffxp, lxn_kffxp, count(*) from x$kffxp x, v$asm_file v
where group_kffxp=1 and x.number_kffxp=v.file_number and xnum_kffxp!=65534
group by disk_kffxp, lxn_kffxp;

 

DISK_KFFXP  LXN_KFFXP   COUNT(*)
----------  ---------   --------
         1          0       1050
         1          1       1050
         2          0       1050
         2          1       1050
The number of primary and mirrored extents is balanced on each of the disks of group 1. The logic is that while writes hit both types of extents equally, reads only access the primary extents. It makes sense to spread primary extents over the two failure groups to maximize throughput when the two failure groups are on an equal footing: same throughput and same I/O response time.
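For completeness, here is a hedged variant of the query above that groups by failure group rather than by disk (the join to v$asm_disk and the group number 1 are assumptions matching the example above):

select d.failgroup, x.lxn_kffxp, count(*)
from x$kffxp x, v$asm_disk d
where x.group_kffxp = 1
and x.group_kffxp = d.group_number
and x.disk_kffxp = d.disk_number
and x.lxn_kffxp in (0, 1)          -- 0: primary, 1: mirrored
group by d.failgroup, x.lxn_kffxp
order by d.failgroup, x.lxn_kffxp;

With the layout described above, each failure group should report the same number of primary and mirrored extents.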

But what happens when the arrays are in different locations? As each instance reads from both arrays, the distance factor introduces delays on half of the reads. The delay is a function of the distance, but we also need to take into account that the number of storage hops the reads must traverse increases when remote sites are to be reached. Even worse, the inter-site links are prone to congestion as they must accommodate oversubscription and link contention.

Whenever possible, clever use of RAC data locality will reduce the RAC interconnect burden and the reliance on cache fusion. In 11g it can also reduce the inter-site link throughput requirements and possibly speed up reads by reading data locally, assuming the local array is not itself overloaded.

The preferred read is an instance-wide initialization parameter:

ASM_PREFERRED_READ_FAILURE_GROUPS = diskgroup_name.failure_group_name,…
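As a minimal sketch (the diskgroup and failgroup names below are made up), each ASM instance would be pointed to the failure group hosted on its local array:

alter system set asm_preferred_read_failure_groups = 'DATA.SITE1_FG' sid='+ASM1';
alter system set asm_preferred_read_failure_groups = 'DATA.SITE2_FG' sid='+ASM2';

Each instance then serves its reads from the local failure group, while writes keep going to both failure groups.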

ASM extent layout reference: Take a look at Luca Canali’s work: a peek inside Oracle ASM metadata

— Updated: check the comment below for Luca’s updated links – very interesting indeed —

December 13, 2007

One of Pavlov’s dogs (2/2)

Filed under: Oracle,RAC,Solaris — christianbilien @ 9:15 pm

I didn’t get much luck with the Pavlov’s dogs challenge: not even a single attempt at an explanation. Maybe the problem was too weird (or, more embarrassingly, it did not interest anyone!).

A quick reminder about the challenge: netstat shows almost no activity on an otherwise loaded interconnect (500Mb/s to 1Gb/s inbound and similar values outbound, as seen on the Infiniband switches and as calculated as the product of ‘PX remote messages recv’d’ / ‘PX remote messages sent’ and parallel_execution_message_size).

Well, anyway, here is what I think is the answer: the key piece of information I gave was that the clusterware was running RDS over Infiniband. Infiniband HCAs have an inherent advantage over standard Ethernet network interfaces: they embed RDMA, which means that transfers are handled without interrupting the CPUs. That’s because the sending node reads and writes the receiving node’s user-space memory directly, without going through the usual I/O channel. TCP/IP NICs, on the other hand, cause a number of interrupts the CPUs have to process because TCP segments have to be reconstructed while other threads are running.

The most likely cause of the netstat blindness is just that it cannot see the packets because the CPUs are unaware of them.

To quote the Wikipedia Pavlov’s dogs article, “the phrase “Pavlov’s dog” is often used to describe someone who merely reacts to a situation rather than use critical thinking”. That’s exactly what I thought of myself when I was trying to put the blame on the setup instead of thinking twice about the “obvious” way of measuring a network packet throughput.

December 9, 2007

One of Pavlov’s dogs (1/2)

Filed under: Oracle,RAC,Solaris — christianbilien @ 7:44 pm

That’s what I thought of myself after spending an hour trying to figure out what I had done wrong in my setup.

So here is a little challenge I derived from this experience:

1. I set up a 5-node 10gR2 RAC over a mix of Xeon and Opteron 4-core servers. The interconnect is RDS over Infiniband HCAs. The setup was made on OpenSolaris. I know, RAC is not supported on OpenSolaris, but we were advised by Sun Microsystems that the Infiniband layer was significantly faster on OpenSolaris than it was on Solaris 10. Some ftp tests indeed showed that even IPoIB was 20% faster on OpenSolaris than on Solaris 10. So I had to tweak the Oracle installer and could not use any of the Oracle GUIs, but I got it working.

2. select /*+ full(c) full(o) */ * from soe.product_descriptions c,soe.product_information o where c.product_id=o.product_id and c.product_id>300000;

was run simultaneously 10 times on each instance (this is incidentally a data base generated by Swingbench). The DOP varies from 10 to 40, but it does not make much difference as far as our challenge is concerned.

I then plotted the ‘PX remote messages recv’d’ and ‘PX remote messages sent’.

Parallel_execution_message_size=8192

I could get the following peak throughput (5s interval) by multiplying parallel_execution_message_size by the PX remote messages figures:

PX remote messages recv’d Mb/s    PX remote messages sent Mb/s
                        792,82                          1077,22
                       1035,78                           756,21
                        697,01                           885,18
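For reference, here is a hedged sketch of how such figures can be derived (the statistic names come from gv$sysstat; the 5 second sampling and the unit conversion are my own arithmetic):

-- cumulative PX message counts: take two samples 5 seconds apart, subtract,
-- multiply by parallel_execution_message_size (8192 bytes here) and by 8 bits,
-- then divide by the interval to get Mb/s
select inst_id, name, value
from gv$sysstat
where name in ('PX remote messages recv''d', 'PX remote messages sent')
order by inst_id, name;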

I am not taking into account the GES and the GCS messages, nor did I count the cache fusion blocks. Both of them were small anyway. The weird thing came when I tried to measure the corresponding HCA Input / Output packet number from netstat:

Input Packets/s    Output Packets/s
             69                  73
             39                  35
             58                   6

Almost no traffic on the interconnect (the HCA MTU is 2044 bytes)!

Let’s check that the interconnect is running over the intended HCA:

netstat -ni

Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
lo0 8232 127.0.0.0 127.0.0.1 32364353 0 32364353 0 0
ibd0 2044 10.0.0.0 10.0.0.1 27588368 0 26782527 0 0
nge1 1500 10.x.x.x 10.x.x.x 640710585 0 363462595 0 0

SQL> oradebug setmypid
SQL> oradebug ipc
SQL> oradebug tracefile_name

[…]
SSKGXPT 0x64362c8 flags SSKGXPT_READPENDING socket no 8 IP 10.0.0.1
[…]

Just to be sure:

SQL> select indx,picked_ksxpia,ip_ksxpia from x$ksxpia;
INDX PICKED_KSXPIA IP_KSXPIA
————————————————————-
0 CI 10.0.0.1

Question: what could possibly justify the discrepancy between the netstat output and the values collected from v$px_sesstat?

Hint: it is not a bug.

 

September 14, 2007

Strategies for parallelized queries across RAC instances (part 2/2)

Filed under: Oracle,RAC — christianbilien @ 5:52 am

If you have not done so, it would be beneficial to first go through the first post “Strategies for parallelized queries across RAC instances” to get an understanding of the concepts and the challenge. There I described an asymmetric strategy for handling RAC inter-instance parallelized queries while still being able to force both the slaves and the coordinator to reside on just one node. The asymmetric configuration is a connection and service based scheme which allows users

  • connected to node1 to execute both the coordinator and the slaves on node1, and prevent the slaves from spilling over to node2
  • connected to node2 but unable to issue an alter session (because the code belongs to an external provider) to load balance their queries across the nodes
  • connected to node1 and able to issue an alter session to load balance.

Another way of doing things is to load balance the default service to which the applications connect, and to use dedicated services to restrict access to node1 or to node2. This is a symmetric configuration.

Load balancing:

Tnsnames and service names are the same as in the asymmetric configuration.

bothnodes =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = myvip1)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = myvip2)(PORT = 1521))
)
(CONNECT_DATA =
(SERVICE_NAME = bothnodestaf)
(FAILOVER_MODE =
(TYPE = SELECT)
(METHOD = BASIC)
(RETRIES = 180)
(DELAY = 5)
)
(SERVER = DEDICATED)
)
)

execute dbms_service.modify_service (service_name => 'bothnodestaf' -
, aq_ha_notifications => true -
, goal => dbms_service.goal_throughput -
, failover_method => dbms_service.failover_method_basic -
, failover_type => dbms_service.failover_type_select -
, failover_retries => 180 -
, failover_delay => 5 -
, clb_goal => dbms_service.clb_goal_short);

Both spfiles were changed:

On node1, spfileI1.ora contains:

INSTANCE_GROUPS ='IG1','IGCLUSTER'

PARALLEL_INSTANCE_GROUP='IGCLUSTER'

On node2, spfileI2.ora contains:

INSTANCE_GROUPS ='IG2','IGCLUSTER'

PARALLEL_INSTANCE_GROUP='IGCLUSTER'


select inst_id, name, value from gv$parameter where name like '%instance%group%';

   INST_ID NAME                      VALUE
---------- ------------------------- -----------------
         1 instance_groups           IG1, IGCLUSTER
         1 parallel_instance_group   IGCLUSTER
         2 instance_groups           IG2, IGCLUSTER
         2 parallel_instance_group   IGCLUSTER

By default, both nodes will behave in the same way. As PARALLEL_INSTANCE_GROUP matches one of the INSTANCE_GROUPS on both nodes, load balancing will work by default whichever node the application connects to.

On node 1:

$sqlplus mylogin/mypass@bothnodes

 

< no alter session set parallel_instance_group= >

select /*+ full(orders_part) */ count(*) from orders_part;

QCSID Inst Group Set PROGRAM

———- ———- ———- ———- ————————————————
1050 1 1 1 oracle@node1 (P010)
1050 1 1 1 oracle@node1 (P011)
1050 1 1 1 oracle@node1 (P009)
1050 1 1 1 oracle@node1 (P008)
1050 2 1 1 oracle@node2 (P009)
1050 2 1 1 oracle@node2 (P011)
1050 2 1 1 oracle@node2 (P010)
1050 2 1 1 oracle@node2 (P008)
1050 1 sqlplus@node1 (TNS V1-V3)

On node 2:

$ sqlplus mylogin/mypass@bothnodes

< no alter session set parallel_instance_group= >

select /*+ full(orders_part) */ count(*) from orders_part;

QCSID Inst Group Set PROGRAM
———- ———- ———- ———- ————————————————
997 1 1 1 oracle@node1 (P010)
997 1 1 1 oracle@node1 (P011)
997 1 1 1 oracle@node1 (P009)
997 1 1 1 oracle@node1 (P008)
997 2 1 1 oracle@node2 (P009)
997 2 1 1 oracle@node2 (P011)
997 2 1 1 oracle@node2 (P010)
997 2 1 1 oracle@node2 (P008)
997 2 sqlplus@node2 (TNS V1-V3)

You may notice a subtle difference: the coordinator runs as expected on the node the service connects to.

Node restriction:

The tnsnames.ora and the service definition are left unchanged from the tests performed in the previous post: nodetaf1 is set to node1='PREFERRED' and node2='NONE'.

On node1:

Tnsnames.ora: (unchanged)

node1-only =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = node1)(PORT = 1521))
)
(LOAD_BALANCING=NO)
(CONNECT_DATA =
(SERVICE_NAME= nodetaf1)
(SID = EMDWH1)
(SERVER = DEDICATED)
)
)

$ sqlplus mylogin/mypass@node1-only

alter session set parallel_instance_group='IG1';

select /*+ full(orders_part) */ count(*) from orders_part;

QCSID Inst Group Set PROGRAM
———- ———- ———- ———- ————————————————
984 1 1 1 oracle@node1 (P000)
984 1 1 1 oracle@node1 (P004)
984 1 1 1 oracle@node1 (P002)
984 1 1 1 oracle@node1 (P005)
984 1 1 1 oracle@node1 (P006)
984 1 1 1 oracle@node1 (P001)
984 1 1 1 oracle@node1 (P007)
984 1 1 1 oracle@node1 (P003)
984 1 sqlplus@node1 (TNS V1-V3)

On node 2:

The tnsnames.ora and the service definition are symmetric to what they are on node1.

$ sqlplus mylogin/mypass@node2-only

alter session set parallel_instance_group='IG2';

select /*+ full(orders_part) */ count(*) from orders_part;

QCSID Inst Group Set PROGRAM

———- ———- ———- ———- ————————————————
952 1 1 1 oracle@node2 (P000)
952 1 1 1 oracle@node2 (P004)
952 1 1 1 oracle@node2 (P002)
952 1 1 1 oracle@node2 (P005)
952 1 1 1 oracle@node2 (P006)
952 1 1 1 oracle@node2 (P001)
952 1 1 1 oracle@node2 (P007)
952 1 1 1 oracle@node2 (P003)
952 1 sqlplus@node2 (TNS V1-V3)

September 12, 2007

Strategies for RAC inter-instance parallelized queries (part 1/2)

Filed under: Oracle,RAC — christianbilien @ 8:32 pm

I recently had to sit down and think about how I would spread workloads across nodes in a RAC data warehouse configuration. The challenge was quite interesting; here were the specifications:

  • Being able to fire PQ slaves on different instances in order to use all available resources for most of the aggregation queries.
  • Based on a tnsname connection, being able to restrict PQ access to one node because external tables may be used. Slaves running on the wrong node would fail:

insert /*+ append parallel(my_table,4) */ into my_table
ERROR at line 1:
ORA-12801: error signaled in parallel query server P001, instance node-db04:INST2 (2)
ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
KUP-04040: file myfile408.12.myfile_P22.20070907.0.txt in MYDB_INBOUND_DIR not found
ORA-06512: at “SYS.ORACLE_LOADER”, line 19

I came up with two basic configurations based on combinations of services, INSTANCE_GROUPS and PARALLEL_INSTANCE_GROUP. This post is about an asymmetric configuration; the second will deal with a symmetric configuration.

Parallel executions are service aware: the slaves inherit the data base service from the coordinator. But the coordinator may execute slaves on any instance of the cluster, which means that the service localization (‘PREFERRED’ instance) will create the coordinator on the designated instance, but the slaves are not bound by the service configuration. So I rewrote “being able to restrict PQ access to one node” into “being able to restrict PQ access and the coordinator to one node”.

An instance belongs to the instance groups it declares in its init.ora/spfile. Each instance may have several instance groups declared: INSTANCE_GROUPS ='IG1','IG2' (be careful not to set 'IG1,IG2'). PARALLEL_INSTANCE_GROUP is another parameter which can be configured either at the system or at the session level. Sessions may fire PQ slaves on instances for which the parallel instance group matches an instance group the instance belongs to, as the sketch below illustrates.
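A hedged illustration of the two levels (the instance and group names are just examples): INSTANCE_GROUPS is static and must go in the spfile, while PARALLEL_INSTANCE_GROUP can be changed on the fly:

alter system set instance_groups='IG1','IG2' scope=spfile sid='I1';  -- needs an instance restart
alter system set parallel_instance_group='IG1' sid='I1';
alter session set parallel_instance_group='IG2';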

Let’s consider a two-node configuration: node1 and node2. I1 and I2 are instances respectively running on node1 and node2.
The Oracle Data Warehousing guide states that a SELECT statement can be parallelized for objects created with a PARALLEL declaration only if the query involves either a full table scan or an inter-partition index range scan. I’ll force a full table scan in my test case to have the CBO pick a parallel plan:

select /*+ full(orders_part)*/ count(*) from orders_part;

The configuration described below will allow users :

  • connected to node1 to execute both the coordinator and the slaves on node1, and prevent the slaves from spilling over to node2
  • connected to node2 but unable to issue an alter session (because the code belongs to an external provider) to load balance their queries across the nodes
  • connected to node1 and able to issue an alter session to load balance.

In a nutshell, users connecting to node1 will restrict their slave scope to node1, while users connecting to node2 will be allowed to load balance their slaves over all the nodes.

For the test I set parallel_min_servers=0, so I could see the slaves start whenever Oracle decided to fire them.

Asymmetric INSTANCE_GROUPS configuration:

On node1, spfileI1.ora looks like:

INSTANCE_GROUPS ='IG1','IG2'
PARALLEL_INSTANCE_GROUP='IG1'

On node2, spfileI2.ora contains:

INSTANCE_GROUPS ='IG2'
PARALLEL_INSTANCE_GROUP='IG2'

select inst_id, name, value from gv$parameter where name like '%instance%group%';

   INST_ID NAME                      VALUE
---------- ------------------------- -----------------
         1 instance_groups           IG1, IG2
         1 parallel_instance_group   IG1
         2 instance_groups           IG2
         2 parallel_instance_group   IG2

Single node access

I configured via dbca a service nodetaf1 for which node1/instance I1 was the preferred node and node2 was set to “not used” (I do not want to connect on one node and execute on another, thereby unnecessarily clogging the interconnect), and did the same for node2/instance I2.

SQL> select inst_id,name from gv$active_services where name like '%taf%';

   INST_ID NAME
---------- ----------------
         2 nodetaf2
         1 nodetaf1

The related TNSNAMES entry for node1-only access looks like:

Node1-only =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = node1)(PORT = 1521))

)

(LOAD_BALANCING=NO)
(CONNECT_DATA =
(SERVICE_NAME= nodetaf1)
(SID = EMDWH1)
(SERVER = DEDICATED)

)
)

Note that the host name is not the VIP address (because there is no point in switching nodes should node1 fail).

Test results:

$ sqlplus mylogin/mypass@node1-only

< no alter session set parallel_instance_group= >
select /*+ full(orders_part) */ count(*) from orders_part;

select qcsid, p.inst_id “Inst”, p.server_group “Group”, p.server_set “Set”,s.program
from gv$px_session p,gv$session s
where s.sid=p.sid and qcsid=&1
order by qcinst_id , p.inst_id,server_group,server_set

QCSID Inst Group Set PROGRAM
———- ———- ———- ———- ————————————————

938 1 1 1 oracle@node1 (P000)
938 1 1 1 oracle@node1 (P004)
938 1 1 1 oracle@node1 (P002)
938 1 1 1 oracle@node1 (P005)
938 1 1 1 oracle@node1 (P006)
938 1 1 1 oracle@node1 (P001)
938 1 1 1 oracle@node1 (P007)
938 1 1 1 oracle@node1 (P003)
938 1 sqlplus@node1 (TNS V1-V3)

The coordinator and the slaves stay on node1.

Dual node access

I then added a service aimed at firing slave executions on both nodes. The ‘bothnodestaf’ service was added using dbca and then modified to give it the “goal_throughput” and “clb_goal_short” load balancing advisories: according to the documentation, load balancing is based on the rate at which work is completed in the service plus the available bandwidth. I’ll dig into that one day to get a better understanding of the available LB strategies.

execute dbms_service.modify_service (service_name => 'bothnodestaf' -
, aq_ha_notifications => true -
, goal => dbms_service.goal_throughput -
, failover_method => dbms_service.failover_method_basic -
, failover_type => dbms_service.failover_type_select -
, failover_retries => 180 -
, failover_delay => 5 -
, clb_goal => dbms_service.clb_goal_short);

In the tnsnames.ora:

bothnodes =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = myvip1)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = myvip2)(PORT = 1521))

)

(CONNECT_DATA =
(SERVICE_NAME = bothnodestaf)
(FAILOVER_MODE =
(TYPE = SELECT)
(METHOD = BASIC)
(RETRIES = 180)
(DELAY = 5)

)
(SERVER = DEDICATED)
)
)

Note that LOAD_BALANCE is left unset (i.e. OFF) in the tnsnames.ora to allow server-side connection balancing.

On node1:

$ sqlplus mylogin/mypass@bothnodes

alter session set parallel_instance_group='IG2';
select /*+ full(orders_part) */ count(*) from orders_part;

QCSID Inst Group Set PROGRAM

———- ———- ———- ———- ————————————————

936 1 1 1 oracle@node1 (P002)
936 1 1 1 oracle@node1 (P003)
936 1 1 1 oracle@node1 (P001)
936 1 1 1 oracle@node1 (P000)
936 1 1 1 oracle@node2 (P037)
936 1 1 1 oracle@node2 (P027)
936 2 1 1 oracle@node1 (O001)
936 2 1 1 oracle@node2 (P016)
936 2 1 1 oracle@node2 (P019)
936 2 1 1 oracle@node2 (P017)
936 2 1 1 oracle@node2 (P018)
936 1 sqlplus@node1 (TNS V1-V3)

The slaves started on both nodes.

On node2:

$ sqlplus mylogin/mypass@bothnodes
< no alter session set parallel_instance_group= >
select /*+ full(orders_part) */ count(*) from orders_part;

QCSID Inst Group Set PROGRAM

———- ——————————————————————————-

952 1 1 1 oracle@node1 (P002)
952 1 1 1 oracle@node1 (P003)
952 1 1 1 oracle@node1 (P001)
952 1 1 1 oracle@node1 (P000)
952 1 1 1 oracle@node2 (P037)
952 1 1 1 oracle@node2 (P027)
952 2 1 1 oracle@node1 (O001)
952 2 1 1 oracle@node2 (P016)
952 2 1 1 oracle@node2 (P019)
952 2 1 1 oracle@node2 (P017)
952 2 1 1 oracle@node2 (P018)
952 1 sqlplus@node1 (TNS V1-V3)

Again, connecting to a node2 service allows load balancing between the nodes.

 

 

August 31, 2007

Migrating large, redo intensive databases to RAC using Data Guard

Filed under: Oracle,RAC — christianbilien @ 7:21 am

Here are some thoughts and a couple of issues raised by the migration of a 10TB data base. This 10gR2 data warehouse had to be migrated from a non-RAC, Solaris UFS file system environment towards a 3-node ASM RAC. What makes this move a not-so-easy endeavor is not primarily its size, but its redo generation rate of 1GB every minute, 24 hours a day, 7 days a week, even though some loading is done in NOLOGGING mode. Storage replication is not an option here because of distance and budget. So I went with Data Guard, and found out a couple of things worth sharing:

Data base duplication

The key issue I tried to work around was the estimated 30 hours of full backup/restore, coupled with the 1.8TB of redo generated in this time frame that I would have to transfer and recover on the standby data base. The archived logs were furthermore backed up and purged every 3 hours to make room.

I found this little trick to trigger the Remote File Server (RFS) to start writing on the target standby platform even BEFORE the backup process starts. Here are the steps:

  1. After generating a standby control file, I simply mounted a dummy standby database on the target server (see the sketch after the alert file excerpt below). That’s enough for the RFS to start the log file copy.
  2. I ran the rman backup on the source
  3. When the backup was ready, I had to bounce the target DB into nomount mode to allow an rman “duplicate for standby”.

    The alert file then shows the RFS restart attempts every 5 minutes:

    PING[ARC0]: Heartbeat failed to connect to standby ‘RACSTANDBY’. Error is 12528.

  4. This is where it gets interesting, because the “duplicate for standby” mounts the standby data base, allowing the RFS to restart. The rman memory script shows:

{
set until scn 19415505108;
restore clone standby controlfile;

sql clone 'alter database mount standby database';
}

and into the alert file on the target system:

Redo Shipping Client Connected as PUBLIC
— Connected User is Valid
RFS[1]: Assigned to RFS process 27224
RFS[1]: Identified database type as ‘physical standby’
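For reference, here is a minimal sketch of step 1 (the control file name is hypothetical):

-- on the primary
alter database create standby controlfile as '/tmp/standby.ctl';

-- on the target server, once the standby control file is in place
startup nomount
alter database mount standby database;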

Caveats:

  1. Allow enough space on the target for the copied log files so that the restoration and the recovery can complete.
  2. A log archive gap was created in step 3: rman mounted the standby and started copying, but some archive log files had to be copied manually (although the unchanged FAL configuration worked fine after the Data Guard setup).

 

ASM restoration bug

I used Metalink note 273015.1 as a starting point. You have to add to the primary (non-RAC data base) 2 new logfile threads and 2 additional undo tablespaces for the second and third RAC instances, and run catclust.sql. What this note does not say is the

ORA-00600: internal error code, arguments: [kjrfr7], [], [], [], [], [], [], []

you get in the ASM instance when you attempt to restore the data base (usually after several hours – grrr). This came down to the unpublished bug 4750469, fixed in 10.2.0.3, which only came out on August 13th on Solaris x86-64, the platform I was working on. The bug can occur when Oracle does a sanity check of a resource whose memory has been released to the shared pool. I worked around the bug by “unclustering” the DB (cluster_database=false) both at the ASM and at the DB level during the restoration phase, and then setting cluster_database back to true.
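For the record, a sketch of the workaround (cluster_database is static, hence scope=spfile and an instance restart):

alter system set cluster_database=false scope=spfile sid='*';
-- restart, restore and recover, then revert:
alter system set cluster_database=true scope=spfile sid='*';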


July 17, 2007

Hooray for the 11g ASM: “Fast” Mirror Resync at last !

Filed under: Oracle,RAC — christianbilien @ 9:05 pm

Forgive me if I sound over-enthusiastic: I already mentioned in RAC geo clusters on HP-UX and in RAC geo clusters on Solaris how annoying the absence of an incremental mirror rebuild was to ASM based RAC geo clusters. The fact that a full rebuild of the mirror structure is needed writes off this architecture for data bases over a few hundred GB. Here is a very common situation to me: you have a two-site RAC/ASM geo cluster made of one storage array and one node on each site. Site A has to be taken down for cooling maintenance (sites are actually taken down at least 5 times a year), but the surviving site B must be kept up to preserve the weekend operations. The ASM instances on the site being shut down are gracefully taken down, but all of the mirrored ASM failure groups have to be fully rebuilt when they are brought back up. The more databases, the more terabytes have to move around.

An actual outage is a more dramatic opportunity for trouble: assuming the clusters transfer the application loads to your backup site, you nonetheless have to wait for the outage cause to be fixed plus the ASM rebuild time before being back to normal. Just pray you don’t have a second outage on the backup site while the first site is busy rebuilding its ASM failure groups.

The consequences of this single ASM weakness reach as far as having to use a third party cluster on a SAN just to be able to make use of VxVM Dirty Region Logging for large databases. Having to make a “strategic” decision (3rd party cluster or not) on such a far-reaching choice solely based on this is to me a major annoyance.

There are a few promising words in the Oracle 11g new features area posted on the OTN sites about a “Fast Mirror Resync” in the Automatic Storage Management New Features Overview, which should be the long awaited DRL-like ASM rebuild feature. ASM can now resynchronize only the extents that have been modified during the outage. It also looks like a disk group has a DISK_REPAIR_TIME attribute that defines a window in which the failure can be repaired and the mirrored failure group storage array brought back online, after which an “ALTER DISKGROUP … ONLINE DISK” starts the process of resynchronization. What happens if you exceed DISK_REPAIR_TIME is not said.
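If the overview is accurate, the syntax should look something like the following (the disk group and failgroup names, and the 12h value, are made up):

alter diskgroup data set attribute 'disk_repair_time' = '12h';
-- once the failed array is back online:
alter diskgroup data online disks in failgroup site2_fg;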

April 19, 2007

RAC geographical clusters and 3rd party clusters (HP-UX) (2/3)

Filed under: HP-UX,Oracle,RAC — christianbilien @ 12:38 pm

This is the second post in a series of 3 related to geographical clusters (the first post was focusing on Solaris RAC, the last will be about cluster features such as fencing, quorum, cluster lock, etc.). It would be beneficial to read the first geo cluster post which aimed at Solaris as many issues, common to HP-UX and Solaris, will not be rehashed.

Introduction

Solaris: to put it plainly, I think that the two major differences between the generic clusters (Sun Cluster Services / Veritas Cluster Services) and the Oracle CRS clusterware are storage and membership. Their nature is however not the same: the storage chapter is essentially about available options, with ASM being the sole mirroring option in CRS-only clusters, which may be a major drawback for large databases. Membership is a different story: to put it simply, Sun says that SCSI3-PR (persistent reservation) closes the door to data base corruptions in case of node eviction.

HP-UX: The story is not much different on HP-UX, although I would add one more major feature Service Guard has that does not exist on Solaris systems: the concept of a tie-breaking node.

HP-UX RAC storage options

As with Solaris, we have Service Guard, the generic cluster software from the hardware provider, Veritas Cluster Services and the Oracle clusterware alone.

10gR1: The lack of mirroring for the OCR and voting disks makes a third party cluster almost compulsory (the “almost” is a word of caution because SAN mirroring technologies, based for example on virtualized LUNs, might be used).

10gR2:

  • RAC+ Service Guard (with Service Guard Extension for RAC):
  1. SLVM, the shared volume option of the legacy HP-UX LVM, had been the only option since the days of Oracle Parallel Server. It only allows raw access to data files; hence archivelogs must be on “local” file systems (non-shared volumes). A number of limitations exist, such as having to bring down all nodes but one for Volume Group maintenance.
  2. Since HP-UX 11iv2 and Service Guard 11.16, and after announcing for 2 years a port of the acclaimed TruCluster Advanced File System, HP finally closed an agreement with Veritas for the integration of the Veritas Cluster File System (CFS) and Cluster Volume Manager (CVM) into Service Guard. This is essentially supported by some extra SG commands (such as cfsdgadm and cfsmntadm), but does not provide much more significant functionality on top of the Veritas Storage Foundation for HA storage options. All of the Oracle files, including the OCR and the voting disks, can be put on the CFS. This is because the CFS can be brought up before the CRS starts.
  • RAC + Veritas Cluster Services (VCS): The clusterware is Veritas based, but as you can guess, CFS and CVM are used in this option.
  • RAC without any third party cluster: As on Solaris, the only storage option with the Oracle clusterware alone is ASM, which carries for large data bases the same mirroring problems as found on Solaris.

Membership, split brain and amnesia

Split brain and tie breaking

Just like Sun Cluster Services and the Oracle clusterware, Service Guard has a tiebreaker cluster lock to avoid a split into two subclusters of equal size, a situation bound to arise when network connectivity is completely lost between some of the nodes. This tiebreaker is storage based (either a physical disk for SCS, an LVM volume for SG or a file for the CRS). However, geo clusters made of two physical sites make this tiebreaker configuration impossible: on which site do you put the extra volume, disk or files? A failure of the site chosen to host 2 of the 3 volumes will cause the loss of the entire cluster. A third site HAS to be used. Assuming Fibre Channel is not extended to the 3rd site, an iSCSI or NFS configuration (on a supported NFS provider such as NetApp) may be the most convenient option if an IP link exists.

With the Quorum Server, Service Guard can provide a tie-breaking service: this can be done either as a stand-alone Quorum Server or as a Quorum Service package running on a cluster which is external to the cluster for which the quorum service is being provided. Although the 3rd site is still needed, a stand-alone server reachable via IP will resolve the split brain. The CRS is nonetheless still a cause of concern: the voting disk on the 3rd site must be seen by the surviving nodes. The Quorum Server may be a cluster service which belongs to a second SG cluster, but its CFS storage is not accessible from the first cluster. At the end of the day, although the Quorum Server is a nice SG feature, it cannot solve the CRS voting disk problem: the voting disks must still be accessed through iSCSI or NFS.

I/O Fencing

The same paragraph I wrote for Solaris can be copied and pasted:

Both VCS and SG use SCSI3 persistent reservation via ioctl, and I/O fencing to prevent corruption. Each node registers a key (it is the same for all the node paths). Once node membership is established, the registration keys of all the nodes that do not form part of the cluster are removed by the surviving nodes of the cluster. This blocks write access to the shared storage from evicted nodes.

April 17, 2007

RAC geographical clusters and 3rd party clusters (Sun Solaris) (1/3)

Filed under: Oracle,RAC,Solaris — christianbilien @ 9:06 pm

As a word of introduction, a geographical RAC cluster is a RAC where at least one node is physically located in a remote location, and DB access remains available should one of the sites go down.

I found that many customers wishing to implement a RAC geo cluster get confused by vendors when it comes to the RAC relationships (or should I say dependencies) with third party clusters. I also have the impression that some Oracle sales reps tend to add to this confusion by encouraging troubled prospects one way or another, depending on their particular interest in a 3rd party hardware/cluster provider.

Let’s first say that I am just addressing the RAC options here. Assuming some other applications need clustering services, a third party cluster will be necessary (although some provisions, still in their infancy, exist within the CRS to “clusterize” non-RAC services). I’ll also deliberately not discuss NAS storage as I never had the opportunity to work with or even consider a RAC/NAS option (Pillar, NetApp, and a few others are trying to get into this market).

This first post is about RAC geo clusters on Solaris. RAC geo clusters on HP-UX will be covered here.

The Solaris compatibility matrix is located at https://metalink.oracle.com/metalink/plsql/f?p=140:1:2790593111784622179

I consider two cluster areas to be strongly impacted by the “third party cluster or not” choice: storage and membership strategy. Some may also argue about private interconnect protection against failure, but since IPMP may be used for the RAC-only option, and although some technical differences exist, I think this is a matter of much less importance than storage and membership.

Storage:

  • 10gR1 was very special as it did not have any Oracle protection for the vote and ocr volumes. This lack of functionality had a big impact on geo clusters as some third party storage clustering was required for vote and OCR mirroring.
  • 10gR2: The options may not be the same for OCR/vote, data base files, binaries and archivelog files. Although putting archivelog files on a clustered file system saves NFS mounts, binaries and archivelogs may usually be located on their own “local” file system, which may be on the array but seen from only one node. The real issues are on one hand the DB files, on the other hand the OCR and voting disks, which are peculiar because they must be seen when the CRS starts, BEFORE the ASM or any Oracle dependent process can be started.
  • RAC + Sun Cluster (SCS): The storage can either be a Solaris volume manager with raw devices or QFS; GFS is not supported. ASM may be used, but offers little in my opinion compared to a volume manager. ASM used for mirroring suffers from the mirror reconstruction that has to be performed when one of the sites is lost, and from the lack of any feature similar to copying modified blocks only (the way storage mirroring does).
  • RAC + Veritas Cluster Services (VCS): the Veritas cluster file system (the VxFS cluster version), running over the Cluster Volume Manager (the VxVM cluster version), is certainly a good solution for those averse to raw devices/ASM. All of the Oracle files, including the OCR and the voting disks, can be put on the CFS. This is because the CFS can be brought up before the CRS starts.
  • RAC without any third party cluster: ASM has to be used for storage mirroring. This is easier to manage and cheaper, although mirrored disk group reconstruction is a concern when volumes are high. I also like avoiding the coexistence of two clusters (RAC on top of SCS or VCS).

Membership, split brain and amnesia

A number of membership issues are addressed differently by SCS/VCS and the CRS/CSS. It is beyond the scope of this post to explain fencing, split brain and amnesia. There are really two worlds here: on one hand, Oracle has a generic clusterware membership system across platforms, which avoids system and storage dependency; on the other hand, VCS and SCS take advantage of SCSI persistent reservation ioctls. Veritas and Sun both advocate that Oracle’s node eviction strategy may create situations in which a node would be evicted from the cluster but not yet forced to reboot. Other instances may then start instance recovery while the failed instance still writes to the shared storage. Oracle says that database corruption is prevented by using the voting disk, the network and the control file to determine when a remote node is down. This is done in different, parallel, independent ways. I am not going to enter the war on one side or another; let’s just recall the basic strategies:

  • CSS: this process uses both the interconnects and the voting disks to monitor remote nodes. A node must be able to access strictly more than half of the voting disks at any time (this is the reason for the odd number of voting disks), which prevents split brain. The css misscount is 30s, which is the network heartbeat time allowance for not responding before eviction.
  • Both VCS and SCS use SCSI3 persistent reservation via ioctl, and I/O fencing to prevent corruption. Each node registers a key (it is the same for all the node paths). Once node membership is established, the registration keys of all the nodes that do not form part of the cluster are removed by the surviving nodes of the cluster. This blocks write access to the shared storage from evicted nodes.

One last bit: although not a mainstream technology (and that is unlikely to change now that RDS over Infiniband is an option on Linux and soon on Solaris), I believe SCS is needed to allow RSM over SCI/SunFire Link to be used. The specs show quite an impressive latency of a few microseconds.
