Christian Bilien’s Oracle performance and tuning blog

January 19, 2008

Oracle’s clusterware real time priority oddity

Filed under: Oracle,RAC — christianbilien @ 6:16 pm


The CSS processes use both the interconnect and the voting disks to monitor the remote nodes. A node must be able to access strictly more than half of the voting disks at any time (this is the reason for the odd number of voting disks), which prevents split brain. Let’s just recall that a split brain occurs when several cluster “islands” are created without being aware of each other. Oracle uses a modified STONITH (Shoot The Other Node In The Head) algorithm: instead of being able to power off or reset the other nodes, a node can merely instruct them to commit suicide.
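The majority rule itself is simple arithmetic. Here is a minimal sketch of the voting-disk quorum test (the variable names are mine, not Oracle's):

```shell
# Hypothetical sketch of the voting-disk majority rule: with V voting
# disks, a node must see strictly more than V/2 of them to survive.
# An odd V guarantees that at most one "island" can hold a majority.
V=3           # total number of voting disks (odd on purpose)
ACCESSIBLE=2  # voting disks this node can currently access
if [ "$ACCESSIBLE" -gt $((V / 2)) ]; then
  echo "quorum held: node stays in the cluster"
else
  echo "quorum lost: node must evict itself"
fi
```

With three disks, losing one still leaves a majority (2 > 1); with an even count, a 2/2 split could leave two islands each claiming half, which is exactly what the odd number rules out.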

This subtlety has far-reaching consequences: the clusterware software on each node MUST be able to coordinate its own actions in any case, without relying upon the other nodes. There is an obvious potential problem when the clusterware processes cannot get the CPU in a timely manner, especially as a lot of the cssd code runs in user mode. This can be overcome by raising the cssd priority, which the 10.2.0.2 release addressed by allowing the css priority to be set in the OCR registry:

crsctl set css priority 4

Meaning of priority 4
You’ll see on Solaris in /etc/init.d/init.cssd the priority boost mechanism which corresponds to the values you can pass to crsctl set css priority:

PRIORITY_BOOST_DISABLED=0
PRIORITY_BOOST_LOW=1
PRIORITY_BOOST_MID=2
PRIORITY_BOOST_HIGH=3
PRIORITY_BOOST_REALTIME=4
PRIORITY_BOOST_RENICE_LOW=-5
PRIORITY_BOOST_RENICE_MID=-13
PRIORITY_BOOST_RENICE_HIGH=-20
PRIORITY_BOOST_RENICE_REALTIME=0
PRIORITY_BOOST_ENABLED=1
PRIORITY_BOOST_DEFAULT=$PRIORITY_BOOST_HIGH

A bit further down the file:

RTGPID='/bin/priocntl -s -c RT -i pgid'

Further down:

  if [ $PRIORITY_BOOST_ENABLED = '1' ]; then
    NODENAME=`$CRSCTL get nodename`     # check to see the error codes
    case $? in
    0)
      # since we got the node name, now try to get the actual
      # boost value for the node
      PRIORITY_BOOST_VALUE=`$CRSCTL get css priority node $NODENAME` => retrieves the PRIORITY_BOOST_VALUE

Still further down:

    case $PRIORITY_BOOST_VALUE in
      $PRIORITY_BOOST_LOW)
        # low priority boost
        $RENICE $PRIORITY_BOOST_RENICE_LOW -p $$
        ;;
      $PRIORITY_BOOST_MID)
        # medium level boost
        $RENICE $PRIORITY_BOOST_RENICE_MID -p $$
        ;;
      $PRIORITY_BOOST_HIGH)
        # highest level normal boost
        $RENICE $PRIORITY_BOOST_RENICE_HIGH -p $$
        ;;
      $PRIORITY_BOOST_REALTIME)
        # realtime boost only should be used on platforms that support this
        $RTGPID $$ => realtime
        ;;

So setting a priority of 4 should set the current shell and its children to RT using priocntl.
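A quick way to verify which scheduling class a process actually ended up in is to ask ps for it (a generic check, not taken from the init.cssd script):

```shell
# Print the scheduling class and priority of the current shell.
# On Solaris, a successful "priocntl -s -c RT -i pgid" on this
# process group would make the class column read RT instead of TS.
ps -o pid,class,pri -p $$
```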

… but the cssd daemons do not run under RT

I noticed this oddity in the 10.2.0.2 release on Solaris 8, following a RAC node reboot under (very) heavy CPU and memory load. The cssd trace file shows:

[ CSSD]2008-01-03 18:49:04.418 [11] >WARNING: clssnmDiskPMT: sltscvtimewait timeout (282535)
[ CSSD]2008-01-03 18:49:04.428 [11] >TRACE: clssnmDiskPMT: stale disk (282815 ms) (0//dev/vx/rdsk/racdg/vote_vol)
[ CSSD]2008-01-03 18:49:04.428 [11] >ERROR: clssnmDiskPMT: 1 of 1 voting disk unavailable (0/0/1)

This is a timeout after 282.5 s (282,535 ms) of polling on the voting disk. No I/O errors were reported in /var/adm/messages or by the storage array.
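The figures in the trace are in milliseconds, so the disk had been stale for about 282 seconds, well past CSS's long disk I/O timeout (disktimeout defaults to 200 s on 10.2.0.2 — an assumption worth checking against your own settings):

```shell
# Convert the stale-disk figure from the trace from ms to seconds
MS=282815
echo "disk stale for $((MS / 1000)) seconds"
```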

The priority is unset by default:

$crsctl get css priority
Configuration parameter priority is not defined.
 
$crsctl set css priority 4
Configuration parameter priority is now set to 4.

This is written into the ocr registry (from ocrdump):

[…]
[SYSTEM.css.priority]
UB4 (10) : 4
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}
[…]

Look at the priorities:

$/crs/10.2.0/bin # ps -efl | grep cssd | grep -v grep | grep -v fatal
 0 S     root  1505  1257   0  40 20        ?    297        ?   Dec 27 ?           0:00 /bin/sh /etc/init.d/init.cssd oproc
 0 S     root  1509  1257   0  40 20        ?    298        ?   Dec 27 ?           0:00 /bin/sh /etc/init.d/init.cssd oclso
 0 S     root  1510  1257   0  40 15        ?    298        ?   Dec 27 ?           0:00 /bin/sh /etc/init.d/init.cssd daemo
 0 S   oracle  1712  1711   0  40 15        ?   9972        ?   Dec 27 ?          71:52 /crs/10.2.0/bin/ocssd.bin

The NI column shows the nice value, which has been amended from 20 to 15, but a PRI of 40 is still a TimeShare priority: init.cssd and ocssd.bin are less prone to preemption but still exposed to high loads.
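On Solaris the NI column displays the nice value offset by NZERO (20), so the renice values from init.cssd map directly onto what ps shows (a quick arithmetic sketch):

```shell
# Solaris ps shows NI = nice + NZERO (NZERO = 20), so a renice to
# -5, -13 or -20 shows up in the NI column as 15, 7 or 0 respectively
for RENICE_TO in -5 -13 -20; do
  echo "renice to $RENICE_TO shows as NI $((RENICE_TO + 20))"
done
```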

oprocd is the only css process to run in Real Time, whatever the priority setting. Its purpose is to detect system hangs.

ps -efl| grep oprocd
0 S     root 27566 27435   0   0 RT        ?    304       ?   Nov 17 console     9:49 /crs/10.2.0/bin/oprocd.bin run -t 1

The css priorities remained unchanged after stopping and restarting the CRS. Part of the above behavior is described in Metalink note 4127644.8, where it is filed as a bug. I did not test crsctl set css priority on 10.2.0.3; the priorities there are identical to those on 10.2.0.2 when unset.

2 Comments »

  1. Hi Christian!

    Nice post.

    I have checked it on 10.2.0.3 on HP-UX v11.23 IA-64, and it works, but only for priority 4:

    set css priority    real CSSD process priority
    4                   -32
    3                   152
    2                   152
    1                   152
    not set             152

    Oleksandr

    Comment by odenysenko — January 21, 2008 @ 2:08 pm

  2. Hi Oleksandr,

    Assuming the init.cssd file is similar under HP-UX (I can’t check right now), priorities 1, 2 and 3 are anti-niced. So you could check the nice value with ps: it should be 15, 7 or 0 (20 minus the anti-nice value), but the process would still be in the TS priority range (152). And yes, -32 is an rtsched priority, so this looks OK.

    Thanks for stopping by

    Christian

    Comment by christianbilien — January 21, 2008 @ 2:28 pm

