Christian Bilien’s Oracle performance and tuning blog

August 26, 2010

I’ll be a presenter at Oracle Open World 2010

Filed under: Oracle — christianbilien @ 7:39 pm

I submitted an abstract to the Oracle OpenWorld 2010 that was found to be worthwhile enough to be accepted. Here are the session details:

Speaker(s) Christian BILIEN, BNP PARIBAS Corporate Investment Banking, Head of the DBAs
Monday, September 20, 2:00PM | Moscone South, Rm 236 60 min.
Session ID: S314649
Title: Large-Scale Oracle RAC (and Sun Oracle Database Machine) Scalability Design
Abstract: In this technical session, hear how a large corporate investment bank tames a yearly 50 percent to 100 percent data growth rate with Oracle Real Application Clusters (Oracle RAC) and Sun Oracle Database Machines. With databases totaling hundreds of terabytes, an integrated design of databases, storage, and servers is the cornerstone of scalability. This presentation highlights storage design from an Oracle RAC perspective as well as the specific challenges of Oracle RAC data warehouses and Oracle RAC geographical clusters. Real-life measurements on Oracle Solaris, Linux, EMC² storage, and RDS over InfiniBand (outside and inside the Sun Oracle Database Machine) illustrate performance and scalability perspectives and bottlenecks.

February 12, 2008

The “log file sync” wait event is not always spent waiting for an I/O

Filed under: Oracle,Storage — christianbilien @ 4:41 pm

I went through a rather common problem some time ago, one which shows how easily one can be fooled by the "log file sync" wait event. Here is a log file sync wait histogram which the DBAs interpreted as a poor storage array response time, often a convenient way to get off the hook of a performance problem:


This looks like a Gaussian distribution with a mean value of maybe 30ms. The same event recorded some weeks earlier shows a mean value located between 8 and 16 ms.


Over one hour, each log file sync waits an average of 101 ms:

Event                        Avg wait Time (ms)
log file sync                       101

The storage array write destination is a rather loaded EMC DMX1000, and the multiplexed log files are SRDF mirrored. SRDF is the EMC protocol which replicates I/Os from one array to another one. The replication is synchronous here, meaning the remote cache array has to acknowledge each write to the local one before returning a completion status to the LGWR. The SRDF write time must be added to the average write time to reflect the array response time. Some overhead also exists in the OS layer and possibly at the SAN level. Both redo log groups have similar write access patterns. The SRDF write time is close to 30ms over this period of time, a very bad figure. The average write time on the array is 4ms.

This is where a bit of digging may be interesting:

All calls are emitted from a single program which basically does some small transactions and commits. We’ll consider that no other conditions may trigger a write into the redo log other than a commit.

The average log file parallel write wait over the same hour is not good either, but far from the 101 ms log file sync:

Event                          Avg wait Time (ms)
log file parallel write              37

This fits with the array I/O times.
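The arithmetic can be sanity-checked with the figures quoted above: the local array write time plus the synchronous SRDF leg should roughly reproduce the measured log file parallel write.

```shell
# Local array write + synchronous SRDF replication, figures from the post.
awk 'BEGIN {
  local_ms = 4    # average write time on the local array
  srdf_ms  = 30   # SRDF synchronous write time
  printf "expected log file parallel write: ~%d ms (measured: 37 ms)\n", local_ms + srdf_ms
}'
```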

A quick recap on the ‘log file sync’ wait event may be necessary here:

The log file sync wait event is the time a foreground process spends waiting for its redo to be flushed. It can be assumed here that the LGWR is always active: it does not have to be awakened by the system dispatcher. The log file sync wait can be broken into:

  1. The ‘redo write time’: the total elapsed time of the write from the redo log buffer to the current redo log file (in centiseconds).
  2. The ‘log file parallel write’: basically the time for the log write I/O to complete.
  3. The LGWR may have some post-processing to do, then signals the waiting foreground process that the write has completed. The foreground process is finally woken up by the system dispatcher. This completes the ‘log file sync’ wait.

I also extracted two other log write statistics from the awr report:

Statistic          Total     per Second     per Trans
redo write time  275,944           77.6           0.2
user commits   1,163,048          327.1           1.0

776/327 = 2.3 ms of "redo write time" per user commit

Step 3 therefore accounts for 101 - 2.3 - 37 = 61.7 ms. The most likely cause of this main contributor to the log file sync can be found here:
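The breakdown above can be reproduced from the AWR figures:

```shell
# Break the 101 ms log file sync into its components (AWR figures above).
awk 'BEGIN {
  sync_ms  = 101   # avg log file sync
  redo_ms  = 2.3   # redo write time per commit (776/327)
  write_ms = 37    # avg log file parallel write
  printf "post-processing + run queue wait: %.1f ms\n", sync_ms - redo_ms - write_ms
}'
```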

sar -q -f /var/adm/sa/sa07
         runq-sz %runocc swpq-sz %swpocc
05:00:01   15.9      48     0.0      0
05:05:01   24.1      42     0.0      0
05:10:01   25.2      46     0.0      0
05:15:00   15.5      38     0.0      0
05:20:01   15.6      43     0.0      0
05:25:00   14.4      38     0.0      0
05:30:00   17.2      36     0.0      0
05:35:00   18.0      47     0.0      0
05:40:00   16.9      38     0.0      0
05:45:01   12.3      33     0.0      0
05:50:00   10.1      41     0.0      0
05:55:00   11.4      39     0.0      0
06:00:00   11.1      47     0.0      0

Look at the first column, which shows the CPU run queue size. Many processes are competing for the CPUs. Although we could carry on the investigation by measuring how much time foreground processes are spending in the run queue, we can at this point conclude that the processes are spending about two thirds of the log file sync wait times waiting for the CPU, not for the I/O subsystem.
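Averaging the run-queue column confirms the pressure. A small sketch (the helper name is mine; it reads a `sar -q` listing on stdin and skips the header and the trailing Average line):

```shell
# Average the runq-sz column of a 'sar -q' listing read on stdin.
avg_runq() {
  # Keep only data lines (second field numeric), skip sar's Average line.
  awk '$1 != "Average" && $2 ~ /^[0-9]/ { s += $2; n++ }
       END { if (n) printf "%.1f\n", s / n }'
}

sar -q -f /var/adm/sa/sa07 | avg_runq
```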

It would also be interesting, for the sake of completeness, to understand why %runocc is relatively small compared to the average run queue size.

January 19, 2008

Oracle’s clusterware real time priority oddity

Filed under: Oracle,RAC — christianbilien @ 6:16 pm

The CSS processes use both the interconnect and the voting disks to monitor the remote nodes. A node must be able to access strictly more than half of the voting disks at any time (this is the reason for the odd number of voting disks), which prevents split brain. Let’s just recall that split brains occur when several cluster “islands” are created without being aware of each other. Oracle uses a modified STONITH (Shoot The Other Node In The Head) algorithm: instead of being able to fail other nodes, one node can merely instruct the other nodes to commit suicide.

This subtlety has far-reaching consequences: the clusterware software on each node MUST be able to coordinate its own actions in any case, without relying upon the other nodes. There is an obvious potential problem when the clusterware processes cannot get the CPU in a timely manner, especially as a lot of the cssd code runs in user mode. This can be overcome by raising the cssd priority, which is addressed by setting the css priority in the OCR registry:

crsctl set css priority 4

Meaning of priority 4
You’ll see on Solaris in /etc/init.d/init.cssd the priority boost mechanism which corresponds to the values you can pass to crsctl set css priority:


A bit further down the file:

RTGPID=’/bin/priocntl -s -c RT -i pgid’

Further down:

  if [ $PRIORITY_BOOST_ENABLED = '1' ]; then
    NODENAME=`$CRSCTL get nodename`     # check to see the error codes
    case $? in
      # since we got the node name, now try to get the actual
      # boost value for the node
     PRIORITY_BOOST_VALUE=`$CRSCTL get css priority node $NODENAME` => retrieves the PRIORITY_BOOST_VALUE

Still further down:

        # low priority boost
        # medium level boost
        # highest level normal boost
        # realtime boost only should be used on platforms that support this
        $RTGPID $$ => realtime 

So setting a priority of 4 should set the current shell and its children to RT using priocntl.
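The boost logic can be sketched as a tiny shell function. This is a simplification, not the actual init.cssd code: only the real-time case shown in the excerpt above (value 4 mapping to the RTGPID command) is handled, and the function merely prints the command it would run.

```shell
#!/bin/sh
# Simplified sketch of the init.cssd boost logic (real-time case only).
RTGPID='/bin/priocntl -s -c RT -i pgid'   # from init.cssd: move a process group to the RT class

boost_cmd() {
  # $1 = PRIORITY_BOOST_VALUE, $2 = process group id to boost.
  # Prints the priocntl command instead of executing it.
  if [ "$1" = "4" ]; then
    echo "$RTGPID $2"
  else
    echo "no RT boost for value $1"
  fi
}

boost_cmd 4 $$
```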

… but the cssd daemons do not run under RT

I noticed this oddity on Solaris 8 following a RAC node reboot under a (very) heavy CPU and memory load. The trace file contains:

[ CSSD]2008-01-03 18:49:04.418 [11] >WARNING: clssnmDiskPMT: sltscvtimewait timeout (282535)
[ CSSD]2008-01-03 18:49:04.428 [11] >TRACE: clssnmDiskPMT: stale disk (282815 ms) (0//dev/vx/rdsk/racdg/vote_vol)
[ CSSD]2008-01-03 18:49:04.428 [11] >ERROR: clssnmDiskPMT: 1 of 1 voting disk unavailable (0/0/1)

This is a timeout after polling the voting disk for 282.5 s (282,535 ms). No I/O errors were reported either in /var/adm/messages or by the storage array.

The priority is unset by default:

$crsctl get css priority
Configuration parameter priority is not defined.
$crsctl set css priority 4
Configuration parameter priority is now set to 4.

This is written into the ocr registry (from ocrdump):

UB4 (10) : 4

Look at the priorities:

$/crs/10.2.0/bin # ps -efl | grep cssd | grep -v grep | grep -v fatal
 0 S     root  1505  1257   0  40 20        ?    297        ?   Dec 27 ?           0:00 /bin/sh /etc/init.d/init.cssd oproc
 0 S     root  1509  1257   0  40 20        ?    298        ?   Dec 27 ?           0:00 /bin/sh /etc/init.d/init.cssd oclso
 0 S     root  1510  1257   0  40 15        ?    298        ?   Dec 27 ?           0:00 /bin/sh /etc/init.d/init.cssd daemo
 0 S   oracle  1712  1711   0  40 15        ?   9972        ?   Dec 27 ?          71:52 /crs/10.2.0/bin/ocssd.bin

The nice value (second numeric column after the priority) has been amended from 20 to 15, but 40 is still a time-share priority, making init.cssd and ocssd.bin less prone to preemption but still exposed to high loads.

oprocd is the only css process to run in real time, whatever the priority setting. Its purpose is to detect system hangs:

ps -efl| grep oprocd
0 S     root 27566 27435   0   0 RT        ?    304       ?   Nov 17 console     9:49 /crs/10.2.0/bin/oprocd.bin run -t 1

The css priorities remained unchanged after stopping and restarting the CRS. Part of the above behavior is described in Metalink note 4127644.8, where it is filed as a bug. I did not test crsctl set css priority on other releases; the priorities are identical to what they are when the parameter is unset.

January 6, 2008

Where has all my memory gone?

Filed under: Oracle — christianbilien @ 8:09 pm

A while ago, I came across an interesting case of memory starvation on an Oracle DB server running Solaris 8, for once not directly related to the SGA or the PGA. From a user perspective, the problem showed up as temporary “hangs” that only seemed to happen at a specific time of the day. This server is dedicated to a single 10gR2 Oracle instance. Looking at the OS figures, the first things I saw were some vmstat signs of memory pressure:

A high number of page reclaims, 200 to 500 MB of free memory left out of 16 GB, and 2,000 to 3,000 pages/s scanned by the page scanner. Look at how memory is allocated using prtmem:

Total memory:           15614 Megabytes
Kernel Memory:           1534 Megabytes
Application:            12888 Megabytes
Executable & libs:        110 Megabytes
File Cache:               410 Megabytes
Free, file cache:         250 Megabytes
Free, free:               430 Megabytes

But look at the Oracle SGA and PGA:


Total System Global Area 6442450944 bytes
Fixed Size                  2038520 bytes
Variable Size            3489662216 bytes
Database Buffers         2936012800 bytes
Redo Buffers               14737408 bytes


  select * from v$pgastat
 NAME                                                                  VALUE UNIT
---------------------------------------------------------------- ---------- ------------
aggregate PGA target parameter                                   1048576000 bytes
aggregate PGA auto target                                          65536000 bytes
global memory bound                                                 2258944 bytes
total PGA inuse                                                  1181100032 bytes
total PGA allocated                                              2555433984 bytes
maximum PGA allocated                                            2838683648 bytes
total freeable PGA memory                                         755367936 bytes
process count                                                          1943
max processes count                                                    2273
PGA memory freed back to OS                                      2.5918E+11 bytes
total PGA used for auto workareas                                   9071616 bytes

Well, at this point that’s 7.5 GB (again, out of 16 GB) for the PGA currently in use plus the SGA allocation. I assumed here that because of the memory pressure, the unused part of the PGA had already been paged out. prtmem showed an “application” memory size of 12.9 GB.

Where did the 5.4 GB go?

I looked at the structure of one of the processes using pmap –x:

pmap -x 14816
         Address   Kbytes Resident Shared Private Permissions       Mapped File
0000000100000000  100536   44600   44600       - read/exec         oracle
000000010632C000     816     560     368     192 read/write/exec   oracle
00000001063F8000     912     904       -     904 read/write/exec     [ heap ]
0000000380000000   16384   16384   16384       - read/write/exec/shared  [ ism shmid=0x5004 ]
00000003C0000000 3145728 3145728 3145728       - read/write/exec/shared  [ ism shmid=0x2005 ]
0000000480000000 3129360 3129360 3129360       - read/write/exec/shared  [ ism shmid=0x11007 ]
FFFFFFFF7B270000     128      48       -      48 read/write          [ anon ]
FFFFFFFF7B300000      64      64       -      64 read/write          [ anon ]
FFFFFFFF7B310000     448     328       -     328 read/write          [ anon ]
FFFFFFFF7B400000       8       8       8       - read/write/exec/shared   [ anon ]
FFFFFFFF7B500000       8       8       -       8 read/write/exec     [ anon ]
FFFFFFFF7B600000      16      16      16       - read/exec
FFFFFFFF7B704000       8       8       -       8 read/write/exec
FFFFFFFF7B800000       8       8       -       8 read/write/exec     [ anon ]
FFFFFFFF7B900000     216     216     216       - read/exec
FFFFFFFF7BA34000      16      16       -      16 read/write/exec
FFFFFFFF7BB00000      24      24      24       - read/exec
FFFFFFFF7BC06000       8       8       -       8 read/write/exec
FFFFFFFF7BD00000      32      32      32       - read/exec
FFFFFFFF7BE08000       8       8       -       8 read/write/exec
FFFFFFFF7BF00000     728     728     728       - read/exec
FFFFFFFF7C0B6000      56      56       -      56 read/write/exec
FFFFFFFF7C0C4000       8       8       -       8 read/write/exec
FFFFFFFF7C100000       8       8       -       8 read/write/exec     [ anon ]
FFFFFFFF7C200000       8       8       8       - read/exec
FFFFFFFF7C302000       8       8       -       8 read/write/exec
FFFFFFFF7C400000       8       8       -       8 read/write/exec
FFFFFFFF7C500000      32      24      24       - read/exec
FFFFFFFF7C608000       8       8       -       8 read/write/exec
FFFFFFFF7C700000      56      56      56       - read/exec
FFFFFFFF7C80E000      16      16       -      16 read/write/exec
FFFFFFFF7C900000     672     672     672       - read/exec
FFFFFFFF7CAA8000      64      64       -      64 read/write/exec
FFFFFFFF7CAB8000      32      32       -      32 read/write/exec
FFFFFFFF7CB00000       8       8       -       8 read/write/exec     [ anon ]
FFFFFFFF7CC00000       8       8       8       - read/exec
FFFFFFFF7CD02000       8       8       -       8 read/write/exec
FFFFFFFF7CE00000    2176     376     376       - read/exec
FFFFFFFF7D11E000     240     232      16     216 read/write/exec
FFFFFFFF7D15A000       8       -       -       - read/write/exec
FFFFFFFF7D200000      72      72      72       - read/exec
FFFFFFFF7D310000       8       8       -       8 read/write/exec
FFFFFFFF7D400000    1056     112     112       - read/exec
FFFFFFFF7D606000      48      32       -      32 read/write/exec
FFFFFFFF7D612000       8       -       -       - read/write/exec
FFFFFFFF7D700000       8       8       -       8 read/write/exec     [ anon ]
FFFFFFFF7D800000    9256    3272    3272       - read/exec
FFFFFFFF7E208000     560     472       8     464 read/write/exec
FFFFFFFF7E300000    1056     208     208       - read/exec
FFFFFFFF7E506000      56      48      16      32 read/write/exec
FFFFFFFF7E514000       8       -       -       - read/write/exec
FFFFFFFF7E600000    1256     136     136       - read/exec
FFFFFFFF7E838000      64      56       -      56 read/write/exec
FFFFFFFF7E848000       8       -       -       - read/write/exec
FFFFFFFF7E900000    1368     536     536       - read/exec
FFFFFFFF7EB54000      72      56       -      56 read/write/exec
FFFFFFFF7EC00000       8       8       -       8 read/write/exec     [ anon ]
FFFFFFFF7ED00000       8       8       8       - read/exec
FFFFFFFF7EE00000       8       8       -       8 read/write/exec
FFFFFFFF7EF00000    1736    1088    1088       - read/exec
FFFFFFFF7F1B0000      72      64       -      64 read/write/exec
FFFFFFFF7F1C2000       8       8       -       8 read/write/exec
FFFFFFFF7F200000     128     128     128       - read/exec
FFFFFFFF7F31E000      16      16       -      16 read/write/exec
FFFFFFFF7F400000       8       8       8       - read/exec
FFFFFFFF7F500000       8       8       -       8 read/write/exec     [ anon ]
FFFFFFFF7F600000     176     176     176       - read/exec
FFFFFFFF7F72C000      16      16       -      16 read/write/exec
FFFFFFFF7FFE0000     128     128       -     128 read/write          [ stack ]
----------------  ------  ------  ------  ------
        total Kb 6416104 6347336 6344392    2944

Look at the resident size of the private sections of the segments: the total private size is about 2.9 MB. The largest private chunk is the heap, but it is only a third of the total private space. The remaining part of the private area resident in physical memory is made of anon segments and of private data sections.
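Rather than eyeballing the listing, the Private column can be summed with a bit of awk. A sketch (the helper name is mine; the column positions are those of the Solaris 8 pmap -x output shown above):

```shell
# Sum the resident private memory (KB) of one process from a 'pmap -x'
# listing read on stdin.
sum_private() {
  # Private resident KB is the 5th column; skip the header, the all-shared
  # segments (shown as "-"), the dash rule and the grand-total line.
  awk '$1 != "total" && $5 ~ /^[0-9]+$/ { s += $5 } END { print s + 0 }'
}

pmap -x 14816 | sum_private
```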

_use_realfree_heap=true (the default in Oracle 10g) means that separate heaps are used for the process portion of the PGA, the CGA (call global area), the UGA and possibly other components. _use_ism_for_pga is also at its default value (false), meaning the PGA is indeed part of the heap, not allocated from an ISM segment.

This is where it gets interesting: the number of Oracle user processes at this time of the day is around 2,000. I plotted the private size occupied by each of 100 randomly selected oracle processes: they all had a private memory of 3 MB +/- 10%, and it was unlikely that any of the other processes would allocate significantly more memory. 3 MB of private memory per process x 2,000 processes = 6 GB: that’s about 4.9 GB of “non-PGA” private space (PGA in use = 1.1 GB). This example highlights the fact that the UGA, CGA and other portions of private memory, although seldom accounted for when sizing memory, are not always negligible when the database hosts many connections.
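The sizing arithmetic above, spelled out:

```shell
# Connection-memory sizing: per-process private memory times the process
# count, minus the PGA actually in use (figures from the post).
awk 'BEGIN {
  per_proc_mb   = 3      # private memory per process, MB
  procs         = 2000   # oracle user processes
  pga_in_use_gb = 1.1    # "total PGA inuse"
  total_gb = per_proc_mb * procs / 1000
  printf "private: %.1f GB, non-PGA private: %.1f GB\n", total_gb, total_gb - pga_in_use_gb
}'
```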

December 24, 2007

11g ASM preferred reads for RAC extended clusters

Filed under: Oracle,RAC — christianbilien @ 10:14 am

I expressed my delight this summer over the 11g fast resync option for extended (geographical) RAC clusters, in which the mirrored failure groups have to be on different sites, e.g. one failure group on array 1 on site 1 and its mirrored copy on array 2 on site 2. The major 10g limitations in this area are twofold:

  • The loss of an array means the mandatory reconstruction of the whole database.
  • Failgroups have to be reconstructed serially (no more than one rebalance activity per node, per ASM group).

This is really painful as any scheduled shutdown of one array while the other one is still active is analogous to an outage. Imagine what it takes to rebuild 100TB.

Normal redundancy (mirrored) reconstruction is much slower than either the SAN or the storage layout allows (there is room for “research” here: the best I have seen so far is 150 GB/hour, that’s 43 MB/s on 64 15,000 rpm disks on the highest-end DMX storage array; the FC ports were 2x 4Gb/s).
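The 150 GB/hour figure converts as follows (binary megabytes assumed):

```shell
# 150 GB/hour expressed in MB/s.
awk 'BEGIN { printf "%.0f MB/s\n", 150 * 1024 / 3600 }'
```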

The second problem with 10g extended RAC clusters is linked to the way the ASM storage allocator creates primary and mirrored extents. In a disk group made of two failgroups, there is no notion of a “primary failgroup” and a “secondary failgroup”: each failgroup contains an equal number of primary and mirrored copies. Let’s take the example of a diskgroup made of two failgroups, each on a different storage array. For each storage array, here is the number of primary and mirrored extents for a given group (made of just one datafile here):

We’ll use the X$KFFXP table. The columns of interest here are:

GROUP_KFFXP : disk group number
LXN_KFFXP: 0: primary extent/ 1: mirrored extent
DISK_KFFXP: disk where the extent has been created
NUMBER_KFFXP : ASM file number

select disk_kffxp, lxn_kffxp, count(*) from x$kffxp x, v$asm_file v
where group_kffxp=1 and x.number_kffxp=v.file_number and xnum_kffxp!=65534
group by disk_kffxp, lxn_kffxp;


DISK_KFFXP  LXN_KFFXP   COUNT(*)
         1          0       1050
         1          1       1050
         2          0       1050
         2          1       1050

The number of primary and mirrored extents is balanced on each of the disks of group 1. The logic is that while writes will equally hit both types of extents, reads will only access the primary extents. It makes sense to spread primary extents over the two failure groups to maximize throughput when the two failure groups are on an equal footing: same throughput and same I/O response time.

But what happens when the arrays are in different locations? As each instance reads from both arrays, the distance factor introduces delays on half of the reads. The delay is a function of the distance, but we also need to take into account that the number of storage hops the reads must traverse increases when a remote site has to be reached. Even worse, the inter-site links are prone to congestion, as they must accommodate oversubscription and link contention.

Whenever possible, a clever use of RAC data locality will lessen the RAC interconnect burden and reduce reliance on cache fusion. In 11g it can also reduce the inter-site link throughput requirements and possibly speed up reads by serving them locally, assuming the local array is not itself overloaded.

The preferred read is an instance wide initialization parameter:

ASM_PREFERRED_READ_FAILURE_GROUPS = diskgroup_name.failure_group_name,…
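As an illustration (the diskgroup and failure group names below are hypothetical, not taken from the post): with diskgroup DATA mirrored across failgroups FG_SITE1 and FG_SITE2, the ASM instance on site 1 would declare its local failure group as preferred:

```shell
# Hypothetical names: diskgroup DATA, local failure group FG_SITE1,
# local ASM instance +ASM1. Run against the site-1 ASM instance so
# that its reads are served from the local array.
sqlplus / as sysasm <<'EOF'
ALTER SYSTEM SET ASM_PREFERRED_READ_FAILURE_GROUPS = 'DATA.FG_SITE1' SID = '+ASM1';
EOF
```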

ASM extent layout reference: Take a look at Luca Canali’s work: a peek inside Oracle ASM metadata

— Updated: check the comment below for Luca’s updated links – very interesting indeed —

December 13, 2007

One of Pavlov’s dogs (2/2)

Filed under: Oracle,RAC,Solaris — christianbilien @ 9:15 pm

I didn’t get much luck with the Pavlov’s dogs challenge: not even a single attempt at an explanation. Maybe it was too weird a problem (or, more embarrassing, it did not interest anyone!).

A quick reminder about the challenge: netstat shows almost no activity on an otherwise loaded interconnect (500 Mb/s to 1 Gb/s inbound and similar values outbound, as seen on the InfiniBand switches and calculated as the product of ‘PX remote messages recv’d/sent’ x parallel_execution_message_size).
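The back-of-the-envelope throughput figure works like this (the message rate below is hypothetical, and 8192 bytes is an assumed parallel_execution_message_size, a common setting):

```shell
# Interconnect throughput estimate from PX message counters.
awk 'BEGIN {
  msgs_per_s = 15000   # hypothetical PX remote messages recv d per second
  msg_size   = 8192    # assumed parallel_execution_message_size, bytes
  printf "%.0f Mb/s\n", msgs_per_s * msg_size * 8 / 1e6
}'
```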

Well, anyway, here is what I think is the answer: the key piece of information I gave was that the clusterware was running RDS over InfiniBand. InfiniBand HCAs have an inherent advantage over standard Ethernet network interfaces: they embed RDMA, which means that transfers are handled without interrupting the CPUs. The sending node reads and writes the receiving node’s user-space memory without going through the usual I/O channel. TCP/IP NICs, by contrast, generate a number of interrupts the CPUs have to process, because TCP segments have to be reassembled while other threads are running.

The most likely cause of the netstat blindness is simply that it cannot see the packets, because the CPUs are unaware of them.

To quote the Wikipedia Pavlov’s dogs article, “the phrase “Pavlov’s dog” is often used to describe someone who merely reacts to a situation rather than use critical thinking”. That’s exactly what I thought of myself when I was trying to put the blame on the setup instead of thinking twice about the “obvious” way of measuring a network packet throughput.

December 9, 2007

One of Pavlov’s dogs (1/2)

Filed under: Oracle,RAC,Solaris — christianbilien @ 7:44 pm

That’s what I thought of myself after spending an hour trying to figure out what I had done wrong in my setup.

So here is a little challenge I derived from this experience:

1. I set up a 5-node 10gR2 RAC on a mix of five 4-core Xeons and Opterons. The interconnect is RDS over an InfiniBand HCA. The setup was made on OpenSolaris. I know, RAC is not supported on OpenSolaris, but we were advised by Sun Microsystems that the InfiniBand layer was significantly faster on OpenSolaris than it was on Solaris 10. Some ftp tests indeed showed that even IPoIB was 20% faster on OpenSolaris than on Solaris 10. So I had to tweak the Oracle installer and could not use any of the Oracle GUIs, but I got it working.

2. select /*+ full(c) full(o) */ * from soe.product_descriptions c,soe.product_information o where c.product_id=o.product_id and c.product_id>300000;

was run simultaneously 10 times on each instance (this is, incidentally, a database generated by Swingbench). The DOP varies from 10 to 40, but it does not make much difference as far as our challenge is concerned.

I then plotted the ‘PX remote messages recv’d’ and ‘PX remote messages sent’.


I could get the following peak throughput (5s interval) by multiplying parallel_execution_message_size by the PX remote messages figures:

[table: peak PX remote messages recv’d (Mb/s) and PX remote messages sent (Mb/s)]
I am not taking into account the GES and the GCS messages, nor did I count the cache fusion blocks. Both of them were small anyway. The weird thing came when I tried to measure the corresponding HCA Input / Output packet number from netstat:

[table: HCA input and output packets/s from netstat]
Almost no traffic on the interconnect (the HCA MTU is 2044 bytes)!

Let’s check that the interconnect is running over the intended HCA:

netstat -ni

Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
lo0 8232 32364353 0 32364353 0 0
ibd0 2044 27588368 0 26782527 0 0
nge1 1500 10.x.x.x 10.x.x.x 640710585 0 363462595 0 0

SQL> oradebug setmypid
SQL> oradebug ipc
SQL> oradebug tracefile_name

SSKGXPT 0x64362c8 flags SSKGXPT_READPENDING socket no 8 IP

Just to be sure:

SQL> select indx,picked_ksxpia,ip_ksxpia from x$ksxpia;
0 CI

Question: What could possibly justify the discrepancies between the netstat output and the values collected from v$px_sesstat ?

Hint: it is not a bug.


November 17, 2007

A short, but (hopefully) interesting walk in the HP-UX magic garden

Filed under: HP-UX,Oracle — christianbilien @ 4:57 pm

While many in the Oracle blogosphere seemed to be part of a huge party in San Francisco, I was spending some time researching an LGWR scheduling issue on HP-UX for a customer. While working on it, I thought that a short walk in the magic garden might be of interest to some readers (Unix geeks will remember from the ’90s “The Magic Garden Explained: The Internals of UNIX System V Release 4”; amazingly, it is still sold by Amazon, though only second-hand copies are available).

I already blogged on HP-UX scheduling: hpux_sched_noage and HP-UX load balancing on SMPs.

I work with q4, a debugger I use extensively when I teach HP-UX performance or HP-UX internals at HP. This first post on HP-UX internals will take us on a walk into the process and thread structures, where I’ll highlight the areas of interest to scheduling.

Let’s first copy the kernel to another file before preparing it for debugging:

$cp /stand/vmunix /stand/vmunix.q4
$q4pxdb /stand/vmunix.q4

q4 can then be invoked:

$q4 /dev/mem /stand/vmunix.q4

(or use ied to enable command recall and editing):

$ied -h $HOME/.q4_history q4 /dev/mem /stand/vmunix.q4

@(#) q4 $Revision: 11.X B.11.23l Wed Jun 23 18:05:11 PDT 2004$ 0
see help topic q4live regarding use of /dev/mem
Reading kernel symbols …
Reading debug information…
Reading data types …
Initialized ia64 address translator …
Initializing stack tracer for EM…

Then let’s load the whole proc structure list from the proc table. Unlike in older versions, nproc is not the size of a hash table but rather the maximum size of a linked list.

q4> load struct proc from proc_list max nproc next p_factp

Here are some fields of interest to the scheduling policy:

p_stat  SINUSE => proc entry in use
p_pri  0x298 => default priority
p_nice  0x14 => nice value
p_ticks   0xf8113a => Number of clock ticks charged to this process
p_schedpolicy  0x2 => Scheduling policy (real time rtsched or rtprio, etc.)

p_pri and p_nice are seen in ps, top, glance, etc.

p_firstthreadp  0xe0000001345d10c0  and p_lastthreadp  0xe0000001345d10c0 

are the backward and forward pointer links to the other threads of the process. They are identical here because the process is single-threaded.

We’ll just keep one process to watch:

q4> keep p_pid == 5252

and we can now load this process thread list:

q4> load struct kthread from kthread_list max nkthread next kt_factp

Here are some of the fields I know of related to scheduling or needed for accessing the next thread:

kt_link  0 => forward run/sleep queue link
kt_rlink  0 => backward run queue link
kt_procp  0xe00000012ffd2000 => a pointer to its proc  structure
kt_nextp  0 => next thread in the same process (no next thread here)
kt_prevp  0 => previous thread in the same process (no previous thread here)
kt_wchan  0xe0000001347d64a8 => sleep wait channel pointer
kt_stat  TSSLEEP => thread status. Not much to do !
kt_cpu  0 => Charge the running thread with cpu time accumulated
kt_spu  0 => the spu number of the run queue that the thread is currently on
kt_spu_wanted  0 => the spu it would like to be on (a context switch will happen if kt_spu  <> kt_spu_wanted)
kt_schedpolicy  0x2 => scheduling policy for the thread
kt_ticksleft   0xa => number of ticks left before a voluntary timeslice will be requested

kt_usrpri  0x2b2
 and  kt_pri  0x29a

are equal while the thread is running in the timeshare scheduling mode. As the thread is sleeping on priority kt_pri, kt_usrpri contains the re-calculated priority. On wakeup, the value of kt_usrpri is copied to kt_pri.




November 8, 2007

Can an explain plan on a SELECT self deadlock? Yes it can … but it is a bug

Filed under: Oracle — christianbilien @ 11:49 am

I had to work on a problem last week that may be of interest to some readers of this blog: the symptoms were a general slowness, and some sessions appeared to be hung. The initial investigation showed that many sessions were waiting on the ‘cursor: pin S wait on X’ event (from the “Oracle performance and tuning” book: “a session waits for this event when it is requesting a shared mutex pin and another session is holding an exclusive mutex pin on the same cursor object”).

A look at Metalink pointed at Bug 5485914: MUTEX REPORTED SELF DEADLOCK, or a variant of it, as several related bugs showing similar symptoms exist.

From a systemstate dump, I got a blocker (oper EXCL) and several waiters (oper GET_SHRD) on the same idn (the hash value of the cursor).


grep Mutex mysid1_ora_19352.trc | grep EXCL
Mutex 48b914ac8(2811, 0) idn d376b3ad oper EXCL

grep Mutex mysid1_ora_19352.trc | grep GET_SHRD
Mutex 48b914ac8(2811, 0) idn d376b3ad oper GET_SHRD
Mutex 48b914ac8(2811, 0) idn d376b3ad oper GET_SHRD
Mutex 48b914ac8(2811, 0) idn d376b3ad oper GET_SHRD
Mutex 48b914ac8(2811, 0) idn d376b3ad oper GET_SHRD
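Counting the waiters per mutex idn can be scripted directly against the trace file. A sketch (count_shrd_waiters is a hypothetical helper, applied to the same trace as above):

```shell
# Count the shared-mode waiters on each mutex idn in a systemstate trace.
# Input lines look like:
#   Mutex 48b914ac8(2811, 0) idn d376b3ad oper GET_SHRD
count_shrd_waiters() {
  grep 'oper GET_SHRD' "$1" |
  awk '{ for (i = 1; i < NF; i++) if ($i == "idn") n[$(i+1)]++ }
       END { for (h in n) print h, n[h] }'
}

count_shrd_waiters mysid1_ora_19352.trc
```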


(The number of lines containing “GET_SHRD” matched the number of sessions waiting on ‘cursor: pin S wait on X’.)
Let’s expand the systemstate dump to see if we can get more information. The raw trace shows:



437231 kkkk-dddd-llll=0000-0001-0001 lock=N pin=X latch#=5 hpc=0002 hlc=0002
437232 lwt=512358c50[512358c50,512358c50] ltm=512358c60[512358c60,512358c60]
437233 pwt=512358c18[512358c18,512358c18] ptm=512358c28[512358c28,512358c28]
437234 ref=512358c80[4858d43b8,4858d43b8] lnd=512358c98[512358c98,512358c98]

437235 LIBRARY OBJECT: object=4ac8ff848
437236 type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
437237 DEPENDENCIES: count=4 size=16
437238 ACCESSES: count=3 size=16
437240 data# heap pointer status pins change whr
437241 ----- -------- -------- --------- ---- ------ ---
437242 0 429db5ae0 4ac8ff960 I/P/A/-/- 0 NONE 00
437243 6 48b914bc0 4156e8b08 I/P/A/-/E 0 NONE 00
437244 ----------------------------------------
437245 KGX Atomic Operation Log 5124e0970

437246 Mutex 48b914ac8(2811, 0) idn d376b3ad oper EXCL
437247 Cursor Pin uid 2811 efd 0 whr 1 slp 0
437248 opr=3 pso=3fa9b9158 flg=0
437249 pcs=48b914ac8 nxt=0 flg=35 cld=0 hd=512358ba8 par=4b6560520
437250 ct=0 hsh=0 unp=0 unn=0 hvl=b65607f8 nhv=1 ses=539af3098
437251 hep=48b914b48 flg=80 ld=1 ob=4ac8ff848 ptr=4156e8b08 fex=4156e7e18


This is a KGX mutex, which protects library cache pins instead of the Library Cache Pin latch.

A blocked session on the same cursor:


kkkk-dddd-llll=0000-0001-0001 lock=N pin=X latch#=5 hpc=0002 hlc=0002
lwt=512358c50[512358c50,512358c50] ltm=512358c60[512358c60,512358c60]
pwt=512358c18[512358c18,512358c18] ptm=512358c28[512358c28,512358c28]
ref=512358c80[4858d43b8,4858d43b8] lnd=512358c98[512358c98,512358c98]
LIBRARY OBJECT: object=4ac8ff848
type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
DEPENDENCIES: count=4 size=16
ACCESSES: count=3 size=16
data# heap pointer status pins change whr
----- -------- -------- --------- ---- ------ ---
0 429db5ae0 4ac8ff960 I/P/A/-/- 0 NONE 00
6 48b914bc0 4156e8b08 I/P/A/-/E 0 NONE 00
KGX Atomic Operation Log 482acfdf0
Mutex 48b914ac8(2811, 0) idn d376b3ad oper GET_SHRD

Cursor Pin uid 2853 efd 0 whr 5 slp
opr=2 pso=3f9bcf4e8 flg=0
pcs=48b914ac8 nxt=0 flg=35 cld=0 hd=512358ba8 par=4b6560520
ct=0 hsh=0 unp=0 unn=0 hvl=b65607f8 nhv=1 ses=539af3098
hep=48b914b48 flg=80 ld=1 ob=4ac8ff848 ptr=4156e8b08 fex=4156e7e18

A little bit further down the file:

SO: 539b0dc78, type: 4, owner: 53a239108, flag: INIT/-/-/0x00
(session) sid: 2853 trans: 52da63ae8, creator: 53a239108, flag: (41) USR/- BSY/-/-/-/KIL/-
DID: 0001-044C-00001916, short-term DID: 0000-0000-00000000
txn branch: 52b0cfb98
oct: 3, prv: 0, sql: 4bcc7a680, psql: 482bef848, user: 360/MYSIDDBMAINT
O/S info: user: apache, term: , ospid: 6782, machine: dwhsrv-db02
program: oracleEMDWH1@dwhsrv-db02 (TNS V1-V3)
application name: oracleEMDWH1@dwhsrv-db02 (TNS V1-V3), hash value=0
waiting for ‘cursor: pin S wait on X’ blocking sess=0x0 seq=30138 wait_time=0 seconds since wait started=0
value=afb00000000, where|sleeps=503c72f15
Dumping Session Wait History
for ‘cursor: pin S wait on X’ count=1 wait_time=19214
, value=afb00000000, where|sleeps=503c72f02
for ‘cursor: pin S wait on X’ count=1 wait_time=9996
, value=afb00000000, where|sleeps=503c72ef0
for ‘cursor: pin S wait on X’ count=1 wait_time=19143
, value=afb00000000, where|sleeps=503c72edf


Access to object 4ac8ff848 is obviously causing trouble. One easy way to get a more user-friendly view of the systemstate dump is to download LTOM (note 352363.1 on Metalink): the script ass109.awk is found in its src directory.

It does not seem to work with awk:

awk -f ass109.awk mysid1_ora_19352.trc
awk: syntax error near line 113
awk: bailing out near line 113

but it does with nawk:

nawk -f ass109.awk mysid1_ora_19352.trc > ass109.out

From ass109.out:


[…] Below is a summary of the waited upon resources, together with the holder of that resource.

o A process id of ‘???’ implies that the holder was not found in the systemstate.
Resource Holder State
LOCK: handle=4f7be3a98 1020: last wait for ‘library cache lock’
Enqueue TT-00000006-00000000 ??? Blocker
Object Names
LOCK: handle=4f7be3a98 TABL:HOLLAND.EQUITY
Enqueue TT-00000006-00000000
151:waiting for ‘library cache lock’ [LOCK: handle=4f7be3a98] wait
791:waiting for ‘library cache lock’ [LOCK: handle=4f7be3a98] wait
1076:waiting for ‘library cache lock’ [LOCK: handle=4f7be3a98] wait


As you can see above, the blocker cannot be found by ass109.awk, but we can trace the waiting sessions to the object HOLLAND.EQUITY.
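For a quick triage when ass109.awk is not at hand, the “waiting for” lines can be grouped by lock handle directly. A rough, hypothetical Python substitute for that part of the script:

```python
import re
from collections import defaultdict

# Group "<pid>:waiting for '<event>' [LOCK: handle=<h>]" lines by handle.
# The pattern is a best-effort guess at the ass109-style summary lines.
wait_re = re.compile(r"(\d+):waiting for '([^']+)'\s+\[LOCK: handle=(\w+)\]")

def waiters_by_handle(lines):
    waiters = defaultdict(list)
    for line in lines:
        m = wait_re.search(line)
        if m:
            pid, event, handle = m.groups()
            waiters[(event, handle)].append(int(pid))
    return dict(waiters)

sample = [
    "151:waiting for 'library cache lock' [LOCK: handle=4f7be3a98] wait",
    "791:waiting for 'library cache lock' [LOCK: handle=4f7be3a98] wait",
]
# both process states end up grouped under the same lock handle
print(waiters_by_handle(sample))
```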

sqlplus> desc HOLLAND.EQUITY

is indeed hung.

I traced the offending call on this table to an explain plan of a SELECT, which will be duly submitted to Oracle support. One last word: I had to kill the session that issued it both within Oracle and at the system level (kill -9). I reproduced the problem and tried the other way round (killing the process first), but the session still had to be killed within Oracle.

I googled for my problem, which seems to fall into the same category as a case Jonathan Lewis also picked up; there is however in my (reproducible) case no call to dbms_stats. As some of the related bugs refer to sql_trace, I might be closer to them.

How to get a systemstate dump:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug unlimit
Statement processed.
SQL> oradebug dump systemstate 10
Statement processed.
SQL> oradebug tracefile_name





September 28, 2007

Is dbms_stats.set_xxx_stats insane tuning? (part 2/2)

Filed under: Oracle — christianbilien @ 1:37 pm

This is the second post about using dbms_stats.set_xxx_stats (and feeling guilty about it). In the first post, I gave some straightforward cost calculations for an oversimplified (yet real) example of an index scan and the underlying table scan, as a preparatory step.

select * from mytable d where status_code=’AAAA’;

status_code is indexed by MYTABLEI000003; no histograms were calculated. The “adjusted dbf_mbrc” (obtained by dividing the full table scan cost by the number of blocks below the high water mark) is 6.59. This is all detailed here.

Things I did not emphasize in this case :

The number of distinct values of status_code is 6 and never varies, whatever the number of rows in the table. The selectivity of status_code will always be 1/6, which is quite high (bad selectivity).

The distribution is very unequal: 90% of the rows have the same status code. Bind peeking is enabled, but the select calls do not necessarily hit the low cardinality values.

The table scan cost is now BETTER (lower) than the index scan cost.

optimizer_index_cost_adj is set to 50 (the final index cost is the index cost calculation explained in the first post, divided by 2).


The index cost calculation for MYTABLEI000003 follows this formula:

Cost=blevel + ceiling(leaf_blocks * effective index selectivity) + ceiling (clustering factor * effective table selectivity)

which can be approximated to:

cost=(leaf blocks + clustering factor)/6
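Plugging some numbers into the two formulas shows how little it takes to flip the plan. In this sketch, every figure except the selectivity (1/6), optimizer_index_cost_adj (50) and the adjusted dbf_mbrc (6.59) is a made-up placeholder, not a statistic from the actual database:

```python
from math import ceil

# blevel, leaf_blocks, clustering_factor and hwm_blocks below are
# hypothetical placeholders; only the formulas come from the post.
def index_cost(blevel, leaf_blocks, clustering_factor,
               selectivity=1 / 6, oica=50):
    raw = (blevel
           + ceil(leaf_blocks * selectivity)
           + ceil(clustering_factor * selectivity))
    return raw * oica / 100   # optimizer_index_cost_adj scales the raw cost

def fts_cost(hwm_blocks, adjusted_dbf_mbrc=6.59):
    return hwm_blocks / adjusted_dbf_mbrc

ix = index_cost(blevel=2, leaf_blocks=5000, clustering_factor=60000)
ft = fts_cost(hwm_blocks=35000)
# with these placeholder statistics the FTS cost dips just below the
# (halved) index cost, which is the situation described in the post
print(round(ix), round(ft))
```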

I went back to old statistics by using various copies of the database made from production to development or user acceptance environments, to get a picture of what the statistics were in the past. I then plotted the index vs. full table scan costs as a function of the number of blocks below the high water mark:



Here is a zoom of the last two points:


The FTS cost was slightly above the index cost in the past, but it is now under it. This is unfortunate as the actual FTS execution time is here always much higher than when the index is chosen.

So what can be done?

Lowering optimizer_index_cost_adj: the whole instance would be impacted.

Creating an outline: I oversimplified the case by presenting just one select when there are actually many calls derived from it, but in any case the index scan is always the right option.

Hints and code changes: this is a vendor package, so the code cannot be amended.

This is where the use of dbms_stats.set_column_stats or dbms_stats.set_index_stats seems to me to be a valid option. True, tweaking the statistics goes against common sense, and may be seen as insane by some. However we are in a blatant case where:

  1. a slight variation in the statistics calculation causes plans to go from good to (very) bad
  2. the modified statistics will only impact a known perimeter
  3. we do not have any other satisfactory solution

A last question remains: because of the relative weights involved in the index cost calculation, it is only practicable to make the index more attractive when

Clustering Factor x selectivity

is lowered.

Of the two factors, I prefer to modify the clustering factor, because the optimizer calculates the selectivity either from the density or from num_distinct, depending upon the presence of histograms. One last word: the clustering factor you set will of course be wiped out the next time statistics are calculated on this index.
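Since the approximated index cost is (leaf blocks + clustering factor)/6, halved by optimizer_index_cost_adj=50, one can solve for the largest clustering factor that still keeps the index cheaper than the full scan before calling dbms_stats. A sketch, again with hypothetical numbers:

```python
from math import floor

# Solve (leaf_blocks + clstfct) * selectivity * oica/100 < fts_cost for
# clstfct; the leaf block count and FTS cost below are placeholders.
def max_clstfct(leaf_blocks, fts_cost, selectivity=1 / 6, oica=50):
    budget = fts_cost / (selectivity * oica / 100)
    # step one unit inside the boundary to keep the inequality strict
    return floor(budget - leaf_blocks) - 1

# any clustering factor set below this value keeps the index plan cheaper
# than a full scan costed at 5311:
print(max_clstfct(leaf_blocks=5000, fts_cost=5311))
```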


Modify the index clustering factor:

begin
  dbms_stats.set_index_stats(
    ownname => 'MYOWNER',
    indname => 'MYTABLEI000003',
    clstfct => 100000);
end;
/



Modify the column density:

begin
  dbms_stats.set_column_stats(
    ownname => 'MYOWNER',
    tabname => 'MYTABLE',
    colname => 'STATUS_CODE',
    density => 0.0001);
end;
/





