Christian Bilien’s Oracle performance and tuning blog

March 31, 2007

No age scheduling for Oracle on HP-UX: HPUX_SCHED_NOAGE (1/2)

Filed under: HP-UX,Oracle — christianbilien @ 9:00 pm

The HPUX_SCHED_NOAGE initialization parameter has been around for some time, but in most of the situations where I found it would be useful, sys admins and DBAs alike were a bit scared (to say the least) to use it. However, I cannot see a real threat in using this parameter: some degradation may occur if it is set inappropriately, but not the outages and side effects associated with real time. Here are a couple of notes to explain this parameter's usefulness. This first post presents time-share generalities; the second one will be more specific about the parameter itself.

HP-UX scheduling policies can be classified using different criteria. One of them is priority range. Basically, processes can belong to one of three priority ranges: rtprio and rtsched are real-time, while HP-UX time share may or may not behave like real time. I am only interested here in the latter: HPUX_SCHED_NOAGE only interacts with the time-share mechanism, not with the POSIX (rtsched) or rtprio (the oldest HP-UX real-time system) schedulers.

Time-share scheduling

This is the default HP-UX scheduler: its priority range goes from 128 (highest) to 255 (lowest). It is itself divided between system (128-177) and user (178-255) priorities. I'll start by explaining what happens to processes when HPUX_SCHED_NOAGE is not set, in order to build up the case for using it.

1. Normal time-share scheduling (HPUX_SCHED_NOAGE unset)

Each CPU has a run queue, also sometimes designated as the dispatch queue (each queue length can be seen with sar -qM). Let's start with a uniprocessor: processes using the normal time-share scheduling will experience priority changes over time.

CPU time is accumulated for the running thread (interestingly, this can lead to "phantom" threads using less than 10ms (a tick) of CPU time per transaction, for which the probability of priority decay is low). Every 40ms (4 ticks), the priority of the running thread is lowered. The priority degradation is a function of the CPU time accumulated during those 40ms (I do not know the exact formula, but it is linear). Finally, a thread which has been running for TIMESLICE (default 10) x 10ms will be forced out (this is a forced context switch, seen in Glance). TIMESLICE is a changeable HP-UX kernel parameter, so with the default value a forced context switch occurs after a 100ms quantum. The thread waiting in the run queue with the highest priority will then be dispatched. Threads waiting (for a semaphore, an I/O, or any other reason) regain their priority exponentially (I do not know that formula either).
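As a quick way of observing this in practice (a minimal sketch, assuming an 11i v2 system where kctune is available; use kmtune on older releases), the timeslice kernel parameter and the per-processor run queues can be checked with:

# kctune timeslice ==> current TIMESLICE value, in 10ms ticks (default 10, i.e. a 100ms quantum)
# sar -qM 5 12 ==> per-processor run queue lengths, 12 samples of 5 seconds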

Some known problems:

  • The scheduler assumes that all threads are of equal importance: this can be corrected by using nice(1) to slow down or accelerate the priority change when the process has no thread running.
  • CPU-hungry transactions tend to get bogged down at low priorities: the more CPU they need, the less they get.

[Figure: ts.gif – time-share scheduling]


What happens with SMPs? I covered the load balancing algorithm in https://christianbilien.wordpress.com/2007/03/04/hp-ux-processor-load-balancing-on-smps and https://christianbilien.wordpress.com/2007/03/05/hp-ux-processor-load-balancing-on-smps-22.

Please read the next post which will be more specific about HPUX_SCHED_NOAGE.


March 28, 2007

Log buffer issues in 10g (2/2)

Filed under: Oracle — christianbilien @ 7:29 pm

 

The first post on this topic was a quick recap of log buffer generalities. This second one addresses what's new in 10gR2.

Part 2: Destaging the log buffer

First point: the log buffer is NOT tunable anymore in 10gR2 (whatever the LOG_BUFFER initialization parameter is set to)

From Bug 4592994:

In 10G R2, Oracle combines the fixed SGA area and redo buffer [log buffer] together. If there is free space after Oracle puts the combined buffers into a granule, that space is added to the redo buffer. Thus you see the redo buffer has more space than expected. This is expected behavior.

In 10.2 the log buffer is rounded up to use the rest of the granule. The granule size can be found from the hidden parameter “_ksmg_granule_size”. The log buffer size and granule size can be read from v$sgainfo:

SQL> select * from v$sgainfo;

NAME                                  BYTES RES
-------------------------------- ---------- ---
Fixed SGA Size                      2033832 No
Redo Buffers                       14737408 No
Buffer Cache Size                3405774848 Yes
Shared Pool Size                  520093696 Yes
Large Pool Size                    16777216 Yes
Java Pool Size                     33554432 Yes
Streams Pool Size                         0 Yes
Granule Size                       16777216 No
Maximum SGA Size                 3992977408 No
Startup overhead in Shared Pool   234881024 No

 

The log buffer can also be read from:

SQL> select TOTAL_BUFS_KCRFA from x$kcrfstrand;

or from the LGWR trace file (this one taken from a different instance):

ksfqxc:ctx=0x79455cf0 flags=0x30000000 dev=0x0 lbufsiz=1048576 bp=0x6497928

As a matter of fact, 10 or 16MB are huge values. Given the LGWR write rules (see the first post), LGWR will only begin performing background writes when the 1MB threshold is reached, if no user transaction commits in the meantime. Long "log file parallel write" waits are bound to appear.

This circumstance often appears in PL/SQL code containing COMMIT WORK: PL/SQL actually does commit, but does not wait for the commit to complete. In other words, a COMMIT WAIT within PL/SQL is actually a COMMIT NOWAIT; the wait is only performed when the PL/SQL block or procedure returns to the calling program. This is where _LOG_IO_SIZE shows its usefulness: _LOG_IO_SIZE acts as a divisor of the log buffer size, and LGWR is triggered to write whenever the size of the pending redo entries exceeds (log buffer size divided by _LOG_IO_SIZE). The LGWR write threshold defaults in 9i to 1MB or 1/3 of the log buffer, whichever is less. kcrfswth, from "oradebug call kcrfw_dump_interesting_data", is the threshold: it is expressed in OS blocks, so the current _LOG_IO_SIZE can be derived from kcrfswth.

LEBSZ from x$kccle is the operating system block size (512 bytes here).

SQL> oradebug setospid 3868;

Oracle pid: 11, Unix process pid: 3868, image: oracle@mymachine (LGWR)

SQL> oradebug unlimit

Statement processed.

SQL> oradebug call kcrfw_dump_interesting_data

SQL> oradebug tracefile_name

/oracle/10.2.0/admin/MYSID/bdump/eptdb1_lgwr_3868.trc

SQL> !grep kcrfswth /oracle/10.2.0/admin/MYSID/bdump/eptdb1_lgwr_3868.trc

kcrfswth = 2048

SQL> select max(LEBSZ) from x$kccle;

MAX(LEBSZ)
----------
       512

SQL> select value from v$parameter where name = 'log_buffer';

VALUE
--------------------------------------------------------------------------------
14289920

SQL> select 14289920/512/2048 from dual;

14289920/512/2048
-----------------
       13.6279297

The threshold will be about 1/13th of the log buffer size (roughly 1MB here).

Log buffer issues in 10g (1/2)

Filed under: Oracle — christianbilien @ 7:19 pm

Traditional log buffer tuning has changed in 10gR2, which raises some problems for highly active databases. This first post is a quick recap of log buffer generalities; the second will focus on the 10gR2 changes and suggest some log buffer management tips.

Part 1: log buffer generalities

An undersized log buffer is bound to cause sessions to fight for space in the log buffer. Those waits show up under the "log buffer space" event (Configuration wait class in 10g). An oversized log buffer causes less known issues: infrequent commits database-wide allow redo entries to pile up in the log buffer, causing high volumes of LGWR I/O at commit time, longer "log file sync" waits experienced by foreground processes and longer "log file parallel write" waits for the LGWR. Remember that four cases (and not three, as often written: there is one additional known case that is RAC specific) trigger an LGWR write:

  • A transaction commits or rolls back
  • Every 3 seconds
  • When the log buffer is 1/3 full or the total of pending redo entries reaches 1MB (the default _LOG_IO_SIZE threshold), whichever occurs first.
  • A less documented case occurs on RAC infrastructures when a dirty block has to be read by another instance: the redo entries associated with the block must be flushed prior to the transfer. This is called write-ahead logging.

To wrap up the case for log buffer tuning, a couple of latches must be reckoned with: the redo writing and redo copy latches, which may cause sleeps (sessions waiting) when the LGWR is overactive.
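A minimal sketch of how to check for these symptoms (standard dictionary views; the event and latch names are those exposed by 9i/10g):

SQL> select event, total_waits, time_waited
  2  from v$system_event
  3  where event in ('log buffer space','log file sync','log file parallel write');

SQL> select name, gets, misses, sleeps
  2  from v$latch
  3  where name in ('redo allocation','redo copy','redo writing');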

 

March 25, 2007

HP-UX vpar memory: granule size matters

Filed under: HP-UX — christianbilien @ 6:55 pm

Memory is normally assigned to vPars in units called granules (although the vparcreate/vparmodify commands specify memory in multiples of 1MB, the vPar monitor will round up to the next multiple of the granule size). As the granule size is specified when the vPar database is created and cannot be changed without recreating the virtual partition database, care must be taken to choose an appropriate granule size when the first vPar is created. Since this is a fairly complex subject, I thought the rules deserved a note.

PA-RISC

Each vPar will require one ILM granule below 2GB to load its kernel. vpmon uses one granule below 2GB. Therefore (2GB ÷ granule size) - 1 = maximum number of vPars. For example, 7 is the maximum number of vPars for an ILM granule size of 256MB (2GB ÷ 256MB - 1 = 7).

Integrity (Itanium)

There is a platform dependent maximum to the number of granules of CLM/cell and of ILM per nPar. These values can be displayed using the vparenv command. Remember that Memory Size ÷ Granule size <= maximum # of granules.

 

Example:

# vparenv

vparenv: The next boot mode setting is “vPars”.
vparenv: The ILM granule size setting is 128.
vparenv: The CLM granule size setting is 128.
vparenv: Note: Any changes in the above settings will become effective only after the next system reboot.
vparenv: Note: The maximum possible CLM granules per cell is 512.
vparenv: Note: The maximum possible ILM granules for this system is 1024

Given the values above, the total amount of CLM per cell must be less than 64GB (512 * 128MB) and the total amount of ILM in the nPar must be less than 128GB (1024 * 128MB).

Matching firmware and vpar granule size (Integrity only)

On Integrity systems the memory is divided into granules by the firmware. It is critical that the firmware value for the granule size matches the size in the vPars database. You can examine and modify the firmware setting using the vparenv command. For PA-RISC systems the memory is divided into granules by the monitor and there is no firmware setting. You can ensure the firmware is updated with the same size as the database by specifying the y option: vparcreate -g ilm:Mbytes:y -g clm:Mbytes:y. I am not sure what use can be made of diverging granule sizes.
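As a hypothetical illustration (the vPar name, CPU count and memory size below are made up; the 128MB granule sizes match the vparenv output above), the y suffix pushes the granule sizes to the firmware when the vPar database is created:

# vparcreate -p vpar1 -a cpu::2 -a mem::4096 -g ilm:128:y -g clm:128:y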

Memory partitioning strategy: avoiding design traps on high end HP-UX systems. CLM and ILM (2/2)

Filed under: HP-UX — christianbilien @ 6:12 pm

 

CLM and ILM

As seen in the first post on this topic, since HP-UX 11i v2, and only when cells are dual-core capable (PA-RISC or Itanium 2), it is possible to identify memory on a cell or across an nPar as non-interleaved. This is called Cell-Local Memory, or CLM. CLM can be configured as a quantity or percentage of an individual cell's memory, or as a quantity or percentage of the memory across the entire nPar. Interleaved memory (ILM) is used when a portion of memory is taken from the cells of the system and mixed together in a round-robin fashion. With processors on various cells accessing interleaved memory, the average access time will be uniform. In 11i v1 all memory is designated as ILM.

The designation of memory as ILM vs. CLM is done at the nPar level (parcreate or parmodify). You can then allocate it to one or more of your vPars (vparcreate or vparmodify).

Cell-local memory (CLM) can still be accessed by any processor, but processors on the same cell will have the lowest access latency. Access by processors in other cells will have higher latencies. It is always better to use ILM than to access CLM configured on another cell. Note that CLM can also be used to handle the case where there is an uneven amount of memory in the cells: the delta would be configured as CLM.

Psets

CLM and Processor Sets (Psets) can be used together to avoid the inconsistencies of ccNUMA almost entirely. In this context, a locality domain (ldom) is defined as the CPUs and memory required to run a thread. A Pset is a logical grouping of CPUs, a CPU partition so to speak. Oracle processes bound to a given Pset get thread run time only on the CPUs assigned to that Pset. ccNUMA effects are eliminated because the data and CPUs are on the same cell or ldom.
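A minimal sketch (the processor numbers, Pset id and process id are purely illustrative; check psrset(1M) for the exact options available on your release):

# psrset -c 2 3 ==> create a new Pset containing processors 2 and 3
# psrset -b 1 12345 ==> bind process 12345 (e.g. an Oracle shadow process) to Pset 1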

Memory partitioning strategy: avoiding design traps on high end HP-UX systems (1/2)

Filed under: HP-UX — christianbilien @ 5:41 pm

I already mentioned how important logical I/O is (see "Why you should not underestimate the LIO impact on CPU load"), knowing that most database systems need much more CPU to access memory than to execute actual code.

Like most high end servers, HP-UX servers use cells (domains in the Sun Solaris world), where CPU access to local memory is much faster than to pages outside the cell's memory scope. This is the behaviour known as Cache Coherent Non-Uniform Memory Access, or ccNUMA.

To reduce wait time in the run queues of busy CPUs (see "HP-UX processor load balancing on SMPs"), the system scheduler can decide to move threads to other CPUs on any cell in the same nPar; data in interleaved memory can be spread among different cells; therefore, a thread has about the same chance of having its CPU and data on the same cell as on different cells. Different threads of the same process can have different memory reference times to the same portion of a data object, and different parts of a data object can have different memory reference times for the same thread.

Starting in HP-UX 11i v2, memory on a cell or across an entire nPar can be identified as interleaved (the default) or cell-local (non-interleaved). Both can be identified as quantities or percentages at nPar creation time or after creation with a modification and reboot.

Crossbar latency is really what ccNUMA is about on HP servers. When a CPU and memory are on the same cell, crossbar latency is null. Crossbar latency is at its lowest when the CPU and the memory being accessed are on different cells that share the same crossbar port. There is additional latency between cells in the same quad but different cell ports. The worst case is being between cell cabinets on a Superdome.

According to HP figures, memory latency (transfer time between memory and CPU) is 185ns on an sx2000 chipset running Itanium 2 Montecito CPUs when memory access is local, or when interleaved with 4 or 8 cores on a single cell. The worst case (crossing cabinets) pushes memory latency up to a whopping 397ns (64 cores interleaved).

The second post will consider Cell-local vs Interleaved memory.

 

March 23, 2007

Storage array bottlenecks

Filed under: Storage — christianbilien @ 8:39 pm

Even if the internal communications of a switch may be “oversubscribed” when the aggregate speed of the ports exceeds the internal bandwidth of the device, many switch devices (at least the director class) are “non-blocking”, meaning that all ports can operate at full speed simultaneously. I’ll write a post one day on SAN bottlenecks, but for now here is a view of the main hardware bottlenecks encountered in storage arrays:

Host ports:

The available bandwidth is 2 or 4Gb/s (200 or 400MB/s: FC uses 8b/10b encoding, so each byte takes 10 bits on the wire) per port. As load balancing software (Powerpath, MPXIO, DMP, etc.) is most of the time used both for redundancy and load balancing, I/Os coming from a host can take advantage of the aggregated bandwidth of two ports. However, reads can use only one path, while writes are duplicated, i.e. a host write ends up as one write on each host port.
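As an illustration from the host side (assuming EMC PowerPath is the multipathing layer; these are standard powermt reporting commands, no device names are assumed):

# powermt display paths ==> summary of the paths seen per HBA and array port
# powermt display dev=all ==> per-device path configuration and state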

Below is an example of a couple of host ports on an EMC DMX1000 (2Gb/s host ports).

[Figures: fa4a0.gif, fa13a0.gif – throughput on the two host ports]

Thanks to Powerpath, the load is well spread over the two ports. Both ports are running at about half of their bandwidth (but queueing theory shows that queues start to become non-negligible once 50% of the available bandwidth is consumed).

Array service processors

Depending on the hardware maker, service processors may either be bound to specific volumes, or any SP of the array may access any volume. I wrote a blog entry some time ago about SP binding. Higher end arrays such as the DMX and XP do not bind LUNs to SPs, whilst the Clariion and EVA do.

Back end controllers

Back-end controllers are the access points for the disk FC loops. Back-end controllers also have a given throughput, usually limited anyway by the fact that at a given point in time, the dual-ported disks within the FC-AL loop only allow 2 senders and 2 receivers. Below is a DMX1000 back-end controller utilization chart, where almost all disks are running at a minimum of 60% of their available bandwidth, with 30 RAID 10 disks in each loop.

[Figure: controller1.gif – DMX1000 back-end controller utilization]

 

Disks

From a host standpoint, disks can sustain a much higher utilization rate when they sit behind a cache than when they are accessed directly: remember that a disk running at a utilization rate of 50% will queue on average one I/O out of two (seen from the host, the I/O service time will be on average 50% higher than the disk service time). It is not uncommon to measure disk utilization rates of nearly 100%. This only becomes a problem when the array cache stops buffering the I/Os because its space is exhausted.

[Figure: disk1.gif – disk utilization]

 

 

March 22, 2007

A handy Solaris 10 command: fcinfo

Filed under: Solaris,Storage — christianbilien @ 4:14 pm

 

One of the most useful new commands I found in Solaris 10 is fcinfo, a command line interface that displays information about the HBA ports on a host, but also many useful bits of information about the connected storage: remote port WWNs, RAID type, link status, etc.

root # fcinfo hba-port -l

HBA Port WWN: 10000000c957d408 ==> Local HBA1

OS Device Name: /dev/cfg/c4

Manufacturer: Emulex

Model: LP11000-E

Type: N-port

State: online

Supported Speeds: 1Gb 2Gb 4Gb

Current Speed: 2Gb

Node WWN: 20000000c957d408

Link Error Statistics:

Link Failure Count: 0

Loss of Sync Count: 37

Loss of Signal Count: 0

Primitive Seq Protocol Error Count: 0

Invalid Tx Word Count: 32

Invalid CRC Count: 0

HBA Port WWN: 10000000c957d512==> Local HBA2

OS Device Name: /dev/cfg/c5

Manufacturer: Emulex

Model: LP11000-E

Type: N-port

State: online

Supported Speeds: 1Gb 2Gb 4Gb

Current Speed: 2Gb

Node WWN: 20000000c957d512

Link Error Statistics:

Link Failure Count: 0

Loss of Sync Count: 41

Loss of Signal Count: 0

Primitive Seq Protocol Error Count: 0

Invalid Tx Word Count: 32

Invalid CRC Count: 0

/root # fcinfo remote-port -sl -p 10000000c957d512 ==> Which LUNs are seen by HBA2?

Remote Port WWN: 5006016839a0166a

Active FC4 Types: SCSI

SCSI Target: yes

Node WWN: 50060160b9a0166a

Link Error Statistics:

Link Failure Count: 1

Loss of Sync Count: 1

Loss of Signal Count: 11

Primitive Seq Protocol Error Count: 0

Invalid Tx Word Count: 510

Invalid CRC Count: 0

LUN: 0

Vendor: DGC

Product: RAID 10

OS Device Name: /dev/rdsk/c5t5006016839A0166Ad0s2

LUN: 1

Vendor: DGC

Product: RAID 5

OS Device Name: /dev/rdsk/c5t5006016839A0166Ad1s2

LUN: 2

Vendor: DGC

Product: RAID 10

OS Device Name: /dev/rdsk/c5t5006016839A0166Ad2s2

LUN: 3

Vendor: DGC

Product: RAID 10

OS Device Name: /dev/rdsk/c5t5006016839A0166Ad3s2

March 21, 2007

Storage array service processors (and controllers) performance

Filed under: Storage — christianbilien @ 9:30 pm

Depending on the hardware maker, service processors may either be bound to specific volumes, or any SP of the array may access any volume. Whilst the second category of SPs balances the incoming load, the first requires careful planning: although LUNs may be accessed through both controllers, there are performance differences depending on which path is used. Within the array, each virtual disk is owned by one of the controllers, which provides the most direct path to the virtual disk. EMC's Clariion and HP's EVA both belong to this category, while higher end arrays such as EMC's DMX and HP's XP do not require a volume-to-SP binding at configuration time.

Read I/Os to the owning controller are executed and the data is returned directly to the host. Read I/Os to the non-owning controller must be passed to the owning controller for execution. The data is read and then passed back to the non-owning controller for return to the host.

Because write I/Os always involve both controllers for cache mirroring, write performance will be the same regardless of which controller receives the I/O.

Ownership transitions

Both the Clariion and the EVA will allow explicit ownership transition from one controller to the other. However, implicit transfer may also occur, if one of the following conditions is met:

  • EVA only: the array detects a high percentage of read I/O processed by the non-owning controller.
  • Both arrays: a controller fails and the remaining controller assumes ownership of the LUNs.

 

Clariion LUN to SP port mapping:

For a given Clariion WWN, you can find the SP and the port number through which the LUN is connected:

fcinfo on a Solaris host gives for LUN 0:

LUN: 0
Vendor: DGC
Product: RAID 10
OS Device Name: /dev/rdsk/c5t5006016839A0166Ad0s2

The special file name contains the Clariion WWN. Ignore 5006016, which always appears at the beginning of a Clariion WWN, and take the 8th digit (8 here). It is a hex digit from 0 to F, where values 0 through 7 are for SP A and values 8 through F are for SP B. Values below 8 (SP A) give the port number directly, while 8 must be subtracted from values between 8 and F: 5 is SP A port 5, C is SP B port 4. The small script below applies this rule.
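A minimal sketch in plain shell (using the WWN from the example above):

WWN=5006016839A0166A
D=$(echo $WWN | cut -c8)        # 8th hex digit of the WWN ("8" here)
N=$(printf '%d' 0x$D)           # convert it to decimal
if [ $N -lt 8 ]; then
  echo "SP A, port $N"
else
  echo "SP B, port $(($N - 8))" # subtract 8 for SP B
fi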

 

March 14, 2007

SQL can execute in wrong schema (10.2)

Filed under: Off topic,Oracle — christianbilien @ 4:54 pm

Symptoms: wrong results from SELECTs and data corruption for DML.

Although this post is off-topic, I think it is worth sharing one of the nastiest Oracle bugs I ever encountered (I am not one of the Oracle pioneers, but I have nonetheless seen a number of bugs in my life). This is documented in Note 392673.1, which says it already appears in 10.1 (where the manifestations are very rare), but with a much higher chance of occurring in 10.2.

This problem can occur only if all of the following are true:

  • The SQL statement refers to some object without using a fully qualified object name
  • The unqualified object name resolves to a different underlying object for different users.
  • Sessions in different schemas execute IDENTICAL SQL statements

The note says that the problem is related to the reload of aged-out cursors, and indeed I experienced it because statistics were calculated before the above conditions were met.
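A purely hypothetical illustration of the three conditions (the schemas and the table are made up; this merely sets up the scenario, it does not reproduce the bug on demand):

SQL> connect scott/tiger
SQL> create table t (x number);
SQL> select count(*) from t; ==> unqualified name, resolves to SCOTT.T

SQL> connect hr/hr
SQL> create table t (x number);
SQL> select count(*) from t; ==> identical SQL text, should resolve to HR.T

If the shared cursor for "select count(*) from t" is aged out and reloaded with the wrong name translation, the second session may silently read (or, for DML, modify) the other schema's table.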

