Christian Bilien’s Oracle performance and tuning blog

March 13, 2007

Excerpts from my library

Filed under: Misc — christianbilien @ 9:33 am

 

Instead of posting a long and tedious list of performance books, I chose to extract a personal and highly subjective list of the books that really taught me a lot. I have read other performance books I am not so fond of; they are not mentioned here.

  • Modeling and capacity planning fundamentals

Capacity Planning for Web Performance: Metrics, Models, and Methods by Daniel Menascé and Virgilio A. F. Almeida

 


 

Scaling for E-Business: Technologies, Models, Performance, and Capacity Planning by Daniel Menascé and Virgilio A. F. Almeida


Performance by Design: Computer Capacity Planning by Example by Daniel Menascé, Virgilio A. F. Almeida and Lawrence W. Dowdy


  • Oracle performance

 

Pro Oracle Database 10g RAC on Linux: Installation, Administration and Performance by Julian Dyke and Steve Shaw


Oracle Wait Interface: A Practical Guide to Performance Diagnostics & Tuning by Richmond Shee, Kirtikumar Deshpande, and K. Gopalakrishnan


Cost-based Oracle Fundamentals by Jonathan Lewis

 


 

Expert Oracle Database Architecture: 9i and 10g Programming Techniques and Solutions by Thomas Kyte


Oracle Insights: Tales of the Oak Table by Mogens Norgaard, Dave Ensor, and Anjo Kolk


 

Optimizing Oracle Performance by Cary Millsap and Jeff Holt


 

  • Solaris

Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture by Richard McDougall and Jim Mauro

 


Solaris Performance and Tools: DTrace and MDB Techniques for Solaris 10 and OpenSolaris by Richard McDougall, Jim Mauro, and Brendan Gregg


  • HP-UX

HP-UX 11i Tuning and Performance by Robert F. Sauers, Chris P. Ruemmler, and Peter S. Weygant


HP-UX 11i Internals by Chris Cooper and Chris Moore




March 12, 2007

High CPU usage during parse on 9.2.0.6: bitmap transformations to blame, improved by setting _B_TREE_BITMAP_PLANS to FALSE

Filed under: Oracle — christianbilien @ 9:59 am

Here is the CPU profile of the parse phase, on a SPARC IV, for a number of SELECTs that come out of the Siebel (now bought by Oracle Corp.) Customer Relationship Management application (version 7.7.2.2 on Oracle 9.2.0.6):

PARSE #1:c=2630000,e=3156934

One of the problems this application has is its high number of hard parses and, as seen above, their cost (c is the CPU time and e the elapsed time, both in microseconds in 9i trace files, i.e. 2.6 s of CPU for this parse alone), even several days after the DB started. This product can be seen as a SQL generator whose syntax is dynamically assembled according to the user's screen entries.
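As a side note, the hard parse volume and its CPU cost can be cross-checked at the instance level from v$sysstat (just a sketch, not taken from the system discussed here; note that "parse time cpu" and "parse time elapsed" are reported in centiseconds):

-- Instance-wide parse statistics, cumulative since startup.
select name, value
from   v$sysstat
where  name in ('parse count (total)',
                'parse count (hard)',
                'parse time cpu',
                'parse time elapsed');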

Knowing that 42 tables were joined with nested loops, I suspected a large number of optimizer permutations (the default 9.2.0.6 optimizer_max_permutations is 2000, a limit easily reached with that many tables). A 10053 trace is worth several hundred pages, but a grep shows that many of the options considered are bitmap conversions to and from rowids. I tried setting both OPTIMIZER_MAX_PERMUTATIONS to 100 and _B_TREE_BITMAP_PLANS to false:

alter session set "_b_tree_bitmap_plans"=false;

PARSE #1:cpu=1680000,elapsed=2060159 (a 35% improvement).

Changing OPTIMIZER_MAX_PERMUTATIONS to 100 did not speed up the SQL call, whatever the _B_TREE_BITMAP_PLANS value.
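For the record, this was a simple session-level experiment; here is a sketch of how it can be reproduced (the 10053 trace file is written to user_dump_dest):

alter session set "_b_tree_bitmap_plans" = false;
alter session set optimizer_max_permutations = 100;
alter session set events '10053 trace name context forever, level 1';
-- hard parse the generated SELECT here (change a literal or a comment
-- so that the cursor is not reused from the library cache)
alter session set events '10053 trace name context off';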

It seems that the optimizer uses an unusual quantity of CPU cycles to evaluate the b-tree to bitmap transformation options.

Just to make things perfectly clear, I looked in Metalink and googled "B-tree conversions", "_B_TREE_BITMAP_PLANS" and a few other similar keywords: it looks like many inefficient plans appeared in 9i because of bitmap conversions (which I translate as "when people migrated from the RBO to the CBO"). This is not what I experienced here: my problem was not the execution phase (for which, incidentally, no bitmap transformation was used), but the parse phase.

Bug 3537086 seems to be the best candidate for this problem, but it is not documented publicly enough to say so affirmatively.

What is a bitmap conversion (to and from rowids)? It is basically a transformation of b-tree rowid sets into bitmaps (hence no bitmap indexes are involved here). Using the bitmap conversion, the CBO converts each set of rowids into an array of bits, which is then used exactly as it would be in a bitmap index operation. Access to the table data is performed after the combination, using a bitmap conversion back to rowids.
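For illustration, here is the kind of plan shape this transformation produces when the CBO combines two ordinary b-tree indexes (table, column and index names are made up):

select * from orders where customer_id = :b1 and status = :b2;

-- With _b_tree_bitmap_plans enabled, the CBO may combine both b-tree indexes:
-- TABLE ACCESS BY INDEX ROWID   ORDERS
--   BITMAP CONVERSION TO ROWIDS
--     BITMAP AND
--       BITMAP CONVERSION FROM ROWIDS
--         INDEX RANGE SCAN      ORDERS_CUSTOMER_IDX
--       BITMAP CONVERSION FROM ROWIDS
--         INDEX RANGE SCAN      ORDERS_STATUS_IDX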

March 6, 2007

Why you should not underestimate the LIO impact on CPU load

Filed under: Oracle — christianbilien @ 4:51 pm

The role of Logical I/O in CPU consumption is often underestimated. Here is a practical case:

We have a poorly designed partitioning strategy which does not allow for any partition pruning. In other words, the queries will never contain an equality, BETWEEN or any other predicate on the range partition key that the optimizer could use for pruning. However, the number of active transactions is fairly low, 2 or 3 on average. A pretty good strategy so far had been to parallelize reads across the partitions. The downside is of course high CPU consumption on top of a huge I/O throughput.
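To make the situation concrete, here is a minimal, made-up sketch of the kind of schema and query involved (table name, partition bounds and degree of parallelism are invented):

create table sales (
  sale_date  date,
  product_id number,
  amount     number
)
partition by range (sale_date) (
  partition p2006 values less than (to_date('01-01-2007','DD-MM-YYYY')),
  partition p2007 values less than (maxvalue)
);

-- No predicate on the partition key: PARTITION RANGE ALL, no pruning,
-- so the parallel slaves have to read every partition.
select /*+ parallel(sales, 4) */ sum(amount)
from   sales
where  product_id = :p;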

So our baseline before Feb 28th is a busy system, a low run queue and pretty good response times.

[Figure: CPU load chart (cpuload.JPG)]

March 1st was a busy day: an unusual number of inserts were performed on a table which holds many indexes. The insert profile is:

 

Buffer gets: 16,739,612

Executions: 167,979

CPU Time (s): 904.05

This shows that each buffer get "costs" 54 microseconds on my SPARC IV domain (interestingly close to Cary Millsap's 53 microseconds in "Why You Should Focus on LIOs Instead of PIOs", although his article was written in 2001 or 2002 – one wonders why we are putting so many millions of euros into those huge boxes – just joking; this time is of course dependent upon the SQL call and the underlying objects).
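A sketch of how the per-LIO cost above can be derived directly from v$sql (standard columns; cpu_time is in microseconds, and the hash_value would come from a Statspack report or v$session):

select buffer_gets,
       executions,
       round(cpu_time / 1e6, 2)                   as cpu_seconds,
       round(buffer_gets / nullif(executions, 0)) as gets_per_exec,
       round(cpu_time / nullif(buffer_gets, 0))   as cpu_us_per_lio
from   v$sql
where  hash_value = :hv;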

 

So here we are, 100,000 inserts will cost approximately:

100,000 inserts x 100 gets per insert x 50 microseconds per get = 500 s of CPU; 200,000 inserts will cost 1,000 s, and so forth.

By the way, what is an Oracle LIO, and why does it cost 50 microseconds when a memory access takes roughly 100 nanoseconds?

Quote from O'Reilly's "Optimizing Oracle Performance":

 

An operation in which the Oracle kernel obtains and processes the content of an Oracle block from the Oracle database buffer cache. The code path for an Oracle LIO includes instructions to determine whether the desired block exists in the buffer cache, to update internal data structures such as a buffer cache hash chain and an LRU chain, to pin the block, and to decompose and filter the content of the retrieved block. Oracle LIO operations occur in two modes: consistent and current. In consistent mode, a block may be copied (or cloned) and the clone modified to represent the block at a given point in history. In current mode, a block is simply obtained from the cache “as-is.”

 

March 5, 2007

HP-UX Processor Load Balancing on SMPs (2/2)

Filed under: HP-UX — christianbilien @ 3:18 pm

The first post I wrote on CPU scheduling described the circumstances in which thread stealing would be considered. This post goes further down the road: once there is enough CPU idleness, or all CPUs have starving threads, threads may be moved from one processor's run queue to another's. It describes the rules enforced by the HP-UX scheduler in versions 11.11 and later.

Before digging into the subject, locality domains (LDOMs) should be explained. A locality domain is basically a cell. A cell is made of four single- or dual-core processors, as well as its own memory. The reason locality domains exist is the inter-cell bus latency, which may greatly increase memory access time. I'll write a post one day about HP-UX partitioning that will go a bit more into detail.

  • A mundane_balance() iteration is run within each LDOM. Each processor is assigned a score based on load average AND starvation (remember from post 1 that starvation occurs when a thread assigned to a given processor has not been running for "a long time"). According to the HP system internals course, starvation is given more importance than load average, which makes sense, as a CPU hog will be able to run 80 to 100 ms before giving up the processor to another thread. In any case, an idle processor is always one of the "best" processors.
  • In 11.22, the locality domain balancer is called to potentially move a thread from one domain to another.

The outcome is a pair of "best" and "worst" processors. If both are lightly loaded, with a load average of less than 0.2, the system is considered well balanced and nothing is done. Otherwise a thread running on the "worst" processor is selected (it must not be a real-time or locked thread), removed from that processor's run queue and inserted into the "best" processor's run queue.

How is this “next” thread selected?

The selection is based on the virtual address of the kthread structure: the purpose of the algorithm is to ensure that each thread is eventually cycled through.

March 4, 2007

HP-UX Processor Load Balancing on SMPs (1/2)

Filed under: HP-UX — christianbilien @ 9:33 pm

Processor cache hits (data and instruction) are extremely important performance-wise. The TLB (translation lookaside buffer) is one of the most important CPU components as far as performance is concerned: it is a cache for the virtual-to-physical address translation process (the HP-UX Page Directory). As often shown, CPU accounts for a large part of SQL call response times, most of which is memory access time. As threads move from one processor to another, cache lines must be invalidated (if processor N°2 has to update a line loaded in processor N°1's data cache), or at best reloaded when read access is required by both processors. I'll write a post in the future about the ins and outs of processor affinity. For now, I am interested in the rules that govern CPU switches, and in understanding what triggers them.

I'll only consider HP-UX 11.11 and later versions:

The routine that does the load balancing is named mundane_balance(). This routine schedules itself into the timeout mechanism to be awakened once a second. It runs as an interrupt service routine rather than within the context of another process: thus it cannot be interrupted by some other event, nor can it be starved by real-time threads (it was called from stat_daemon() before 11.11).

A processor is in a state of starvation if it has one or more threads on its run queue that haven't executed for a long time (this "long time" varies with the CPU load). Only if no processor suffers from starvation, or all processors have starving threads (or could be forced into that condition), does HP-UX consider looking for balancing (see the next post: HP-UX Processor Load Balancing on SMPs (2/2)).
