Christian Bilien’s Oracle performance and tuning blog

March 14, 2007

SQL can execute in wrong schema (10.2)

Filed under: Off topic,Oracle — christianbilien @ 4:54 pm

Symptoms are: wrong results from SELECTs and data corruption from DML

Although this post is off-topic, I think it is worth sharing one of the nastiest Oracle bugs I have ever encountered (I am not one of the Oracle pioneers, but I have nonetheless seen a fair number of bugs in my life). It is documented in Note:392673.1, which says it exists in 10.1, where manifestations are very rare, but has a much higher chance of occurring in 10.2.

This problem can occur only if all of the following are true:

  • The SQL statement refers to an object without using a fully qualified object name.
  • The unqualified object name resolves to a different underlying object for different users.
  • Sessions in different schemas execute IDENTICAL SQL statements.

The note says that the problem is related to the reload of aged-out cursors, and indeed I experienced it because statistics had been gathered before the above conditions were met.
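A minimal sketch of a setup that meets all three conditions (the schema, table and column names are mine, purely illustrative):

create table a.orders (id number, status varchar2(10));   -- owned by user A
create table b.orders (id number, status varchar2(10));   -- same unqualified name, owned by user B

-- Both sessions submit the exact same unqualified text:
-- connected as A:
select status from orders where id = :1;   -- should resolve to A.ORDERS
-- connected as B:
select status from orders where id = :1;   -- should resolve to B.ORDERS

-- Under the bug, once the aged-out cursor is reloaded, one session's statement
-- can execute against the other schema's ORDERS table.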


March 12, 2007

High CPU usage during parse on a 9.2.0.6: bitmap transformations to blame, improved by setting _B_TREE_BITMAP_PLANS to FALSE

Filed under: Oracle — christianbilien @ 9:59 am

Here is the CPU profile of the parse phase on a SPARC IIII that you get from a number of SELECTs coming out of the Siebel (now bought by Oracle Corp.) Customer Relationship Management application (version 7.7.2.2 on Oracle 9.2.0.6):

PARSE #1:c=2630000,e=3156934

One of this application's problems is the high number of hard parses and, as seen above, their cost, even several days after the database was started. The product can be seen as a SQL generator whose syntax is dynamically assembled from the user's screen entries.

Knowing that 42 tables were joined with nested loops, I suspected a large number of optimizer permutations (the default 9.2.0.6 optimizer_max_permutations is 2000, a limit which is easily reached). A 10053 trace is worth several hundred pages, but a grep shows that many of the options considered are bitmap conversions to and from rowids. I tried setting both OPTIMIZER_MAX_PERMUTATIONS to 100 and _B_TREE_BITMAP_PLANS to false:

alter session set "_b_tree_bitmap_plans"=false;

PARSE #1:cpu=1680000,elapsed=2060159 (a 35% improvement).

Changing OPTIMIZER_MAX_PERMUTATIONS to 100 did not speed up the SQL call, whatever the _B_TREE_BITMAP_PLANS value.

It seems that the optimizer uses an unusual quantity of CPU cycles to evaluate the b-tree to bitmap transformation options.

Just to make things perfectly clear, I looked in Metalink and googled "B-tree conversions", "_B_TREE_BITMAP_PLANS" and a few other similar keywords: it looks like many inefficient plans appeared in 9i because of bitmap conversions (which I translate as "when people migrated from the RBO to the CBO"). This is not what I experienced here: my problem was not the execution phase (for which, incidentally, no bitmap transformation was used), but the parse phase.

Bug 3537086 seems to be the best candidate for this problem, but its public documentation is too thin to be affirmative.

What is a bitmap conversion (to and from ROWIDs)? It is basically a transformation of b-tree rowids into bitmaps (so no bitmap indexes are involved here). Using the bitmap conversion, the CBO converts each set of rowids into an array of bits, which is then used exactly as it would be in a bitmap index operation. Access to the table data is performed after the join, using a bitmap conversion back to rowids.
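As an illustration of what the CBO is costing (the table, column and index names below are made up), a statement with two single-column b-tree indexes can be evaluated as a combination of both indexes through bitmap conversion; the plan shape then looks roughly like this:

select *
from   orders o
where  o.customer_id = :1
and    o.status = :2;

-- Illustrative plan shape when the b-tree to bitmap transformation is chosen:
-- TABLE ACCESS BY INDEX ROWID  ORDERS
--   BITMAP CONVERSION TO ROWIDS
--     BITMAP AND
--       BITMAP CONVERSION FROM ROWIDS
--         INDEX RANGE SCAN     ORDERS_CUST_IDX
--       BITMAP CONVERSION FROM ROWIDS
--         INDEX RANGE SCAN     ORDERS_STATUS_IDX

Setting "_b_tree_bitmap_plans" to false simply removes this option from the optimizer's search space, which is what cut the parse CPU above.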

March 6, 2007

Why you should not underestimate the LIO impact on CPU load

Filed under: Oracle — christianbilien @ 4:51 pm

The role of Logical I/O in CPU consumption is often underestimated. Here is a practical case:

We have a poorly designed partitioning strategy which does not allow for any partition pruning: the queries never contain an equality, BETWEEN or any other predicate on the partition key that the optimizer could use for pruning. However, the number of active transactions is fairly low, 2 or 3 on average, so a pretty good strategy had so far been to parallelize reads across the partitions. The downside is of course a CPU hog on top of a huge I/O throughput.

So our baseline before Feb 28th is a busy system, a low run queue and pretty good response times.

[Figure: cpuload.JPG, CPU load over the period]

March 1st was a busy day: an unusual number of inserts were performed into a table which holds many indexes. The insert profile is:

 

Buffer gets: 16,739,612

Executions: 167,979

CPU Time (s): 904.05

This shows that each buffer get "costs" 54 microseconds on my SPARC IV domain, interestingly close to Cary Millsap's 53 microseconds in "Why You Should Focus on LIOs Instead of PIOs", although his article was written in 2001 or 2002 (it makes you wonder why we spend that many millions of euros on those huge boxes; just joking). This figure is of course dependent upon the SQL call and the underlying objects.

 

So here we are, 100 000 inserts will cost approximately:

100,000 x 50 microseconds x gets per insert (100 here) = 500 s; 200,000 inserts will cost 1,000 s, and so forth.
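If you want to derive the same per-LIO figure for a given statement, here is a minimal sketch against V$SQLAREA (CPU_TIME is reported in microseconds; the :hv bind stands for the hash value of whatever statement you are looking at):

select buffer_gets,
       executions,
       round(cpu_time / 1e6, 2)                    cpu_seconds,
       round(cpu_time / nullif(buffer_gets, 0), 1) cpu_microsec_per_lio,
       round(buffer_gets / nullif(executions, 0))  gets_per_execution
from   v$sqlarea
where  hash_value = :hv;

For the insert above this gives 16,739,612 gets, 904.05 s of CPU, and therefore roughly 54 microseconds per get.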

By the way, what is an Oracle LIO, and why does it cost 50 microseconds when a memory page access time is roughly 100 nanoseconds?

Quote from O'Reilly's "Optimizing Oracle Performance":

 

An operation in which the Oracle kernel obtains and processes the content of an Oracle block from the Oracle database buffer cache. The code path for an Oracle LIO includes instructions to determine whether the desired block exists in the buffer cache, to update internal data structures such as a buffer cache hash chain and an LRU chain, to pin the block, and to decompose and filter the content of the retrieved block. Oracle LIO operations occur in two modes: consistent and current. In consistent mode, a block may be copied (or cloned) and the clone modified to represent the block at a given point in history. In current mode, a block is simply obtained from the cache “as-is.”

 

February 26, 2007

Calling DDL from PL/SQL: you can’t avoid paying the parsing price

Filed under: Oracle — christianbilien @ 1:41 pm

Calling DDL from PL/SQL can be done using either of two options (if you find a third one, drop me a comment): you can call the DBMS_SQL package (available for a long time), or the newer EXECUTE IMMEDIATE. Let's assume that you want to rewrite this piece of code:

begin
  for i in 1..10000
  loop
    select order_detail_od_id_seq.nextval into ood_id from dual;
    -- ... some action into an array indexed by i ...
  end loop;
end;

into:

begin
  for i in 1..batch_size
  loop
    -- ... some action into an array indexed by i ...
  end loop;
end;

execute immediate 'alter sequence od_id_seq_test increment by 10000';

This is much faster (the execute immediate is about 500 times faster than doing 10,000 select ... nextval), but still not fast enough. The execute immediate takes about 27 ms on a SPARC IV 1.3 GHz, of which 26 ms is parse time. I would have liked to cut this down to a few ms. My idea was to prepare the SQL call using DBMS_SQL.PARSE, and then call DBMS_SQL.EXECUTE from the main program.

create or replace procedure testseq1(cur out number) as
BEGIN
  cur := DBMS_SQL.OPEN_CURSOR;
  DBMS_SQL.PARSE(cur, 'alter sequence od_id_seq_test increment by 10000', DBMS_SQL.NATIVE);
END;

create or replace procedure testseq2(cur in number) as
  rc integer;
begin
  rc := DBMS_SQL.EXECUTE(cur);
end;

 

alter session set events '10046 trace name context forever, level 12';

declare
  cur integer;
begin
  testseq1(cur);
  -- testseq2(cur);  test for the time being
END;

This is where it gets interesting. Here is an excerpt of the trace:

alter session set events '10046 trace name context forever, level 12'

PARSING IN CURSOR #1 len=67 dep=0 uid=0 oct=47 lid=0 tim=14667711607899 hv=2605459378 ad='e4653ed8'

declare cur integer;

begin

testseq1(cur);

-- testseq2(cur);

END;

alter sequence od_id_seq_test increment by 10000

select increment$,minvalue,maxvalue,cycle#,order$,cache,highwater,audit$,flags from seq$ where obj#=:1

update seq$ set increment$=:2,minvalue=:3,maxvalue=:4,cycle#=:5, order$=:6,cache=:7,highwater=:8,audit$=:9, flags=:10 where obj#=:1

update obj$ set obj#=:6,type#=:7,ctime=:8,mtime=:9 …

In other words, I am NOT calling DBMS_SQL.EXECUTE, yet DBMS_SQL.PARSE did execute the call. Even funnier, a subsequent call to DBMS_SQL.EXECUTE does not do anything!

Metalink note 1008453.6 indeed says that DBMS_SQL.PARSE will not only parse, but also execute, DDL statements.

So for now, I have to live with an "execute immediate (DDL)"!

February 22, 2007

Trace analyzer vs Tkprof: beware of the recursive calls

Filed under: Oracle — christianbilien @ 10:02 am

Metrology (the science of measurement) is often underestimated. I came across a splendid example of 'know what you are measuring' today. I always use the trace analyzer in place of tkprof nowadays; however, a customer had already traced a session and came to me with tkprof metrics to work on. I spotted a huge difference in cpu and elapsed time on a few calls between statspack and the tkprof results. Tkprof indeed does NOT account for recursive statements in their parent: the time spent performing triggers is subtracted out of the triggering statement.

An insert fires a trigger which performs a select.

Here is the trace analyzer vs tkprof output, both for the insert and the underlying select. The "per insert" figures are the total cpu and elapsed times divided by the number of calls.

Trace analyzer:

Per insert          elapsed (ms)   cpu (ms)
Insert                        59         14
select (trigger)              26          6

Tkprof:

Per insert          elapsed (ms)   cpu (ms)
Insert                      5,74       2,01
select (trigger)           26,14       5,84

February 11, 2007

Join the BAARF Party (or not) (2/2)

Filed under: Oracle,Storage — christianbilien @ 9:33 pm

Vendors claim that because full stripe aggregation can be done by modern storage arrays, the RAID 5 write overhead disappears, making RAID 10 less attractive (or not attractive at all) and thereby dramatically reducing storage costs.

Is it true?

Before jumping on the stripe aggregation tests, you may find it useful to read the first post I wrote on striping and RAID.

I tried to prove, using pure sequential writes on an EMC CX600, that full stripe aggregation exists for small stripes (5 x 64K = 320K) on a RAID 5 5+1 (i.e. almost no small write penalty has to be paid). However, once you increase the stripe size (up to a RAID 5 10+1), the full stripe aggregation just vanishes.

Test conditions:

The Navisphere Performance Analyzer is used for measuring disk throughputs. It does not provide any metric that shows whether full stripe aggregation is performed or not, so I just generated write bursts and did some maths (I'll develop this in another blog entry), based on the reads expected to be generated by RAID 5 writes.

Number of reads generated by writes on a RAID 5 4+1 array, where:

n: number of stripe units modified by a write request

r: number of stripe units read as a result of a request to write n stripe units

n   r   Explanation
1   2   Read one stripe unit and the parity block
2   2   Read two additional stripe units to compute the parity
3   1   Read one more stripe unit to compute the parity
4   0   No additional reads needed

In this way you can compute the number of reads (r) as a function of the number of stripe units written (n).
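The table can be summarized by a single formula (my generalization, assuming the controller always picks the cheaper of read-modify-write, which reads the n target units plus the parity, and reconstruct-write, which reads the N - n untouched units), with N the number of data columns (4 here):

r = \min(n + 1,\; N - n)

For N = 4 this gives r = 2, 2, 1, 0 for n = 1, 2, 3, 4, matching the table above.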

The stripe unit is always 64K; the number of columns in the raid group is 5+1 in test 1 and 11+1 in test 2.

Test 1:

Small stripes (320K) on a RAID 5 5+1

I/O generator: 64 writes/s for an average throughput of 4050 kB/s. The average I/O size is therefore 64 KB. No read occurs. The operating system buffer cache has been disabled.

Assuming write aggregation, writes should be performed as 320 KB (full stripe) units. Knowing the operating system I/O rate (64/s), and assuming that we aggregate 5 OS I/Os into one, the raid group I/O rate should be 64/5 = 12,8 I/O/s. Analyzer gives an average throughput of 13,1 for a particular disk. Write aggregation also means that no reads should be generated; the analyzer reports 2,6 reads/s on average. Although very low, this shows that write aggregation may not always be possible, depending on particular cache situations.

Test 2:

Large stripes (768K)

82MB are sequentially written every 10s in bursts.

1. Write throughput at the disk level is 12,6 writes/s, read throughput is 10,1/s. Knowing that no reads are sent from the OS, those figures alone show that full stripe write aggregation cannot be done by the Clariion.

2. However, some aggregation (above 5 stripe units per write) is done, as read throughput would otherwise be even higher: no aggregation at all would mean an extra 25 reads per second. The small write penalty does not vary greatly anyway once the number of stripe units per write is above 4.

February 10, 2007

Join the BAARF Party (or not) (1/2)

Filed under: Oracle,Storage — christianbilien @ 6:38 pm

This is the title of the last chapter of "Oracle Insights, Tales of the Oak Table", which I just finished reading. The "Battle Against Any Raid Five" (http://www.baarf.com) is also mentioned in "Oracle Wait Interface: A Practical Guide to Performance Diagnostics & Tuning". So is RAID 5 really so bad? I'll start with generalities, and in a second post I'll describe some stripe aggregation tests I did.

1. Large reads, i.e. reads of a full stripe group, can be done in parallel by reading from the four disks that contain the stripe group.

2. Small reads, i.e. reads of one stripe unit, exhibit good performance because they only tie up one disk, therefore allowing other small reads to other disks to proceed in parallel.

3. Large writes, i.e. writes of a full stripe group, require writing to five disks: the four data disks and the parity disk for the stripe. The RAID 5 MR3 optimization implemented for example in EMC Clariions such as the CX700 delays writing to the disks until a RAID 5 stripe has filled in cache, at which time a modified RAID 3 write (MR3) is performed. The steps are: the entire stripe is XORed in memory to produce a parity segment, and the entire stripe, including the new parity, is written to the disks. In comparison with a mirrored stripe scheme (RAID 10), when writing a full stripe the RAID 5 engine writes to N+1 disks, the RAID 10 engine to 2*N disks.

– Large I/O stripe detection: if a large I/O is received, the RAID engine detects whether it can fill a stripe, and writes the stripe out in MR3 fashion.

– The RAID engine detects data being written sequentially to the LUN and delays flushing the cache pages for a stripe until the sequential writes have filled the stripe.

4. Small writes, i.e. writes to one stripe unit, require that the parity block for the entire stripe be recomputed. Thus, a small write requires reading the stripe unit and the parity block (hopefully in parallel), computing the new parity block, and writing the new stripe unit and the new parity block in parallel. On a RAID 5 4+1, a write to one stripe unit therefore actually requires four I/Os. This is known as the RAID 5 small write penalty (a short worked comparison follows after point 6).

5. If the cache is saturated, RAID 1+0 allows more writes to the system before cache flushing increases response time.

6. Alignment: it is always desirable (but seldom feasible) to align I/Os on stripe element boundaries to avoid disk crossings (RAID stripe misalignment). A single I/O split across two disks actually incurs two I/Os on two different stripes. This is even more costly with RAID 5, as there is an additional stripe's worth of parity to calculate. On Intel architecture systems (at least the Xeon; it is likely to be different for the Itanium), the placement of the Master Boot Record (MBR) at the beginning of each logical device causes the subsequent data structures to be misaligned by 63 sectors (of 512 bytes each, i.e. 31.5 KB). The LUN alignment offset must be specified in Navisphere to overcome this problem.
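To put rough numbers on points 3 and 4, here is a back-of-the-envelope count (not a measurement) for a RAID 5 4+1 group versus a RAID 10 group of 4+4 disks holding the same data:

\text{full stripe write: } N + 1 = 5 \text{ disk writes for RAID 5} \quad \text{vs} \quad 2N = 8 \text{ for RAID 10}

\text{single stripe unit write: } 2\ \text{reads} + 2\ \text{writes} = 4 \text{ I/Os for RAID 5} \quad \text{vs} \quad 2 \text{ writes for RAID 10}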

Storage array cache sizing

Filed under: Oracle,Storage — christianbilien @ 11:28 am

I often have to argue, or even fight, with storage providers about cache sizing. Here is a set of rules I apply in proofs of concept and disk I/O modelling.

Write cache size:

1. Cache sizing: the real issue is not the cache size, but how fast the cache can flush to disk. In other words, under sustained I/O, the cache will fill and I/O will bottleneck if this condition is met: the rate of incoming I/O is greater than the cache's ability to flush.

2. The size of the write cache matters when the array must handle a burst of write I/O. A larger cache can absorb bigger bursts of writes, such as database checkpoints, so the burst can be contained without triggering forced flushes (a back-of-the-envelope formula follows after point 4).

3. Write cache mirroring from one SP (storage processor) to the other is normally activated for redundancy and removal of a single point of failure. That is, the write cache in each SP contains both the primary cached data for the logical units it owns and a secondary copy of the cache data for the LUNs owned by its peer SP. In other words, SP1's write cache holds a copy of SP2's write cache and vice versa. Overall, the real write cache size (seen from a performance point of view) is half the configured write cache.

4. The write cache is used for RAID 5 full stripe aggregation (when the storage firmware supports this feature) and parity calculation, a very useful feature for many large environments. Lack of space must not force the cache to destage.
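A back-of-the-envelope model for points 1 and 2 (the symbols are mine, not vendor terminology): with a sustained incoming write rate W and a destage rate F, the cache only fills when W > F, and a usable write cache of size C then absorbs the burst for about

t \approx \frac{C}{W - F}

seconds before forced flushing starts. Per point 3, C is typically half of the configured cache because of the SP mirroring, and per point 4 some of it is also needed for stripe aggregation.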

Read cache size:

1. Random reads have little chance of being in cache: the storage array read cache is unlikely to provide much value on top of the SGA and possibly the file system buffer cache (depending on FILESYSTEMIO_OPTIONS and file system direct I/O mount options).

2. However, array read caches are very useful for prefetches. I can see three cases where this occurs:

  • Oracle direct reads (or direct path reads) are operating system asynchronous reads (I'll write a new post on this later). Assuming a well designed backend, you will be able to use a large disk bandwidth by issuing large numbers of parallel disk I/Os.
  • Buffered reads (imports for example) will take advantage of both file system and storage array read-ahead operations.

Also take a look at http://www.oracle.com/technology/deploy/availability/pdf/oow2000_same.pdf

February 6, 2007

Cursor_sharing and begin myproc(….) end;

Filed under: Oracle — christianbilien @ 8:15 pm

I owe a lot to Jonathan Lewis for his book "Cost Based Oracle Fundamentals", from which I learned many useful things. Here is the latest interesting trick I found in this book:

I had to set cursor_sharing=force in a 10gR2 database to remove parse calls generated by literals coming from dynamic SQL, but I got stuck on a PL/SQL call for which Oracle was not willing to perform bind variable substitution. It took me a while to realize that the parsing was not generated by the PL/SQL procedure body (there was no reason for it to be), but by the call itself. The call was written using the old "begin mypack.SetMyproc ... end;" syntax, which I changed to the newer "CALL mypack.SetMyproc". Here is the resulting trace:

Before:

PARSING IN CURSOR #2 len=168 dep=0 uid=72 oct=47 lid=72 tim=5753883462601 hv=3807832698 ad='51cad930'

BEGIN mypack.SetMyproc ('AZZ.4.2','AZZ2','BRIDGE2',121,'8=FIX.4.2^A9=53^…)

END ;

END OF STMT

PARSE #2:c=10000,e=9412,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,tim=5753883462587

After:

PARSING IN CURSOR #1 len=164 dep=0 uid=72 oct=170 lid=72 tim=108036751889 hv=628731915 ad='37f9ae70'

CALL mypack.SetMyproc (:"SYS_B_00",:"SYS_B_01",:"SYS_B_02",:"SYS_B_03",…)

END OF STMT

PARSE #1:c=0,e=234,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=108036751882
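To restate the source change outside of the trace (a sketch, with the argument values shortened): cursor_sharing=force does not replace literals that sit inside the text of an anonymous PL/SQL block, but it does for a top-level CALL statement, which is plain SQL:

-- before: the literals stay literal inside the anonymous block
begin mypack.SetMyproc('AZZ.4.2', 'AZZ2', 'BRIDGE2', 121, '...'); end;

-- after: CALL is a SQL statement, so the literals are replaced by :SYS_B_nn binds
call mypack.SetMyproc('AZZ.4.2', 'AZZ2', 'BRIDGE2', 121, '...');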

Bringing this piece of code from 9.4 ms down to 0.2 ms was great, as we are really fighting for millisecond gains in this application (not so common for most customers, but common in electronic trading).

