# Christian Bilien’s Oracle performance and tuning blog

## September 20, 2007

### Is dbms_stats.set_xxx_stats insane tuning? (part 1/2)

Filed under: Oracle — christianbilien @ 11:04 am

The challenge is a very common one: changing the plan of SQL calls when

– the calls are generated by a vendor package (i.e. you cannot easily have the code changed, so no hints can be used)

and

– the SQL calls cannot be syntactically predicted, either because they are assembled according to user input criteria (i.e. no outlines) or because the set of calls is too numerous to handle one by one

and

– I do not want to alter the initialization parameters (OPTIMIZER_INDEX_COST_ADJ and OPTIMIZER_INDEX_CACHING most of the time) because they apply to the whole instance

and

– dbms_sqltune / SQL Advisor (10g) does not provide the right plan, or I am working on an earlier version (*)

This is where dbms_stats.set_xxx_stats comes to my mind, with the fear that one of my interlocutors will start calling me insane once I start muttering "I may have another option".

Some background might be necessary: ever since I first read, some years ago, the Wolfgang Breitling papers (http://www.centrexcc.com/) about the 10053 trace and the cost algorithm, followed with growing enthusiasm by Jonathan Lewis' "CBO fundamentals", I have thought that setting statistics is a valid approach provided that three requirements are met:

• There is hardly another way of getting to an acceptable plan (not even necessarily the “best”)
• You know what you are doing and the potential consequences on other plans
• The customer is psychologically receptive: he or she understands what you are doing, the reasons you are doing it that way and the need to maintain the solution or even to discard it in the future.

I’ll explain in this first part the basics to clarify my point. Jump directly to the second part (once it is written) if you already know about cost calculations (or risk being bored).

On the technical side, the most common use of dbms_stats.set_xxx_stats for me is to change index statistics via the set_index_stats procedure, in order to make an index more attractive.

I’ll start with the most basic example to make my point. Let’s consider a select statement such as:

select * from mytable where status_code='AAAA';

mytable is not partitioned, status_code is indexed (non-unique) and no histograms were calculated.

In this particular case, having status_code compared to a bind variable or a literal does not make any difference as far as cost calculation is concerned, but it does when the predicate is >, <, >=, etc. The most annoying bit is that you will see significant changes to the cost and selectivity algorithms in 9i and in each 10g version when the predicate differs from a simple equality. Another glitch is the change in calculations depending on status_code being in bounds or out of bounds (i.e. being or not being between user_tab_col_statistics.low_value and user_tab_col_statistics.high_value). Check Wolfgang Breitling's "A look under the hood of the CBO" and Jonathan Lewis' book for more comprehensive research.

So let's get back to our simple select * from mytable where status_code='AAAA';

We know the I/O cost of an index-driven access path will be:

Cost=
blevel +
ceiling(leaf_blocks * effective index selectivity) +
ceiling (clustering factor * effective table selectivity)

Blevel, the height of the index, is usually a small value compared to the two other terms. Blevel, leaf_blocks and clustering_factor are columns of the user_indexes view.

The effective index selectivity and the effective table selectivity are both calculated by the optimizer.

The effective index selectivity is found in a 9i 10053 trace file as IXSEL. In our simple select, the index selectivity can be obtained from user_tab_col_statistics or user_tab_columns (the Oracle documentation states that density and num_distinct in user_tab_columns are only maintained for backward compatibility).

The effective table selectivity is found as TB_SEL in the 9i 10053 trace (it is IX_SEL_WITH_FILTERS in 10g). With the "=" predicate it is:

• the density column if a histogram is used. You'll find that density differs from 1/NUM_DISTINCT when histograms are calculated
• 1/NUM_DISTINCT (of the column) otherwise. In this case density simply reports 1/NUM_DISTINCT.

The third term is often overwhelmingly dominant (it is not, for example, when fast full index scans can be used). It is essential to know which of DENSITY and NUM_DISTINCT is really used, especially in 10g where many databases are left with the default statistics calculations (with histograms).
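To make the rule concrete, here is a minimal Python sketch of the "=" predicate selectivity choice described above (a toy model, not anything Oracle exposes; the stats values are hypothetical):

```python
# Sketch of the "=" predicate selectivity rule: with a histogram the CBO
# uses DENSITY; without one, DENSITY simply equals 1/NUM_DISTINCT, so
# both columns report the same number.
def equality_selectivity(num_distinct, density, has_histogram):
    if has_histogram:
        return density          # may differ from 1/num_distinct
    return 1.0 / num_distinct   # density would report the same value

# Hypothetical column stats: 6 distinct status codes, no histogram
print(equality_selectivity(num_distinct=6, density=1/6, has_histogram=False))
```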

Let’s see in 9i (9.2.0.7) how this works with cpu costing disabled:

Index access cost

set autotrace traceonly

alter session set "_optimizer_cost_model"=io;  -- because cpu costing is enabled for this database

select * from mytable where status_code='AAAA';

Execution Plan

———————————————————-

0 SELECT STATEMENT Optimizer=CHOOSE (Cost=20872 Card=690782 Bytes=148518130)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'mytable' (Cost=20872 Card=690782 Bytes=148518130)
2 1 INDEX (RANGE SCAN) OF 'mytableI000003' (NON-UNIQUE) (Cost=1432 Card=690782)

Let’s make the computation to verify the cost of 20872:

Blevel =2
Leaf blocks=8576
Clustering Factor = 116639
Here, index Selectivity = table selectivity = density= 1/num distinct = 1/6 = 0.166667

leaf_blocks * effective index selectivity = 1429.33
clustering factor * table selectivity = 19439.33

Cost=blevel + ceiling(leaf_blocks * effective index selectivity) + ceiling (clustering factor * effective table selectivity) = 20872

As expected, clustering factor * table selectivity is by far the dominant factor.
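The arithmetic above can be replayed in a few lines of Python (stats values copied from the listing; this only checks the I/O cost formula, not the full CBO logic):

```python
import math

# Re-checking the index access cost from the stats quoted above.
blevel, leaf_blocks, clustering_factor = 2, 8576, 116639
selectivity = 1 / 6  # = density = 1/num_distinct here

index_part = math.ceil(leaf_blocks * selectivity)         # 1430
table_part = math.ceil(clustering_factor * selectivity)   # 19440
cost = blevel + index_part + table_part

# blevel + index_part = 1432: the Cost= shown on the INDEX RANGE SCAN line
print(blevel + index_part)  # 1432
print(cost)                 # 20872
```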

Table access cost

The basic formula is derived from “number of blocks below the high water mark” divided by db_file_multiblock_read_count. This formula has to be amended: Oracle actually uses a divisor (an “adjusted dbf_mbrc” as Jonathan Lewis calls it) which is actually a function of both db_file_multiblock_read_count and the block size. W. Breitling inferred in “A look under the hood of the CBO” how the divisor would vary as a function of db_file_multiblock_read_count for a fixed block size.

I used a 16k block size and db_file_multiblock_read_count=16. I derived the adjusted dbf_mbrc by dividing the number of blocks below the HWM (46398) by the IO_COST:

alter session set "_optimizer_cost_model"=io;
select /*+ full(d) */ * from mytable d where
status_code='AAAA';

Execution Plan

———————————————————-

0 SELECT STATEMENT Optimizer=CHOOSE (Cost=7044 Card=544275 Bytes=111032100)

1 0 TABLE ACCESS (FULL) OF 'MYTABLE' (Cost=7044 Card=544275 Bytes=111032100)

A final word: the adjusted dbf_mbrc is about the cost calculation only, not about the actual I/O size.
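The division itself is trivial but worth making explicit (numbers taken from the listings above; the "adjusted dbf_mbrc" label follows Jonathan Lewis' terminology):

```python
# Replaying the derivation above: adjusted dbf_mbrc = blocks below the
# high water mark / full-scan IO cost reported by the optimizer.
blocks_below_hwm = 46398   # from the post
full_scan_cost = 7044      # Cost= on the TABLE ACCESS (FULL) line
adjusted_dbf_mbrc = blocks_below_hwm / full_scan_cost
print(round(adjusted_dbf_mbrc, 2))  # 6.59, for a 16k block size and mbrc=16
```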

(*) Jonathan Lewis describes in the "CBO fundamentals" book a possible tweak in profiles using the OPT_ESTIMATE hint, with a warning that this is an undocumented, subject-to-change feature, so I'll leave it outside this discussion. Dbms_stats.set_xxx_stats is, on the other hand, a documented and supported option.

## September 14, 2007

### Strategies for parallelized queries across RAC instances (part 2/2)

Filed under: Oracle,RAC — christianbilien @ 5:52 am

If you have not done so, it would be beneficial to first go through the first post "Strategies for RAC inter-instance parallelized queries (part 1/2)" to get an understanding of the concepts and the challenge. There I described an asymmetric strategy for handling RAC inter-instance parallelized queries while still being able to force both the slaves and the coordinator to reside on just one node. The asymmetric configuration is a connection and service based scheme which allows users

• connected to node1 to execute both the coordinator and the slaves on node1, preventing the slaves from spilling onto node2
• connected to node2 but unable to issue an alter session (because the code belongs to an external provider) to load balance their queries across the nodes
• connected to node1 and able to issue an alter session to load balance.

Another way of doing things is to load balance the default service to which the applications connect, and to provide restricted access to node1 and to node2. This is a symmetric configuration.

Tnsnames and service names are the same as in the asymmetric configuration.

bothnodes =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = myvip1)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = myvip2)(PORT = 1521))
(CONNECT_DATA =
(SERVICE_NAME = bothnodestaf)
(FAILOVER_MODE =
(TYPE = SELECT)
(METHOD = BASIC)
(RETRIES = 180)
(DELAY = 5)
)
(SERVER = DEDICATED)
)
)

execute dbms_service.modify_service (service_name => 'bothnodestaf' -
, goal => dbms_service.goal_throughput -
, failover_method => dbms_service.failover_method_basic -
, failover_type => dbms_service.failover_type_select -
, failover_retries => 180 -
, failover_delay => 5 -
, clb_goal => dbms_service.clb_goal_short);

Both spfiles were changed:

On node1, spfileI1.ora contains:

INSTANCE_GROUPS ='IG1','IGCLUSTER'

PARALLEL_INSTANCE_GROUP='IGCLUSTER'

On node2, spfileI2.ora contains:

INSTANCE_GROUPS ='IG2','IGCLUSTER'

PARALLEL_INSTANCE_GROUP='IGCLUSTER'

select inst_id, name, value from gv$parameter where name like '%instance%group%';

   INST_ID NAME                      VALUE
---------- ------------------------- ----------------
         1 instance_groups           IG1, IGCLUSTER
         1 parallel_instance_group   IGCLUSTER
         2 instance_groups           IG2, IGCLUSTER
         2 parallel_instance_group   IGCLUSTER

By default, both nodes will behave in the same way. As the PARALLEL_INSTANCE_GROUP matches one of the INSTANCE_GROUPS on both nodes, load balancing will work by default whatever the node the application connects to.

On node 1:

$ sqlplus mylogin/mypass@bothnodes

< no alter session set parallel_instance_group= >

select /*+ full(orders_part) */ count(*) from orders_part;

QCSID Inst Group Set PROGRAM

———- ———- ———- ———- ————————————————
1050 1 1 1 oracle@node1 (P010)
1050 1 1 1 oracle@node1 (P011)
1050 1 1 1 oracle@node1 (P009)
1050 1 1 1 oracle@node1 (P008)
1050 2 1 1 oracle@node2 (P009)
1050 2 1 1 oracle@node2 (P011)
1050 2 1 1 oracle@node2 (P010)
1050 2 1 1 oracle@node2 (P008)
1050 1 sqlplus@node1 (TNS V1-V3)

On node 2:

$ sqlplus mylogin/mypass@bothnodes

< no alter session set parallel_instance_group= >

select /*+ full(orders_part) */ count(*) from orders_part;

QCSID Inst Group Set PROGRAM
---------- ---------- ---------- ---------- ------------------------------------------------
997 1 1 1 oracle@node1 (P010)
997 1 1 1 oracle@node1 (P011)
997 1 1 1 oracle@node1 (P009)
997 1 1 1 oracle@node1 (P008)
997 2 1 1 oracle@node2 (P009)
997 2 1 1 oracle@node2 (P011)
997 2 1 1 oracle@node2 (P010)
997 2 1 1 oracle@node2 (P008)
997 2 sqlplus@node2 (TNS V1-V3)

You may notice a subtle difference: the coordinator runs, as expected, on the node the service connects to.

Node restriction:

The tnsnames.ora and the service definition are left unchanged from the tests performed in the previous post. nodetaf1 is set to node1='PREFERRED' and node2='NONE'.

On node1:

Tnsnames.ora (unchanged):

node1-only =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = node1)(PORT = 1521))
)
(LOAD_BALANCING=NO)
(CONNECT_DATA =
(SERVICE_NAME = nodetaf1)
(SID = EMDWH1)
(SERVER = DEDICATED)
)
)

$ sqlplus mylogin/mypass@node1-only

alter session set parallel_instance_group='IG1';

select /*+ full(orders_part) */ count(*) from orders_part;

QCSID Inst Group Set PROGRAM
———- ———- ———- ———- ————————————————
984 1 1 1 oracle@node1 (P000)
984 1 1 1 oracle@node1 (P004)
984 1 1 1 oracle@node1 (P002)
984 1 1 1 oracle@node1 (P005)
984 1 1 1 oracle@node1 (P006)
984 1 1 1 oracle@node1 (P001)
984 1 1 1 oracle@node1 (P007)
984 1 1 1 oracle@node1 (P003)
984 1 sqlplus@node1 (TNS V1-V3)

On node 2:

The tnsnames.ora and service definition are symmetric to what they are on node1

$ sqlplus mylogin/mypass@node2-only

alter session set parallel_instance_group='IG2';

select /*+ full(orders_part) */ count(*) from orders_part;

QCSID Inst Group Set PROGRAM
---------- ---------- ---------- ---------- ------------------------------------------------
952 1 1 1 oracle@node2 (P000)
952 1 1 1 oracle@node2 (P004)
952 1 1 1 oracle@node2 (P002)
952 1 1 1 oracle@node2 (P005)
952 1 1 1 oracle@node2 (P006)
952 1 1 1 oracle@node2 (P001)
952 1 1 1 oracle@node2 (P007)
952 1 1 1 oracle@node2 (P003)
952 1 sqlplus@node2 (TNS V1-V3)

## September 12, 2007

### Strategies for RAC inter-instance parallelized queries (part 1/2)

Filed under: Oracle,RAC — christianbilien @ 8:32 pm

I recently had to sit down and think about how I would spread workloads across nodes in a RAC data warehouse configuration. The challenge was quite interesting; here were the specifications:

• Being able to fire PQ slaves on different instances in order to use all available resources for most of the aggregation queries.
• Based on a tnsnames connection, being able to restrict PQ access to one node because external tables may be used. Slaves running on the wrong node would fail:

insert /*+ append parallel(my_table,4) */ into my_table
ERROR at line 1:
ORA-12801: error signaled in parallel query server P001, instance node-db04:INST2 (2)
ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
KUP-04040: file myfile408.12.myfile_P22.20070907.0.txt in MYDB_INBOUND_DIR not found
ORA-06512: at "SYS.ORACLE_LOADER", line 19

I could come up with two basic configurations based on combinations of services, INSTANCE_GROUPS and PARALLEL_INSTANCE_GROUP. This post is about an asymmetric configuration; the second will deal with a symmetric configuration.

Parallel executions are service aware: the slaves inherit the database service from the coordinator.
But the coordinator may execute slaves on any instance of the cluster, which means that the service localization ('PREFERRED' instance) will create the coordinator on the designated instance, but the slaves are not bound by the service configuration. So I rewrote "being able to restrict PQ access to one node" into "being able to restrict PQ access and the coordinator to one node".

An instance belongs to the instance groups it declares in its init.ora/spfile. Each instance may have several instance groups declared: INSTANCE_GROUPS ='IG1','IG2' (be careful not to set 'IG1,IG2'). PARALLEL_INSTANCE_GROUP is another parameter which can be configured either at the system or at the session level. Sessions may fire PQ slaves on instances for which the parallel instance group matches an instance group the instance belongs to.

Let's consider a two-node configuration: node1 and node2. I1 and I2 are instances respectively running on node1 and node2.

The Oracle Data Warehousing Guide states that a SELECT statement can be parallelized for objects created with a PARALLEL declaration only if the query involves either a full table scan or an inter-partition index range scan. I'll force a full table scan in my test case to have the CBO pick a parallel plan:

select /*+ full(orders_part) */ count(*) from orders_part;

The configuration described below will allow users:

• connected to node1 to execute both the coordinator and the slaves on node1, preventing the slaves from spilling onto node2
• connected to node2 but unable to issue an alter session (because the code belongs to an external provider) to load balance their queries across the nodes
• connected to node1 and able to issue an alter session to load balance.

In a nutshell, users connecting to node1 will restrict their slave scope to node1, while users connecting to node2 will be allowed to load balance their slaves over all the nodes.
I set parallel_min_servers=0 for the test: I can then see the slaves starting whenever Oracle decides to fire them.

Asymmetric INSTANCE_GROUPS configuration:

On node1, spfileI1.ora looks like:

INSTANCE_GROUPS ='IG1','IG2'
PARALLEL_INSTANCE_GROUP='IG1'

On node2, spfileI2.ora contains:

INSTANCE_GROUPS ='IG2'
PARALLEL_INSTANCE_GROUP='IG2'

select inst_id, name, value from gv$parameter where name like '%instance%group%';

INST_ID NAME VALUE

———- ——————————————————————————-

1 instance_groups IG1, IG2
1 parallel_instance_group IG1
2 instance_groups IG2
2 parallel_instance_group IG2
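The matching rule can be sketched in a few lines of Python (a toy model, not anything Oracle exposes; instance and group names mirror the asymmetric configuration above):

```python
# Slave-placement rule: a session may fire PQ slaves on any instance whose
# INSTANCE_GROUPS list contains the session's PARALLEL_INSTANCE_GROUP.
instance_groups = {
    "I1": {"IG1", "IG2"},  # node1
    "I2": {"IG2"},         # node2
}

def eligible_instances(parallel_instance_group):
    return sorted(inst for inst, groups in instance_groups.items()
                  if parallel_instance_group in groups)

print(eligible_instances("IG1"))  # ['I1']        -> slaves stay on node1
print(eligible_instances("IG2"))  # ['I1', 'I2']  -> slaves may load balance
```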

Single node access

I configured via dbca nodetaf1, a service for which node1/instance I1 was the preferred node and node2 was set to "not use" (I do not want to connect to one node and execute on another, thereby unnecessarily clogging the interconnect), and did the same for node2/instance I2.

SQL> select inst_id,name from gv$active_services where name like '%taf%';

   INST_ID NAME
---------- ----------------------------------------------------------------
         2 nodetaf2
         1 nodetaf1

The related TNSNAMES entry for a node1-only access looks like:

node1-only =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = node1)(PORT = 1521))
)
(LOAD_BALANCING=NO)
(CONNECT_DATA =
(SERVICE_NAME = nodetaf1)
(SID = EMDWH1)
(SERVER = DEDICATED)
)
)

Note that the host name is not the vip address (because there is no point in switching nodes should node1 fail).

Test results:

$ sqlplus mylogin/mypass@node1-only

< no alter session set parallel_instance_group= >
select /*+ full(orders_part) */ count(*) from orders_part;

select qcsid, p.inst_id "Inst", p.server_group "Group", p.server_set "Set", s.program
from gv$px_session p, gv$session s
where s.sid=p.sid and qcsid=&1
order by qcinst_id, p.inst_id, server_group, server_set;

QCSID Inst Group Set PROGRAM
———- ———- ———- ———- ————————————————

938 1 1 1 oracle@node1 (P000)
938 1 1 1 oracle@node1 (P004)
938 1 1 1 oracle@node1 (P002)
938 1 1 1 oracle@node1 (P005)
938 1 1 1 oracle@node1 (P006)
938 1 1 1 oracle@node1 (P001)
938 1 1 1 oracle@node1 (P007)
938 1 1 1 oracle@node1 (P003)
938 1 sqlplus@node1 (TNS V1-V3)

The coordinator and the slaves stay on node1

Dual node access

I then added a service aimed at firing slave executions on both nodes. The 'bothnodestaf' service was added using dbca and then modified to give it the "goal_throughput" and "clb_goal_short" load balancing advisories: according to the documentation, load balancing is based on the rate at which work is completed in the service, plus available bandwidth. I'll dig into that one day to get a better understanding of the available LB strategies.

execute dbms_service.modify_service (service_name => 'bothnodestaf' -
, goal => dbms_service.goal_throughput -
, failover_method => dbms_service.failover_method_basic -
, failover_type => dbms_service.failover_type_select -
, failover_retries => 180 -
, failover_delay => 5 -
, clb_goal => dbms_service.clb_goal_short);

In the tnsnames.ora:

bothnodes =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = myvip1)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = myvip2)(PORT = 1521))
(CONNECT_DATA =
(SERVICE_NAME = bothnodestaf)
(FAILOVER_MODE =
(TYPE = SELECT)
(METHOD = BASIC)
(RETRIES = 180)
(DELAY = 5)
)
(SERVER = DEDICATED)
)
)

Note that LOAD_BALANCE is unset (set to OFF) in the tnsnames.ora to allow server-side connection balancing.

On node1:

$ sqlplus mylogin/mypass@bothnodes

alter session set parallel_instance_group='IG2';

select /*+ full(orders_part) */ count(*) from orders_part;

QCSID Inst Group Set PROGRAM
---------- ---------- ---------- ---------- ------------------------------------------------
936 1 1 1 oracle@node1 (P002)
936 1 1 1 oracle@node1 (P003)
936 1 1 1 oracle@node1 (P001)
936 1 1 1 oracle@node1 (P000)
936 1 1 1 oracle@node2 (P037)
936 1 1 1 oracle@node2 (P027)
936 2 1 1 oracle@node1 (O001)
936 2 1 1 oracle@node2 (P016)
936 2 1 1 oracle@node2 (P019)
936 2 1 1 oracle@node2 (P017)
936 2 1 1 oracle@node2 (P018)
936 1 sqlplus@node1 (TNS V1-V3)

The slaves started on both nodes.

On node2:

$ sqlplus mylogin/mypass@bothnodes
< no alter session set parallel_instance_group= >
select /*+ full(orders_part) */ count(*) from orders_part;

QCSID Inst Group Set PROGRAM

———- ——————————————————————————-

952 1 1 1 oracle@node1 (P002)
952 1 1 1 oracle@node1 (P003)
952 1 1 1 oracle@node1 (P001)
952 1 1 1 oracle@node1 (P000)
952 1 1 1 oracle@node2 (P037)
952 1 1 1 oracle@node2 (P027)
952 2 1 1 oracle@node1 (O001)
952 2 1 1 oracle@node2 (P016)
952 2 1 1 oracle@node2 (P019)
952 2 1 1 oracle@node2 (P017)
952 2 1 1 oracle@node2 (P018)
952 1 sqlplus@node1 (TNS V1-V3)

Again, connecting to the node2 service allows load balancing between the nodes.

## August 31, 2007

### Migrating large, redo intensive databases to RAC using Data Guard

Filed under: Oracle,RAC — christianbilien @ 7:21 am

Here are some thoughts and a couple of issues raised by the handling of a 10TB database migration. This 10gR2 data warehouse had to be migrated from a non-RAC, Solaris UFS file system environment towards a three-node ASM RAC. What makes this move a not-so-easy endeavor is not primarily its size, but its redo log generation rate of 1GB every minute, 24 hrs / 7 days a week, though some loading is done in NOLOGGING mode. Storage replication is not an option here because of distance and budget. So I went with Data Guard, and found out a couple of things worth sharing:

Database duplication

The key issue I tried to work around was the estimated 30 hours of full backup/restore, coupled with the 1.8TB of redo generated in this time frame that I would have to recover on the standby database. The archived logs were furthermore backed up and purged every 3 hrs to make room.
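The backlog figure follows directly from the redo rate quoted earlier; a quick sanity check:

```python
# Sanity-checking the redo backlog: 1 GB of redo per minute over an
# estimated 30-hour backup/restore window.
redo_gb_per_minute = 1
window_hours = 30
backlog_gb = redo_gb_per_minute * 60 * window_hours
print(backlog_gb, "GB, i.e. about", round(backlog_gb / 1024, 1), "TB")
# 1800 GB, i.e. about 1.8 TB
```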

I found this little trick to trigger the Remote File Server (RFS) to start writing on the target standby platform even BEFORE the backup process starts. Here are the steps:

1. After generating a standby control file, I just mounted a dummy standby database on the target server. That's enough for the RFS to start the log file copy.
2. I ran the rman backup on the source.
3. When the backup was ready, I had to bounce the target DB into nomount mode to allow an rman "duplicate for standby".

The alert file shows the RFS restart attempts every 5 minutes:

PING[ARC0]: Heartbeat failed to connect to standby ‘RACSTANDBY’. Error is 12528.

4. This is where it gets interesting, because the "duplicate for standby" mounts the standby database, allowing the RFS to restart. The rman memory script shows:

{
set until scn 19415505108;
restore clone standby controlfile;

sql clone ‘alter database mount standby database’;
}

and into the alert file on the target system:

Redo Shipping Client Connected as PUBLIC
— Connected User is Valid
RFS[1]: Assigned to RFS process 27224
RFS[1]: Identified database type as ‘physical standby’

Caveats:

1. Allow enough space for the copied log files on the target for the restoration and the recovery to complete.
2. A log archive gap was created in step 3: rman mounted the standby and started copying, but some archive log files had to be manually copied (although the unchanged FAL configuration worked fine after the Data Guard setup).

ASM restoration bug

I used Metalink note 273015.1 as a starting point. You have to add to the primary (non-RAC) database two new logfile threads and two additional undo tablespaces for the second and third RAC instances, and run catclust.sql. What this note does not mention is the

ORA-00600: internal error code, arguments: [kjrfr7], [], [], [], [], [], [], []

you get in the ASM instance when you attempt to restore the database (usually after several hours, grrr). This came down to the unpublished bug 4750469, fixed in 10.2.0.3, which only came out on August 13th on Solaris x86-64, the platform I was working on. The bug can occur when Oracle does a sanity check of a resource whose memory has been released to the shared pool. I worked around the bug by "unclustering" the DB (cluster_database=false) both at the ASM and at the DB level during the restoration phase, then setting cluster_database back to true.

## August 16, 2007

### Workload characterization for the uninformed capacity planner

Filed under: HP-UX,Models and Methods,Oracle,Solaris,Storage — christianbilien @ 7:32 pm

Doug Burns initiated an interesting thread a while ago about user and application workloads, their meanings and the difficulties associated with their determination. Workload characterization is both essential and probably the hardest and most error-prone bit of the whole forecasting process. Models that fail to validate (i.e. are not usable) most of the time fall into one of these categories:

• The choice of characteristics and parameters is not relevant enough to describe the workloads and their variations
• The analysis and reduction of performance data was incorrect
• Data collection errors, misinterpretations, etc.

Unless you already know the business environment and the applications, or some previous workload characterization is already in place, you are facing a blank page. You can always try to do the smart workload partition along functional lines, but this effort is unfortunately often preposterous and doomed to failure because of time constraints. So what can be done?

I find clustering analysis a good compromise between time to deliver and business-meaningful workloads. Caveat: this method ignores any data cache (storage array, Oracle and file system cache, etc.) and locks/latches or any other waits unrelated to resource waits.

A simple example will explain how it works:

Let's assume that we have a server with a single CPU and a single I/O path to a disk array. We'll represent each transaction running on our server by a couple of attributes: the service time each of these transactions requires from the two physical resources. In other words, each transaction will require, in absolute terms, a given number of seconds of presence on the disk array and another number of seconds on the CPU. We'll call a required service time a "demand on a service center" to avoid confusion. The sum of those two values would represent the response time on an otherwise empty system, assuming no interaction occurs with any other external factor. As soon as you start running concurrent transactions, you introduce on one hand waits on locks, latches, etc., and on the other hand queues on the resources: the sum of the demands is no longer the response time. Any transaction may of course visit each resource several times: the sum of the times spent using each service center will simply equal the demand.

Let us consider that we are able to collect the demands each single transaction j requires from our two service centers. We'll name ${D}_{j1}$ the CPU demand and ${D}_{j2}$ the disk demand of transaction j. Transaction j can now be represented by a two-component workload: ${w}_{j}=({D}_{j1},{D}_{j2})$. Let's now start the collection. We'll collect over time every ${w}_{j}$ that goes on the system. Below is a real 300-point collection on a Windows server. I cheated a little bit because there are four CPUs on this machine, but we'll just say a single queue represents the four CPUs.

The problem is now obvious: there is no natural grouping of transactions with similar requirements. Another attempt can be made using natural (Napierian) logarithms to distort the scales:

This is not good enough either to identify meaningful workloads.

The Minimum Spanning Tree (MST) method can be used to perform successive fusions of data until the wanted number of representative workloads is obtained. It begins by considering each component of a workload to be a cluster of points. Next, the two clusters with the minimum distance are fused to form a cluster. The process iterates until the final number of desired clusters is reached.

• Distance: let's assume two workloads represented by ${w}_{i}=({D}_{i1},{D}_{i2},...,{D}_{iK})$ and ${w}_{j}=({D}_{j1},{D}_{j2},...,{D}_{jK})$. I moved from just two attributes per workload to K attributes, which correspond to service times at K service centers. The Euclidean distance between the two workloads will be $d=\sqrt{\sum_{n=1}^{K}{({D}_{in}-{D}_{jn})}^{2}}$.
• Each cluster is represented at each iteration by its centroid whose parameter values are the means of the parameter values of all points in the cluster.

Below is a 20-point reduction of the 300 initial points. In real life, thousands of points are used to avoid outliers and average the transactions.
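The fusion loop described above can be sketched in Python. This is a toy model with made-up (CPU, disk) demand pairs, using a greedy merge of the two closest centroids, in the spirit of the MST method rather than a faithful implementation of it:

```python
import math

# Repeatedly merge the two closest clusters (by centroid distance)
# until the desired number of representative workloads remains.
def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    # mean of each attribute over all points in the cluster
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def reduce_workloads(points, k):
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: distance(centroid(clusters[ij[0]]),
                                           centroid(clusters[ij[1]])))
        clusters[i] += clusters.pop(j)
    return [centroid(c) for c in clusters]

# Made-up (cpu_demand, disk_demand) pairs in seconds
demands = [(0.1, 0.9), (0.12, 0.85), (2.0, 0.2), (2.1, 0.25), (5.0, 5.0)]
print(reduce_workloads(demands, 3))
```

With these numbers the two I/O-bound points fuse first, then the two CPU-bound points, leaving three representative workloads.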

## July 27, 2007

### Oracle DB operations (cautiously) venturing into the ITIL world

Filed under: ITIL,Oracle — christianbilien @ 9:16 pm

My interest in ITIL started a couple of years ago, when activities I had practiced routinely for more than 15 years started to appear in large IT departments as processes within a larger framework of best practices. My initial interest went to Availability, IT Service Continuity and Capacity Management, which are ITIL processes I had practiced from the technical side. I then expanded my knowledge to the other processes and I am now running for the Service Manager certification. Although I am an ITIL practitioner, I reckon I'll need 2 to 3 months of evening time to get ready for the exams. Incidentally, this does not help me to keep up with my other nightly activities such as blogging…

ITIL is big in the UK and in the North of Europe and a number of organizations I know in the financial world in the US also adopted ITIL years ago and have now achieved the first degrees of maturity in several key ITIL processes.

It is beyond the scope of this post to explain what ITIL is (look at http://www.itil.co.uk for the official version – V3 came out in April 2007). ITIL is also one of those buzzwords used out of context in many press articles whenever a link has to be established between the IT user perception and the IT deliverables. Just out of curiosity, I tried to figure out where ITIL stands in the Oracle database world.

• My first encounter with ITIL within the Oracle community was in January 2007, when I downloaded from the RAC SIG site a presentation by Kirk McGowan, the "RAC Pack" technical director at Oracle (http://www.oracleracsig.org/pls/htmldb/RAC_SIG.download_my_file?p_file=1001040). He called his presentation "RAC & ASM best practices", which led me to initially believe it would be the usual blurb about installation procedures one can otherwise find in the Oracle books. But it wasn't. I hope I do not over-summarize his presentation by saying it boiled down to "why do RAC implementations fail?". The answer was: "Operational Process Requirements were not met" in terms of change management, availability and capacity planning, SLAs, etc., despite the fact that the ITIL framework had been there (among others) for years.
• Second encounter: the Siebel Help Desk. It is hardly surprising ITIL gets mentioned all over the place in the marketing materials as Service Desk is one of the ITIL functions.
• Third, Oracle started to label some existing functions of the Enterprise Manager (see http://www.oracle.com/technology/products/oem/pdf/ITILBestPractices.pdf ) as contributors to ITIL processes. Incident and problem Management are also shown within the Siebel perimeter, but you’ll find the EM servicing configuration, change and release management as well as monitoring service level compliance.
• Fourth: the marketing stuff. On demand, grid, virtualization, etc. “ITIL ready” labeled (what on earth could that mean?). No need to elaborate.

A somewhat more sarcastic view for the ITIL skeptics: http://www.itskeptic.org

I occasionally write in “IT-Expert”, a French IT magazine. I wrote an article about coherence and relationships of the ITIL function and processes in the July-August issue: https://christianbilien.files.wordpress.com/2007/07/it_2007_68_rab.pdf

## July 17, 2007

### Hooray for the 11g ASM: “Fast” Mirror Resync at last !

Filed under: Oracle,RAC — christianbilien @ 9:05 pm

Forgive me if I sound over-enthusiastic: I already mentioned in "RAC geo clusters on HP-UX" and in "RAC geo clusters on Solaris" how annoying the absence of an incremental mirror rebuild was to ASM based RAC geo clusters. As a matter of fact, the need for a full rebuild of the mirror structure writes off this architecture for databases over a few hundred GB. Here is a very common situation to me: you have a two-site RAC/ASM based geo cluster made of one storage array and one node on each site. Site A has to be taken down for cooling maintenance (actually sites are taken down at least 5 times a year), but the surviving site B must be kept up to preserve the weekend operations. The second site's ASM instances are gracefully taken down, but all of the mirrored ASM failure groups have to be fully rebuilt when they are brought up. The more databases, the more terabytes have to move around.

An actual outage is an even more dramatic opportunity for trouble: assuming the clusters transfer the application loads to your backup site, you nonetheless have to wait for the outage cause to be fixed, plus the ASM rebuild time, before being back to normal. Just pray you don’t have a second outage on the backup site while the first site is busy rebuilding its ASM failure groups.

The consequences of this single ASM weakness reach as far as having to use a third party cluster on a SAN just to be able to make use of the VxVM Dirty Region Logging for large databases. Having to make a “strategic” decision (third party cluster or not) on such a far reaching choice solely based on this is to me a major annoyance.

There are a few promising words in the Oracle 11g new features area posted on the OTN sites about a “Fast Mirror Resync” in the Automatic Storage Management New Features Overview, which should be the long awaited DRL-like ASM rebuild feature. ASM can now resynchronize only the extents that have been modified during the outage. It also looks like a disk group has a DISK_REPAIR_TIME attribute that defines a window in which the failure can be repaired and the mirrored failure group storage array brought back on-line, after which an “ALTER DISKGROUP DISK ONLINE” will start the resynchronization process. What happens if you exceed DISK_REPAIR_TIME is not said.

## July 2, 2007

### Asynchronous checkpoints (db file parallel write waits) and the physics of distance

Filed under: HP-UX,Oracle,Solaris,Storage — christianbilien @ 5:15 pm

The first post (“Log file write time and the physics of distance”) devoted to the physics of distance targeted log file writes and “log file sync” waits. It assumed that:

• The percentage of occupied bandwidth by all the applications which share the pipe was negligible
• No other I/O subsystem waits were occurring.
• The application streams writes, i.e. it is able to issue an I/O as soon as the channel is open.

This set of assumptions is legitimate if the application is indeed “waiting” (i.e. not consuming CPU) on log file writes but not on any other I/O related event, and if the fraction of available bandwidth is large enough for a frame not to be delayed by other applications sharing the same pipe, such as an array replication.

Another common Oracle event is the checkpoint completion wait (db file parallel write). I’ll try to explore in this post how the replication distance factor influences checkpoint durations. Streams of small transactions make the calling program synchronous with the log file write, but checkpoint writes are much less critical by nature because they are asynchronous from the user program perspective. They only hurt response time when “db file parallel write” waits start to appear. The word “asynchronous” could be a source of confusion, but it is not here: checkpoint I/Os are doubly asynchronous, because the I/Os are also asynchronous at the DBWR level.

1. Synchronous writes: relationship of I/O/s to throughput and percent bandwidth

We did some maths in figure 3 of “Log file write time and the physics of distance” aimed at calculating the time to complete a log write. Let’s do the same with larger writes over a 50 km distance on a 2 Gb/s FC link. We’ll also add a couple of columns: the number of I/O/s and the fraction of used bandwidth. 2 Gb/s ≈ 200 MB/s because Fibre Channel uses 8b/10b encoding: every byte travels as 10 bits on the wire.

Figure 1: throughput and percent bandwidth as a function of the I/O size (synchronous writes)

| I/O size (KB) | Time to load (ms) | Round trip latency (ms) | Overhead (ms) | Time to complete an I/O (ms) | I/O/s | Throughput (MB/s) | Percent bandwidth |
|---|---|---|---|---|---|---|---|
| 2 | 0,054 | 0,5 | 0,6 | 1,154 | 867 | 1,7 | 0,8% |
| 16 | 0,432 | 0,5 | 0,6 | 1,532 | 653 | 10,2 | 5,1% |
| 32 | 0,864 | 0,5 | 0,6 | 1,964 | 509 | 15,9 | 8,0% |
| 64 | 1,728 | 0,5 | 0,6 | 2,828 | 354 | 22,1 | 11,1% |
| 128 | 3,456 | 0,5 | 0,6 | 4,556 | 219 | 27,4 | 13,7% |
| 256 | 6,912 | 0,5 | 0,6 | 8,012 | 125 | 31,2 | 15,6% |
| 512 | 13,824 | 0,5 | 0,6 | 14,924 | 67 | 33,5 | 16,8% |
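The figure 1 numbers can be reproduced with a small model. Here is a minimal sketch, assuming the 0,027 ms/KB load slope measured in the “Log file write time” post, the 0,6 ms fixed overhead, and a 0,5 ms round trip for 50 km at 5 µs/km each way (`sync_write_profile` is a name I made up for illustration):

```python
# Assumed parameters, taken from the measurements discussed in this post
SLOPE_MS_PER_KB = 0.027   # load time per KB, from the measured slope
OVERHEAD_MS = 0.6         # fixed array/port overhead
ROUND_TRIP_MS = 2 * 50 * 0.005  # 2 legs of 50 km at 5 us/km

def sync_write_profile(io_kb):
    """Time, rate and bandwidth for one synchronous write of io_kb KB."""
    load = io_kb * SLOPE_MS_PER_KB
    total_ms = load + ROUND_TRIP_MS + OVERHEAD_MS
    iops = 1000.0 / total_ms          # one I/O at a time
    mb_s = iops * io_kb / 1024.0
    pct_bandwidth = mb_s / 200.0      # 2 Gb/s link ~ 200 MB/s payload
    return total_ms, iops, mb_s, pct_bandwidth

for size in (2, 16, 32, 64, 128, 256, 512):
    t, iops, mb, pct = sync_write_profile(size)
    print(f"{size:4d} KB: {t:6.3f} ms  {iops:4.0f} IO/s  {mb:5.1f} MB/s  {pct:5.1%}")
```

The loop reproduces the figure 1 rows: 2KB writes complete in 1,154 ms (867 I/O/s) and 512KB writes in 14,924 ms (about 33,5 MB/s).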

So what change should we expect to the above results if we change from synchronous writes to asynchronous writes?

2. Asynchronous writes

Instead of firing one write at a time and waiting for completion before issuing the next one, we’ll stream writes one after the other, leaving no “gap” between consecutive writes.

Three new elements will influence the expected maximum number of I/O streams in the pipe:

• Channel buffer-to-buffer credits
• Number of outstanding I/Os (if any) the controller can support: 32, for example, for an HP EVA
• Number of outstanding I/Os (if any) the system or a SCSI target can support: on HP-UX, the default number of I/Os that a single SCSI target will queue up for execution is 8; the maximum is 255.

Over 50 km, and knowing that the speed of light in fiber is about 5 microseconds per kilometer, the relationship between the I/O size and the packet length in the pipe is shown in figure 2:

Figure 2: relationship between the I/O size and the packet length in the fiber channel pipe

| I/O size (KB) | Time to load (µs) | Packet length (km) |
|---|---|---|
| 2 | 10,24 | 2 |
| 32 | 163,84 | 33 |
| 64 | 327,68 | 66 |
| 128 | 655,36 | 131 |
| 256 | 1310,72 | 262 |
| 512 | 2621,44 | 524 |

Filling the 50 km pipe with 2KB writes requires a capacity of 25 outstanding I/Os, but only one 128KB packet can be in flight at a time. Again, this statement only holds true if the “space” between frames is negligible.

Assuming a zero gap between 2KB frames, an observation post would see an I/O pass through every 10 µs, which corresponds to 100 000 I/O/s. We are here leaving the replication bottleneck, as other limiting factors, such as the storage arrays and the computers at both ends, will now take precedence. However, only a single 128KB packet will be in the pipe at any given time: the next one has to wait for the previous one to complete. Sounds familiar, doesn’t it? When the packet length exceeds the pipe length, asynchronous I/O brings no benefit over synchronous writes, because asynchronous writes behave synchronously.
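The packet-length reasoning above can be sketched in a few lines, assuming the 200 MB/s payload rate and the 5 µs/km propagation delay used throughout this post (`streams_to_fill` is a name of my own):

```python
import math

LINK_MB_S = 200.0   # 2 Gb/s payload rate after 8b/10b coding
US_PER_KM = 5.0     # speed of light in fiber, ~5 us per km

def packet_length_km(io_kb):
    """Length of an I/O packet in kilometers of fiber."""
    load_us = io_kb * 1024 / LINK_MB_S  # bytes / (MB/s) gives microseconds
    return load_us / US_PER_KM

def streams_to_fill(io_kb, distance_km=50):
    """Outstanding I/Os needed to keep the pipe full (at least one)."""
    return max(1, math.ceil(distance_km / packet_length_km(io_kb)))
```

A 2KB write is 2,048 km long, so `streams_to_fill(2)` gives the 25 outstanding I/Os mentioned above, while a 128KB packet (131 km) already exceeds the 50 km pipe and `streams_to_fill(128)` collapses to 1.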

## June 26, 2007

### Log file write time and the physics of distance

Filed under: HP-UX,Oracle,Solaris,Storage — christianbilien @ 7:46 pm

I already wrote a couple of notes about the replication options available when a production is made of different storage arrays (see “Spotlight on Oracle replication options within a SAN (1/2)” and Spotlight on Oracle replication options within a SAN (2/2)).

These posts came from a real life experience, where both storage arrays were “intuitively” close enough to each other to ignore the distance factor. But what if the distance is increased? The trade-off seems obvious: the greater the distance, the lower the maximum performance. But what is the REAL distance factor? Not so bad in theory.

I’m still interested in the first place in synchronous writes, namely log file writes and the associated “log file sync” waits. I want to know how distance influences the log file write time in volume manager (HP-UX LVM, Symantec VxVM, Solaris Volume Manager or ASM) mirroring. EMC SRDF and HP’s Continuous Access (XP or EVA) synchronous writes could also be considered, but their protocol seems to need 2 round trips per host I/O. I’ll leave this alone pending some more investigation.

In both cases, the remote cache must acknowledge the I/O to the local site before the LGWR’s I/O can complete.

1. Load time and the zero distance I/O completion time.

The speed of light in fiber is about 5 microseconds per kilometer, which means 200km costs 1ms one way. The load time is the time for a packet to completely pass any given point in a SAN. A wider pipe allows a packet to be delivered faster than a narrow pipe.

The load time can also be thought of as the length of the packet in kilometers: the greater the bandwidth, the shorter the packet, and the smaller the packet load time. At 2 Gb/s, a 2KB packet (the typical log write size) is about 2 km long, but it would be 2600 km long on a slow 1.5 Mb/s link.

Zero distance I/O completion time

The zero distance I/O completion time is made of two components:

• A fixed overhead, commonly around 0.5 ms (the tests made and reproduced below in fig. 1 corroborate the fact that the I/O time on a local device only increases by about 10% when the packet size more than doubles). This represents storage array processor time and any delay on the host ports for the smallest packet.
• The load time, a linear function of the packet size.

At the end of the day, the zero distance I/O completion time is:

Slope × Packet size + Overhead

Here is one of the measurements I reported in the “Spotlight on Oracle replication options” posts:

Figure 1: Measured I/O time as a function of the write size for log file writes

| Write size (KB) | I/O time (ms) |
|---|---|
| 2 | 0,66 |
| 5 | 0,74 |

A basic calculation gives:

Slope = (0,74 − 0,66) / (5 − 2) ≈ 0,027 ms/KB

Figure 2: Effect of the frame size on the zero distance I/O completion time

| Frame size (KB) | Zero distance I/O time (ms) |
|---|---|
| 2 | 0,65 |
| 16 | 1,03 |
| 32 | 1,46 |
| 64 | 2,33 |
| 128 | 4,06 |

A small frame such as a log write heavily depends upon the overhead, while the slope (itself a function of the throughput) is predominant for large frames.
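The linear fit above can be sketched directly from the two measured points; this is a hedged reconstruction (the only inputs are the 2KB and 5KB measurements, and rounding the slope to 0,027 ms/KB reproduces the figure 2 values):

```python
def fit_linear(x1, y1, x2, y2):
    """Straight line through two measured (size, time) points."""
    slope = (y2 - y1) / (x2 - x1)
    overhead = y1 - slope * x1
    return slope, overhead

# Measured: 2 KB -> 0.66 ms, 5 KB -> 0.74 ms
slope, overhead = fit_linear(2, 0.66, 5, 0.74)

def zero_distance_io_ms(frame_kb):
    """Zero distance I/O completion time: slope x size + overhead."""
    return slope * frame_kb + overhead

for kb in (2, 16, 32, 64, 128):
    print(f"{kb:4d} KB -> {zero_distance_io_ms(kb):5.2f} ms")
```

The fitted slope is about 0,0267 ms/KB and the fitted overhead about 0,61 ms, which is where the rounded 0,027 and 0,6 used elsewhere in this post come from.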

2. Synchronous I/O time

The transfer round trip (latency) is the last component of the time to complete a single I/O write over distance. It is equal to

2xDistance (km) x 5µsec/km

Figure 3: Time to complete a 2K synchronous write (in ms)

| Distance (km) | Round trip latency (ms) | Time to load (ms) | Overhead (ms) | Time to complete the log write (ms) |
|---|---|---|---|---|
| 10 | 0,1 | 0,654 | 0,6 | 1,354 |
| 20 | 0,2 | 0,654 | 0,6 | 1,454 |
| 30 | 0,3 | 0,654 | 0,6 | 1,554 |
| 40 | 0,4 | 0,654 | 0,6 | 1,654 |
| 50 | 0,5 | 0,654 | 0,6 | 1,754 |
| 60 | 0,6 | 0,654 | 0,6 | 1,854 |
| 70 | 0,7 | 0,654 | 0,6 | 1,954 |
| 80 | 0,8 | 0,654 | 0,6 | 2,054 |
| 90 | 0,9 | 0,654 | 0,6 | 2,154 |
| 100 | 1 | 0,654 | 0,6 | 2,254 |
| 110 | 1,1 | 0,654 | 0,6 | 2,354 |
| 120 | 1,2 | 0,654 | 0,6 | 2,454 |
| 130 | 1,3 | 0,654 | 0,6 | 2,554 |
| 140 | 1,4 | 0,654 | 0,6 | 2,654 |
| 150 | 1,5 | 0,654 | 0,6 | 2,754 |

This is quite interesting: the log writes are only about twice as slow when the distance is multiplied by 15.
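The figure 3 column can be generated directly; a small sketch under the same assumptions (0,654 ms load time for a 2KB write, 0,6 ms overhead, 5 µs/km each way; `log_write_ms` is my own name):

```python
def log_write_ms(distance_km, load_ms=0.654, overhead_ms=0.6):
    """Time to complete a 2KB synchronous log write over distance_km."""
    round_trip_ms = 2 * distance_km * 5 / 1000.0  # two legs at 5 us/km
    return round_trip_ms + load_ms + overhead_ms

# 15x the distance, only ~2x the write time
ratio = log_write_ms(150) / log_write_ms(10)
print(f"10 km: {log_write_ms(10):.3f} ms, 150 km: {log_write_ms(150):.3f} ms, ratio {ratio:.2f}")
```

The ratio comes out at about 2, which is the point of the paragraph above: the fixed load time and overhead swamp the propagation delay at these distances.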

## June 19, 2007

### Spotlight on Oracle replication options within a SAN (2/2)

Filed under: Oracle,Solaris,Storage — christianbilien @ 7:57 pm

This post is a follow up to “Spotlight on Oracle replication options within a SAN (1/2)”. This first post was about the available replication options.

I will address in this post a specific performance aspect I am very concerned about for one of my customers. This is an organization where many performance challenges come down to the commit wait time: the applications trade at the millisecond level, which translates into database log file syncs expressed in hundreds of microseconds. It is a basic DRP requirement that applications must be synchronously replicated over a 2.5 km (1.5 miles) Fibre Channel network between a local and a remote EMC DMX 1000 storage array. The multipathing software is PowerPath; the DMX 1000 volumes may be mirrored from the local array to the remote one by either VxVM, ASM or SRDF.

Two options may be considered:

• Host based (Veritas VxVM, Solaris Disk Suite or ASM) replication
• Synchronous SRDF replication

All options may not always be available, as RAC installations over the two sites require host based replication. On the other hand, simple replication with no clustering may use either SRDF or a volume manager replication.

I made some unitary tests aimed at comparing the SRDF protocol with volume manager replication. Recall that an SRDF mirrored I/O goes into the local storage array cache, and is acknowledged to the calling program only when the remote cache has been updated. A volume manager mirror is no different in principle: the PowerPath policy dictates that both storage arrays must acknowledge the I/O before the calling program considers it complete.

Test conditions:

• This is a unitary test. It is not designed to reflect an otherwise loaded or saturated environment. The conclusion will however shed some light on what’s happening under stress.
• This test is specifically designed to show what happens when performing intense log file writes. The log file write size is usually 2KB, but I have seen it go up to 5KB.
• The test is a simple dd if=/dev/zero of=<target special file> bs=<block size> count=<count>. Reading from /dev/zero ensures that no read waits occurs.

Baseline: Local Raw device on a DMX 1000
Throughput=1 powerpath link throughput x 2

| Block size (KB) | I/O/s | I/O time (ms) | MB/s |
|---|---|---|---|
| 2 | 1516 | 0,66 | 3,0 |
| 5 | 1350 | 0,74 | 6,6 |

Test 1: Distant Raw device on a DMX
Throughput=1 powerpath link throughput x 2

| Block size (KB) | I/O/s | I/O time (ms) | MB/s |
|---|---|---|---|
| 2 | 1370 | 0,73 | 2,7 |
| 5 | 1281 | 0,78 | 6,3 |

The distance degradation is less than 10%. This is the I/O time and throughput I expect when I mirror the array volumes by VxVM or ASM.

Test 2: Local raw device on a DMX, SRDF mirrored
Throughput=1 powerpath link throughput x 2

| Block size (KB) | I/O/s | I/O time (ms) | MB/s |
|---|---|---|---|
| 2 | 566 | 1,77 | 1,1 |
| 5 | 562 | 1,78 | 2,7 |

This is where it gets interesting: SRDF more than doubles the I/O time and cuts the throughput to well under half.
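Taking the 2KB rows from the three tables above (copied as constants), the slowdown can be quantified with a quick sketch:

```python
# (I/O/s, I/O time in ms) for 2 KB writes, copied from the tables above
baseline_local = (1516, 0.66)  # local raw device on a DMX 1000
distant_raw    = (1370, 0.73)  # distant raw device (VM-style mirror target)
srdf_mirrored  = (566, 1.77)   # local raw device, SRDF mirrored

# SRDF vs the distant raw device: how much slower is each I/O?
io_time_factor = srdf_mirrored[1] / distant_raw[1]

# SRDF vs the local baseline: what fraction of the I/O rate survives?
iops_fraction = srdf_mirrored[0] / baseline_local[0]

print(f"SRDF I/O time: {io_time_factor:.1f}x slower, "
      f"I/O rate: {iops_fraction:.0%} of baseline")
```

The numbers work out to roughly 2.4x the I/O time and about 37% of the baseline I/O rate, consistent with the conclusion below.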

Conclusion: When you need log file write performance in order to minimize the log file sync wait times, use a volume manager (including ASM) rather than SRDF. I believe the same kind of result can be expected under EVA or XP Continuous Access. The SRDF mirrored I/Os are even bound to be more impacted by an increasing write load on the storage arrays, as mirroring is usually performed via dedicated ports which bear the load of all of the writes sent to the storage array. This bottleneck does not exist for volume manager replication.

