Christian Bilien’s Oracle performance and tuning blog

December 24, 2007

11g ASM preferred reads for RAC extended clusters

Filed under: Oracle,RAC — christianbilien @ 10:14 am

This summer I expressed my delight over the 11g fast resync option for extended (geographical) RAC clusters, in which the mirrored failure groups have to be on different sites: for example, one failure group on array 1 on site 1 and its mirrored copy on array 2 on site 2. The major 10g limitations in this area are twofold:

  • The loss of an array means the mandatory reconstruction of the whole database.
  • Failgroups have to be reconstructed serially (no more than one rebalance activity per node, per ASM group).

This is really painful, as any scheduled shutdown of one array while the other one stays active amounts to an outage. Imagine what it takes to rebuild 100TB.

Normal redundancy (mirrored) reconstruction is much slower than either the SAN or the storage layout allows (there is room for “research” here – the best I have seen so far is 150GB/hour, that is about 43 MB/s, on 64 disks at 15,000 rpm on the highest-end DMX storage array; the FC ports were 2x 4Gb/s).

The second problem with 10g RAC extended clusters is linked to the way the ASM storage allocator creates primary and mirrored extents. In a disk group made of two failgroups, there is no notion of a “primary failgroup” and a “secondary failgroup”: each failgroup contains an equal number of primary and mirrored copies. Let’s take the example of a diskgroup made of two failgroups, each on a different storage array. For each of the storage arrays, here is the number of primary and mirrored extents for a given group (it holds just one datafile here):

We’ll use the X$KFFXP table. The columns of interest here are:

  • GROUP_KFFXP: disk group number
  • LXN_KFFXP: 0 = primary extent, 1 = mirrored extent
  • DISK_KFFXP: disk on which the extent has been created
  • NUMBER_KFFXP: ASM file number

select disk_kffxp, lxn_kffxp, count(*)
from   x$kffxp x, v$asm_file v
where  group_kffxp = 1
and    x.number_kffxp = v.file_number
and    xnum_kffxp != 65534
group  by disk_kffxp, lxn_kffxp;

 

 DISK_KFFXP  LXN_KFFXP   COUNT(*)
          1          0       1050
          1          1       1050
          2          0       1050
          2          1       1050

The number of primary and mirrored extents is balanced across the disks of group 1. The logic is that while writes will hit both types of extents equally, reads will only access the primary extents. It makes sense to spread primary extents over the two failure groups to maximize throughput when the two failure groups are on an equal footing: same throughput and same I/O response time.
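The same breakdown can be expressed per failure group rather than per disk. This is a sketch only, assuming the group_number/disk_number mapping between x$kffxp and v$asm_disk described in Luca Canali’s ASM metadata notes referenced below:

-- sketch: count primary (0) and mirrored (1) extents per failure group
select d.failgroup, x.lxn_kffxp, count(*)
from   x$kffxp x, v$asm_disk d
where  x.group_kffxp = d.group_number
and    x.disk_kffxp  = d.disk_number
and    x.group_kffxp = 1
and    x.xnum_kffxp != 65534
group  by d.failgroup, x.lxn_kffxp;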

But what happens when the arrays are in different locations? As each instance reads from both arrays, the distance factor will introduce delays on half of the reads. The delay is a function of the distance, but we also need to take into account that the number of storage hops the reads must traverse increases when the remote site has to be reached. Even worse, the inter-site links are prone to congestion, as they must accommodate oversubscription and link contention.

Whenever possible, clever use of RAC data locality will reduce the interconnect burden and the reliance on cache fusion. In 11g it can also lower the inter-site link throughput requirements and possibly speed up reads by servicing them from the local array, assuming that array is not itself overloaded.

The preferred read failure group is set through an instance-wide initialization parameter:

ASM_PREFERRED_READ_FAILURE_GROUPS = diskgroup_name.failure_group_name,…
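As a minimal sketch (the diskgroup name DATA, the failure group names SITE1/SITE2 and the instance names +ASM1/+ASM2 are hypothetical), each ASM instance points to the failure group hosted on its own site, and the read distribution can then be checked per failgroup:

-- on the site 1 ASM instance (hypothetical names)
alter system set asm_preferred_read_failure_groups = 'DATA.SITE1' sid = '+ASM1';
-- on the site 2 ASM instance
alter system set asm_preferred_read_failure_groups = 'DATA.SITE2' sid = '+ASM2';

-- check how reads and writes are spread across the failure groups
select failgroup, sum(reads) reads, sum(writes) writes
from   v$asm_disk
group  by failgroup;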

ASM extent layout reference: Take a look at Luca Canali’s work: a peek inside Oracle ASM metadata

— Updated: check the comment below for Luca’s updated links – very interesting indeed —

December 13, 2007

One of Pavlov’s dogs (2/2)

Filed under: Oracle,RAC,Solaris — christianbilien @ 9:15 pm

I didn’t have much luck with the Pavlov’s dogs challenge: not even a single attempt at an explanation. Maybe it was too weird a problem (or, more embarrassingly, it did not interest anyone!).

A quick reminder about the challenge: netstat shows almost no activity on an otherwise loaded interconnect (500Mb/s to 1Gb/s inbound and similar values outbound, as seen on the Infiniband switches and as calculated from the product of the PX remote messages recv’d/sent statistics and parallel_execution_message_size).

Well, anyway, here is what I think is the answer: the key piece of information I gave was that the clusterware was running RDS over Infiniband. Infiniband HCAs have an inherent advantage over standard Ethernet network interfaces: they embed RDMA, which means that transfers are handled without interrupting the CPUs, because the sending node reads and writes the receiving node’s user-space memory directly, without going through the usual I/O path. TCP/IP NICs, by contrast, generate a stream of interrupts the CPUs have to process, because TCP segments have to be reassembled while other threads are running.

The most likely cause of the netstat blindness is just that it cannot see the packets because the CPUs are unaware of them.

To quote the Wikipedia Pavlov’s dogs article, “the phrase ‘Pavlov’s dog’ is often used to describe someone who merely reacts to a situation rather than use critical thinking”. That’s exactly what I thought of myself when I was trying to put the blame on the setup instead of thinking twice about the “obvious” way of measuring network packet throughput.

December 9, 2007

One of Pavlov’s dogs (1/2)

Filed under: Oracle,RAC,Solaris — christianbilien @ 7:44 pm

That’s what I thought of myself after spending an hour trying to figure out what I had done wrong in my setup.

So here is a little challenge I derived from this experience:

1. I set up a 5-node 10gR2 RAC on a mix of 4-core Xeon and Opteron servers. The interconnect is RDS over Infiniband HCAs. The setup was done on OpenSolaris. I know, RAC is not supported on OpenSolaris, but we were advised by Sun Microsystems that the Infiniband layer was significantly faster on OpenSolaris than on Solaris 10. Some ftp tests indeed showed that even IPoIB was 20% faster on OpenSolaris than on Solaris 10. So I had to tweak the Oracle installer and could not use any of the Oracle GUIs, but I got it working.

2. select /*+ full(c) full(o) */ * from soe.product_descriptions c,soe.product_information o where c.product_id=o.product_id and c.product_id>300000;

was run simultaneously 10 times on each instance (this is incidentally a database generated by Swingbench). The DOP varies from 10 to 40, but it does not make much difference as far as our challenge is concerned.

I then plotted the ‘PX remote messages recv’d’ and ‘PX remote messages sent’.

Parallel_execution_message_size=8192

I could get the following peak throughput (5s interval) by multiplying parallel_execution_message_size by the PX remote messages figures:

PX remote messages recv’d (Mb/s)   PX remote messages sent (Mb/s)
 792.82                             1077.22
1035.78                              756.21
 697.01                              885.18
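For reference, a minimal sketch of how such figures can be sampled (the post collected them per session from v$px_sesstat; the same counters are also visible system-wide in v$sysstat). Take two snapshots a few seconds apart and convert the delta: Mb/s = delta_messages x parallel_execution_message_size x 8 / interval_in_seconds / 10^6.

-- sample the PX message counters (run twice, e.g. 5 seconds apart)
select name, value
from   v$sysstat
where  name in ('PX remote messages recv''d', 'PX remote messages sent');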

I am not taking into account the GES and GCS messages, nor did I count the cache fusion blocks; both were small anyway. The weird thing came when I tried to measure the corresponding HCA input/output packet rates with netstat:

Input Packets/s   Output Packets/s
69                 73
39                 35
58                  6

Almost no traffic on the interconnect (the HCA MTU is 2044 bytes)!

Let’s check that the interconnect is running over the intended HCA:

netstat -ni

Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
lo0 8232 127.0.0.0 127.0.0.1 32364353 0 32364353 0 0
ibd0 2044 10.0.0.0 10.0.0.1 27588368 0 26782527 0 0
nge1 1500 10.x.x.x 10.x.x.x 640710585 0 363462595 0 0

SQL> oradebug setmypid
SQL> oradebug ipc
SQL> oradebug tracefile_name

[…]
SSKGXPT 0x64362c8 flags SSKGXPT_READPENDING socket no 8 IP 10.0.0.1
[…]

Just to be sure:

SQL> select indx,picked_ksxpia,ip_ksxpia from x$ksxpia;
      INDX PICKED_KSXPIA IP_KSXPIA
---------- ------------- ---------------
         0 CI            10.0.0.1

Question: what could possibly justify the discrepancy between the netstat output and the values collected from v$px_sesstat?

Hint: it is not a bug.

 

December 3, 2007

Where is the SAN admin ?

Filed under: Storage — christianbilien @ 10:07 pm

Many performance assessments start with the unpleasantness of having to guess a number of configuration items for lack of available information and/or knowledge. While the server that hosts the database usually delivers its little secrets quickly, the storage configuration is frequently harder to obtain from a remote administrator who has to manage thousands of luns. Many databases suffer from other I/O contributors sharing the storage network and arrays, not to mention absurdities in the DB-to-storage-array mapping.

Here are some little tricks which may be of interest to the information-hungry analyst.

This case involves a slow Hyperion database. Of course it may seem only remotely related to Oracle technologies (although Hyperion – the company – was part of the Oracle buying frenzy), but it still raises thoughts and ideas I’d like to share.

The database is a 2GB “cube” which only stores aggregations. The server is a 4-core rp3440 with 8GB of memory running HP-UX 11iv2; the storage box is an EMC Clariion CX400. The cube is stored on a single 4GB (!) lun on a RAID 5 9+1 group (nine columns plus one parity stripe unit). The storage is outsourced, no throughput or I/O rate requirements were stated in the first place, and the outsourcer probably handed the customer an available lun without further consideration (I’m sure this sounds familiar to many readers). There is no performance tool on the Clariion. We know at least that the raid group is not shared with other luns. As a Hyperion consultant has already been through the DB elements, we’ll only focus on providing the required I/O bandwidth to the DB.

We’ll try to answer the following question:

If we knew what the requested DB I/O rate was, which RAID configuration would we ask the outsourcer for?

The figures I got at a glance on the server are stable over time:

From sar -u:

System: 30%
User: 20%
Wait for I/O: 50%
Idle (and not waiting for I/O): 0%

The memory freelist length shown by vmstat indicates that less than half the memory is in use.

From sar -d and Glance UX (an HP performance tool):

%busy: 100
average lun queue length: 40
average time spent in the lun queue: 25ms
I/O rate: 1000 I/O/s (80% reads)
average lun read service time: 6ms

As the lun is obviously running at its maximum I/O rate (for the given mix of reads and writes), and the average read service time is 6ms, the maximum read rate per disk is 1/6ms ≈ 167 reads/s. Here I assume that reads do not spend significant time being serviced by the array processors and that the SAN does not introduce any additional delay.

We can also derive from the 800 reads/s an average OS read rate of about 80 I/O/s per disk (10 disks in the raid group), which leaves 167-80=87 I/O/s per disk attributable to the write calls. You may remember the “small write penalty” from Join the BAARF party: one OS write generates two physical reads and two physical writes on disk. Hence the OS writes account for about 44 disk reads/s and 44 disk writes/s per disk (43.5 rounded up to 44). This is approximately consistent with the 20 OS writes/s per disk = 40 disk reads + 40 disk writes we would expect if no stripe aggregation occurs. The 10% margin of error can be ignored here.

Knowing the maximum rate of about 167 I/O/s per disk (this should not be taken too literally – it could be rounded down to 150, for example), we can now play with various configurations by computing theoretical per-disk I/O rates. We’ll rule out the configurations for which the per-disk I/O rate exceeds a 150 I/O/s threshold:

 

 

Per-disk I/O rate (I/O/s) by number of RAID 5 columns:

Lun I/O rate      10      12      14      16
1000             167     139     119     104
2000             333     278     238     208
3000             500     417     357     313
4000             667     556     476     417
5000             833     694     595     521

 

 

Per-disk I/O rate (I/O/s) by number of RAID 10 disks:

Lun I/O rate      20      24      28      32
1000              61      51      44      38
2000             122     102      87      76
3000             183     153     131     115
4000             244     204     175     153
5000             306     255     218     191

 

Per-disk I/O rate (I/O/s) for RAID 5 striped over 4 LVM columns:

Lun I/O rate   Raid 5 5 col + 4 LVM cols   Raid 5 10 col + 4 LVM cols
1000                      83                          42
2000                     167                          83
3000                     250                         125
4000                     333                         167
5000                     417                         208

(*) 10 columns = 20 disks
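For the record, the arithmetic behind these tables can be reproduced, to within a unit of rounding, with the model derived above: 80% reads, a RAID 5 OS write costing roughly 4.3 disk I/Os (2 reads + 2 writes plus the measured margin) and a RAID 10 OS write costing roughly 2.1 disk I/Os. A rough sketch for the 10-column RAID 5 and 20-disk RAID 10 cases, with the write inflation factors back-fitted from the measurements above:

-- sketch only: per-disk I/O rate as a function of the lun I/O rate
-- (0.8 read ratio; write inflation factors 13/3 for RAID 5 and 2.11
--  for RAID 10 are back-fitted from the figures measured earlier)
select lun_iops,
       round(lun_iops * (0.8 + 0.2 * 13/3) / 10) raid5_10_columns,
       round(lun_iops * (0.8 + 0.2 * 2.11) / 20) raid10_20_disks
from   (select 1000 * rownum lun_iops from dual connect by rownum <= 5);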
