Christian Bilien’s Oracle performance and tuning blog

December 3, 2007

Where is the SAN admin?

Filed under: Storage — christianbilien @ 10:07 pm

Many performance assessments start with the unpleasantness of having to guess a number of configuration items for lack of available information and/or knowledge. Whilst the server which hosts the database usually delivers its little secrets quickly, the storage configuration is frequently harder to obtain from a remote administrator who has to manage thousands of luns. Many databases also suffer from other I/O contributors sharing the storage network and arrays, not to mention absurdities in the DB to storage array mapping.

Here are some little tricks which may be of interest to the information hungry analyst.

This case involves a slow Hyperion database. It may seem only remotely related to Oracle technologies (although Hyperion – the company – was part of the Oracle buying frenzy), but it raises thoughts and ideas I’d like to share.

The database is a 2GB “cube” which only stores aggregations. The server is a 4-core rp3440 with 8GB of memory running HP-UX 11iv2; the storage box is an EMC Clariion CX400. The cube is stored on a single 4GB (!) RAID 5 9+1 lun (nine columns plus one parity stripe unit). The storage is outsourced, no throughput or I/O/s calibration requirements were set in the first place, and the outsourcer probably gave the customer an available lun without further consideration (I’m sure this sounds familiar to many readers). There is no performance tool on the Clariion. We know at least that the raid group is not used by other luns. As a Hyperion consultant has already been through the DB elements, we’ll only focus on providing the required I/O bandwidth to the DB.

We’ll try to answer the following question:

If we knew what the required DB I/O rate was, which RAID configuration would we ask the outsourcer for?

The figures I got at a glance on the server are stable over time:

From sar -u:

System: 30%
User: 20%
Wait for I/O: 50%
Idle (and not waiting for I/O): 0%

The memory freelist length shown by vmstat indicates that less than half the memory is in use.

From sar -d and Glance/UX (an HP performance tool):

%busy: 100
average lun queue length: 40
average time spent in the lun queue: 25ms
I/O rate: 1000 I/O/s (80% reads)
average lun read service time: 6ms

As the lun is obviously running at its maximum I/O/s (for the given read/write mix), we can derive a maximum read rate per disk of 1/6ms = 167 reads/s. Here I assume that reads do not spend a significant time being serviced by the array processors and that the SAN does not introduce any additional delay.
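As a quick sanity check, here is a minimal sketch of that arithmetic (the 6ms figure is the measured lun read service time quoted above; the variable names are mine):

    # Utilization law: U = X * S. At 100% utilization the maximum throughput is X = 1/S.
    read_service_time = 0.006                  # 6 ms average lun read service time (sar -d)
    max_reads_per_disk = 1 / read_service_time
    print(f"max read rate per disk ~ {max_reads_per_disk:.0f} reads/s")   # ~167 reads/s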

We can also derive from the 800 OS reads/s (1000 I/O/s x 80%, spread over the 10 disks of the raid group) an average read rate of about 80 I/O/s per disk, which leaves 167-80=87 I/O/s per disk attributable to the write calls. You may remember the “small write penalty” from Join the BAARF party..: one OS write will generate two physical reads and two physical writes on the disks. Hence we have 44 disk reads and 44 disk writes per disk generated by the OS writes (I rounded 43.5 up to 44). This is approximately consistent with the 20 OS writes/s per disk (200 writes/s spread over the 10 disks), i.e. 40 disk reads + 40 disk writes per disk, if no stripe aggregation occurs. The 10% margin of error can be ignored here.
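A minimal sketch of that decomposition, assuming the 10-disk raid group and the 80/20 read/write split quoted above (the helper names are mine):

    # Split the measured lun load into per-disk reads and writes on the RAID 5 9+1 group.
    lun_io_rate = 1000          # I/O/s seen from the server (sar -d / Glance)
    read_ratio = 0.80
    disks = 10                  # nine columns plus one parity

    os_reads = lun_io_rate * read_ratio           # 800 reads/s
    os_writes = lun_io_rate * (1 - read_ratio)    # 200 writes/s

    reads_per_disk = os_reads / disks             # ~80 reads/s per disk

    # RAID 5 small write penalty: each OS write costs 2 disk reads + 2 disk writes.
    write_ios_per_disk = os_writes * 4 / disks    # ~80 I/O/s per disk (40 reads + 40 writes)

    print(reads_per_disk, write_ios_per_disk, reads_per_disk + write_ios_per_disk)
    # ~80 + ~80 = ~160 I/O/s per disk, close to the ~167 I/O/s per-disk ceiling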

Knowing that the maximum per-disk I/O rate is about 167/s (this should not be taken too literally – it could be rounded down to 150 for example), we can now play with various configurations by computing theoretical per-disk I/O rates. We’ll rule out the configurations for which the per-disk I/O rate exceeds a 150 I/O/s threshold (a small sketch reproducing these figures follows the tables):

 

 

Per-disk I/O rate (I/O/s) by number of RAID 5 columns:

Lun I/O rate   10 cols   12 cols   14 cols   16 cols
1000              167       139       119       104
2000              333       278       238       208
3000              500       417       357       313
4000              667       556       476       417
5000              833       694       595       521

 

 

Per-disk I/O rate (I/O/s) by number of RAID 10 disks:

Lun I/O rate   20 disks   24 disks   28 disks   32 disks
1000                61         51         44         38
2000               122        102         87         76
3000               183        153        131        115
4000               244        204        175        153
5000               306        255        218        191

 

Per-disk I/O rate (I/O/s) with LVM striping over several RAID 5 luns:

Lun I/O rate   RAID 5, 5 cols + 4 LVM cols   RAID 5, 10 cols + 4 LVM cols
1000                         83                             42
2000                        167                             83
3000                        250                            125
4000                        333                            167
5000                        417                            208

(*) 10 columns=20 disks
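For reference, here is a minimal sketch of how the per-disk figures in these tables can be reproduced. It is my reconstruction, not the original spreadsheet; it assumes no stripe aggregation and uses the 22% write ratio mentioned in the comments below, which is why a straight 80/20 computation lands a unit or two away from some cells.

    # Theoretical per-disk I/O rate for a given lun I/O rate and RAID layout.
    # RAID 5: one OS write = 2 disk reads + 2 disk writes (small write penalty).
    # RAID 10: reads are served by either mirror, writes hit both mirrors.

    def raid5_per_disk(lun_rate, disks, read_ratio=0.78):
        reads = lun_rate * read_ratio
        writes = lun_rate * (1 - read_ratio)
        return (reads + 4 * writes) / disks

    def raid10_per_disk(lun_rate, disks, read_ratio=0.78):
        pairs = disks / 2
        reads_per_disk = lun_rate * read_ratio / pairs / 2
        writes_per_disk = lun_rate * (1 - read_ratio) / pairs
        return reads_per_disk + writes_per_disk

    print(raid5_per_disk(1000, 10))     # ~166: the current 9+1 lun, at its ~167 ceiling
    print(raid5_per_disk(1000, 16))     # ~104: 16 RAID 5 columns
    print(raid10_per_disk(1000, 20))    # ~61:  RAID 10 on 20 disks
    print(raid5_per_disk(1000, 5 * 4))  # ~83:  RAID 5, 5 columns striped over 4 LVM columns

Any configuration whose result stays comfortably below the ~150 I/O/s per-disk threshold is a candidate.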

7 Comments

  1. Could you explain where the “800 reads/s” number came from ? Also, I do not understand the three tables, in particular the “LUN I/O rate” column meaning and the derivation of the cell numbers.

    Thanks.

    Comment by Val Carey — December 5, 2007 @ 2:20 am

  2. Val,

    1. I had dropped one line: the I/O rate as seen from the server, 1000 I/O/s (80% reads) (from Glance/UX). Thanks for pointing it out, I added it to the post.
    2. The lun I/O rate is the incoming rate the DB would emit if there were no queuing at the disk array. It is 1000 I/O/s now, but it would increase if the raid group were able to deliver at a faster rate. The cells are individual disk I/O rates. In the first table for example, a 12-column RAID 5 would deliver 139 I/O/s per disk. This computation assumes no stripe aggregation, hence 1 write = 2 reads + 2 writes, evenly split between 2 disks.

    Christian

    Comment by christianbilien — December 5, 2007 @ 9:55 am

  3. OK, thank you.

    I am still uncomfortable with some figures, so if you could spend a bit of time explaining further I’d be grateful.

    Let’s take the LUN IO request arrival rate to be 1000 per sec and consider RAID10 with 10 mirrored disks for simplicity. How did you compute the 61 figure as the individual disk IO request rate ? Why not 10 (one request per each disk) or 100 (a request for the whole stripe), or 50 (absent any info about individual request length distribution) ?

    Thanks.

    Comment by Val Carey — December 5, 2007 @ 1:42 pm

  4. For Raid 10 (on 20 disks):

    – Assuming we have 1000 I/O/s at the OS level, each of the 10 mirrored pairs gets 100 I/O/s, of which 100 x 80% = 80 are reads/s, so each disk actually only serves 40 reads/s (reads alternate between the mirrors).
    – Our write rate is 20/s for each pair of mirrored disks, which translates into 20 writes/s per disk (both mirrors must be written).

    That’s 40+20=60 I/O/s per disk.

    My spreadsheet was actually based on 22% writes, which I rounded to 20% in the post, hence the slight variation.

    Christian

    Comment by christianbilien — December 5, 2007 @ 5:28 pm

    Ah, OK, that makes sense assuming a uniform distribution of I/O requests amongst the disks, with request sizes of no more than one stripe block (stripe size).

    My comment above should actually have been : “Why not 100 (one request per each disk) or 1000 (a request for the whole stripe), or 500 (absent any info about individual request length distribution) ?”

    With your assumptions, 100 translates into 60.

    Thanks.

    Comment by Val Carey — December 6, 2007 @ 1:21 am

  6. Seems like an interesting text, but difficult to understand without previous knowledge:

    ” As the lun is obviously going at its maximum I/O/s (for the given reads and writes)”

    Is this conclusion arrived at because of the ‘%busy: 100’?

    “we can get a maximum I/O read rate per disk of 1/6ms = 167 reads/s.”
    Really? Why?

    ” We can also derive from the 800 reads/s an average OS read rate of about 80I/O/s per disk,”

    looking looking looking – ah, it’s perhaps 1000 X(80% reads). No, the maximum was 167, way bigger than the 800.

    “You may remember the “small write penalty” from Join the BAARF party..: one OS write will generate two physical reads and two physical writes on disks. Hence, we have 44 disk reads and 44 disk writes generated by the OS writes (I rounded up 43.5 to 44).”

    How to link the 800 to 2 times 44?

    The article ends with a list of numbers. What is your point or conclusion here?

    H.

    Comment by helma — January 16, 2008 @ 11:19 am

  7. Hello Helma,

    Sorry I did not answer earlier.

    ==> Is this conclusion arrived at because of the ‘%busy: 100’? No, because of the queue length. It is quite possible to have a 100% busy channel without the underlying volume running at 100% of the disks’ throughput.

    ==> “we can get a maximum I/O read rate per disk of 1/6ms = 167 reads/s.” I just wrote a post about the Utilization law, which can also be used to derive the max I/O rate.

    ==> ” We can also derive from the 800 reads/s an average OS read rate of about 80 I/O/s per disk”: that’s 1000 x 80% reads from the volume/raid group (167 is for a single disk).

    ==> “You may remember the “small write penalty” from Join the BAARF party..: one OS write will generate two physical reads and two physical writes on disks. Hence, we have 44 disk reads and 44 disk writes generated by the OS writes (I rounded up 43.5 to 44).” We know that we have 87 I/O/s per disk charged to the write calls, which in RAID 5 correspond to 2 reads + 2 writes per OS write: reads = 87/2 ≈ 44 I/O/s, writes = 87/2 ≈ 44 I/O/s.

    ==> “The article ends with a list of numbers. What is your point or conclusion here?”: We can now play with various configurations by computing theoretical per-disk I/O rates, and rule out the configurations for which the per-disk I/O rate exceeds a 150 I/O/s threshold (in the original tables, the configurations whose theoretical rate exceeds 150 I/O/s were shown in red).

    Comment by christianbilien — January 29, 2008 @ 4:09 pm

