Christian Bilien’s Oracle performance and tuning blog

February 12, 2008

The “log file sync” wait event is not always spent waiting for an I/O

Filed under: Oracle,Storage — christianbilien @ 4:41 pm

I went through a rather common problem some times ago, which shows how one can get easily fooled by the “log file sync” wait event. Here is a log file sync wait histogram which was interpreted as a poor storage array response time by the DBAs, often a good way to get off the hook of a performance problem:

duration_of_wait_1.jpg

This looks like a Gaussian distribution with a mean value of maybe 30ms. The same event recorded some weeks earlier shows a mean value located between 8 and 16 ms.

duration_of_wait_2.jpg

Each log file sync may wait up to an average of 101ms over an hour:

Event                        Avg wait Time (ms)
------------------------------------------
log file sync                       101

The storage array write destination is a rather loaded EMC DMX1000, and the multiplexed log files are SRDF mirrored. SRDF is the EMC protocol which replicates I/Os from one array to another one. The replication is synchronous here, meaning the remote cache array has to acknowledge each write to the local one before returning a completion status to the LGWR. The SRDF write time must be added to the average write time to reflect the array response time. Some overhead also exists in the OS layer and possibly at the SAN level. Both redo log groups have similar write access patterns. The SRDF write time is close to 30ms over this period of time, a very bad figure. The average write time on the array is 4ms.

This is where a bit of dig in may be interesting:

All calls are emitted from a single program which basically does some small transactions and commits. We’ll consider that no other conditions may trigger a write into the redo log other than a commit.

The average log file parallel write wait over the same hours is not good, but far from the by the 101ms log file sync:

Event                          Avg wait Time (ms)
------------------------------------------
log file parallel write              37

This fits with the array I/O times.

A quick recap on the ‘log file sync’ wait event may be necessary here:

The log file sync wait event is the time the foreground processes spends waiting for the redo to be flushed. It can be assumed here that the LGWR is always active, it does not have to be awaken by the system dispatcher.The log file sync wait can be broken into:

  1. The ‘redo write time’: this is the total elapsed time of the write from the redo log buffer to the current redo log file (in centiseconds).
  2. The “log file parallel write” is basically the time for the log write I/O to complete
  3. The LGWR may have some post processing to do, then signals the waiting foreground process that the write has completed. The foreground process is finally waken up by the system dispatcher. This completes the ‘log file sync’ wait.

I also extracted two other log write statistics from the awr report:

Statistic          Total     per Second     per Trans
-----------------------------------------------------------------------------
redo write time  275,944           77.6           0.2
user commits   1,163,048          327.1           1.0

776/327 = 2.3ms per user commits of “redo write time”

Step 3 therefore accounts for 101-2.3-37=61,7ms. The most likely cause of the main contributor to the log file sync can be found here:

sar -q -f /var/adm/sa/sa07
         runq-sz %runocc swpq-sz %swpocc
05:00:01   15.9      48     0.0      0
05:05:01   24.1      42     0.0      0
05:10:01   25.2      46     0.0      0
05:15:00   15.5      38     0.0      0
05:20:01   15.6      43     0.0      0
05:25:00   14.4      38     0.0      0
05:30:00   17.2      36     0.0      0
05:35:00   18.0      47     0.0      0
05:40:00   16.9      38     0.0      0
05:45:01   12.3      33     0.0      0
05:50:00   10.1      41     0.0      0
05:55:00   11.4      39     0.0      0
06:00:00   11.1      47     0.0      0

Look at the first column, which shows the CPU run queue size. Many processes are competing for the CPUs. Although we could carry on the investigation by measuring how much time foreground processes are spending in the run queue, we can at this point conclude that the processes are spending about two thirds of the log file sync wait times waiting for the CPU, not for the I/O subsystem.

It would also be interesting for the sake of completeness to understand why the %runocc is relatively small compared to the average run queue size.

About these ads

19 Comments

  1. “It would also be interesting for the sake of completeness to understand why the %runocc is relatively small compared to the average run queue size.”

    ‘runocc’ shows percentage of time when the the run queue had at least one element. Both values are updated once a second I think. Since about 50% of time the queue was empty, one has to assume that there were bursts of ready to run processes, say, 150 times the queue was empty and 150 times it had 30 processes in it. ‘runq-sz’ is just a 5 min average.

    On the other hand, you can have ‘runq-sz’ equal to one and ‘runocc’ equal to 100% with just N+1 CPU bound processes where N is the number of CPUs.

    Comment by Val — February 13, 2008 @ 1:18 am

  2. I think the run queue values from sar vary by system. I always presumed that they would be some kind of accurate maths, but when I looked at the Solaris source code they weren’t (opensolaris.org). I think they are accurate though, but it just didn’t do it the way I assumed it would.

    Solaris tracks things when they change state or value, and keeps a running total of the time spent in each state or value. So when the run queue is greater than zero, the run queue occupied time is increased. If the run queue drops back to zero, then the run queue occupied time is not increased. Likewise Solaris keeps a running count of the size of the run queue multiplied by the time it was at that size. Therefore in any sample period %runocc is the run queue occupied time divided by the sample time. And the run queue average size is the ‘total size multiplied by the time at that size during this period’ divided by the ‘elapsed time when there was something in the queue’ i.e. the average queue size when there was something in the queue.

    The actual line that prints out the ‘sar -q’ from OpenSolaris is:

    (void) printf(” %7.1f %7.0f”,
    (float)xx->si.runque / (float)xx->si.runocc,
    (float)xx->si.runocc / (float)xx->si.updates * 100.0);

    sar is linked to sadc, which actually collects all of these statistics from the kernel, and then sends them to sar for displaying. And on Solaris sadc actually uses the public kstat interface to get at these kernel statistics. So there is nothing stopping you writing something similar to sar, using the kstat interface, and displaying whatever derived CPU utilisation figures you want. I have done this a few times, and you can get at a much richer set of data than what sar gives you. That is the downside of using something which is very old, and is the lowest common subset of available data on UNIX systems.

    On a Linux system I think you can use the files in ‘/proc’ to get current system data, such as CPU utilisation and run queues.

    I agree with Val that the high run queue with less than 100% occupation indicates some kind of activity storm is occurring in very short bursts. I would guess at either latch contention or semaphore wake up storms. This would depend on how many CPUs you have in the system (you don’t mention) and other factors such as the spin_count setting (which I now believe is hidden in 10g). These kind of contention storms can be difficult, and can often be worse on systems with more CPUs than those with less. I have seen systems run slower when more CPUs were added, from 4 to 8 to 16, as the contention and busy spinning increased.

    You can also run into the scenario where the process with the latch is kicked off a CPU due to its timeslice ending, and then all the other processes just spin busily on the CPU trying to get the latch but failing. Because they are 100% active on a CPU, and do no I/O they can run to the end of their timeslice quanta, before giving up the CPU. This means that the process with the latch is delayed for some time before it can get back on the CPU and finish off what it started, and then free the latch. The system looks 100% CPU and very busy, but the actual throughput is very low, and is caused by the busy spinning blocking the working process getting on with its work.

    John

    Comment by John — February 13, 2008 @ 11:31 am

  3. John & Val,

    I basically had the same ideas about a “storm” (run queue occupied 40 to 50% of the time with a large run queue). I’m not sure I’ll be able to further investigate this case, but I suspect something like a sleeping wait on an event which triggers all the waiting processes at once, causing a high activity for 2-3′, and then waiting again.

    Thanks for the comments,

    Christian

    Comment by christianbilien — February 13, 2008 @ 12:01 pm

  4. Christian,

    I had not spotted originally that the sar timings were 5 minutes apart, which can hide a lot of things going on. If I was investigating this I would drop down to 10 second, or even 1 second samples, running sar interactively on the system to see if there are such storms occuring.

    Another thing I have seen is not latch / semaphore storms but fork / exec storms burning up CPU. It is a really bad way to write monitoring scripts, but some people do it. They put an entry in cron, or write a loop that sleeps for 10 seconds, and then it runs another program to gather some data from somewhere and write it somewhere else. So every 10 seconds this monitoring program gets run briefly, and it does a fork / exec to run the other program. I have seen systems where they have these running every second, and have a number of them all gathering different system statistics. I believe BMC Patrol used to work like this in the past, but has now been rewritten to be a self contained single daemon process.

    You can check this by looking at ‘sar -c’ output for system calls, which includes forks and execs. In a static system these should be very low (10’s) and not in the 100’s. I believe that these figures are normalised to every second values anyway in the sar output.

    John

    Comment by John — February 13, 2008 @ 2:08 pm

  5. [...] Bilien demonstrates that the “log file sync” wait event is not always spent waiting for an I/O, it may be trying to get some CPU [...]

    Pingback by Log Buffer #84: a Carnival of the Vanities for DBAs — February 15, 2008 @ 6:34 pm

  6. An interesting post, I am currently investigating log file sync wait times on our
    performance test system by piecing together the data from the AWR reports and taking
    into account OS performance.

    In a 30 second window where a data loading task was writing 3MB redo per second with
    a small number of OLTP trx/s we had:

    51 foreground log file sync waits with an average of 85 ms
    126 log file parallel write with an average of 68 ms

    Redo write time of 879 cs
    commits 91
    rollbacks 143 (Issue with jdbc driver on session close)

    This gives us 61.4ms per trx.

    The oracle documentation states:

    The log file sync wait may be broken down into the following components:
    1. Wakeup LGWR if idle
    2. LGWR gathers the redo to be written and issue the I/O
    3. Time for the log write I/O to complete
    4. LGWR I/O post processing
    5. LGWR posting the foreground/user session that the write has completed
    6. Foreground/user session wakeup

    Tuning advice based on log file sync component breakdown above:
    Steps 2 and 3 are accumulated in the “redo write time” – 61.4ms
    Step 3 is the “log file parallel write” wait event. – 68 ms
    Steps 5 and 6 may become very significant as the system load increases.
    This is because even after the foreground has been posted it may take a some
    time for the OS to schedule it to run. May require monitoring from O/S level.

    Redo write time should be more that the log file parallel write.

    The data above may be purely be down to statistic errors in the database.

    What is also interesting is that if I add the redo write time to the log file parallel
    write the time exceeds the log file sync wait time. This is understanable if the Oracle
    documentation is to be believed as log file parallel write is included in the redo write time!!!

    Also in this scenario the OS does not appear to be loaded so I am not sure why the log
    file sync wait event is so much more inflated.

    Any thoughts?

    Adrian

    Comment by Adrian — February 28, 2008 @ 11:28 am

  7. Manly Men Only Use Solid State Disk For Redo Logging. LGWR I/O is Simple, But Not LGWR Processing

    Let’s look at some statspack data from an OLTP Order Entry benchmark run that achieved 133 transaction per second (TPS):
    Top 5 Timed Events Avg %Total
    ~~~~~~~~~~~~~~~~~~ …

    Trackback by Kevin Closson's Oracle Blog: Platform, Storage & Clustering Topics Related to Oracle Databases — March 26, 2008 @ 5:17 am

  8. Don’t Forget ARCH When Pondering Redo and Throwing Partners Under The Bus For Fun, Not Profit.

    In my recent post about LGWR processing, I used a title that fits my recurring Manly Men motif just for fun. The title of the post was Manly Men Only Use Solid State Disk for Redo Logging. The post then set out to show that LGWR performance issues are …

    Trackback by Kevin Closson's Oracle Blog: Platform, Storage & Clustering Topics Related to Oracle Databases — March 26, 2008 @ 5:19 am

  9. Hi Christian.
    Sorry for may english.

    I would like know if I understud right. Staspack, report “redo write time” and “user commits” like “per Second”.

    I read spcpkg.sql in ${ORACLE_HOME}/rdbms/admin. In this file the snapshot of statspacke run just a copy of this statistics from v$sysstat view, so their meaning is like original:

    redo write time:
    “Total elapsed time of the write from the redo log buffer to the current redo log file in 10s of milliseconds”

    user commits:
    “Number of user commits. When a user commits a transaction, the redo generated that reflects the changes made to database blocks must be written to disk. Commits often represent the closest thing to a user transaction rate.”

    This mean that “redo write time” have a second (cs) as unit of measurement, meanwhile user commits is dimensionless.

    This mean that (in your example):
    “77.6 cs” => 776 x10^-1 cs => 776 ms
    “327.1” is just the number of commit.

    This mean 776 [ms] / 327 [no dimension] => 2.3 [ms]

    Is it rigth. I’m confused by statspack report, because each “Statistic” is reported in “per Seconds”.

    Just another think. Should I count the user rollbacks statistics whith user commits statistics?

    Thank you in advice.
    Andrea

    Comment by Andrea — July 1, 2008 @ 9:48 am

  10. ehm…..may=my :(

    Comment by Andrea — July 1, 2008 @ 9:49 am

  11. Hi Christian,

    Nice post. I could possibly extend these discussions to come to some sort of explanation for the behavior you saw. From the discussions, I conclude these:

    (1) The log file waits was not because of I/O; but more due to CPU starvation at that time
    (2) The cause of the CPU starvation is not known; but we know that it came in bursts.

    And we know that the log file posting from the log buffer occurs when one these events occurs:

    (1) Log buffer is 1/3 full
    (2) log buffer is 1 MB full
    (3) checkpoint occurs
    (4) commit is issued

    So, it possible that any of the above events occurred at that time. Most likely, #3 (checkpoint) occurred, which caused a burst of I/O. Since SRDF was synchronous, the burst of I/O consumed a lot of CPU cycles in realtively small timeframe, which accounts for the surge in run queues.

    Does it sound like an explanation?

    We use SRDF in asynch more for DR; and we do see a lot of CPU waiting for I/O whenever checkpoint occurs. I have seen that the more the number of datafiles, the longer the checkpoint process and at the end of the process, the I/O waits just surge. When disabled the SRDF/A activity, the surges were stil there but were dramatically reduced in amplitude.

    Thanks and regards,

    Arup

    Comment by Arup Nanda — October 21, 2008 @ 2:41 pm

  12. Hi Arup,

    My point was the difference between log file parallel write and the log file sync wait time. Of course I/O time plays a role, but much more on the log file write time (log file parallel write) than on the checkpoint completion for which there was no significant foreground wait. From the post:

    “The average log file parallel write wait over the same hours is not good, but far from the by the 101ms log file sync:

    Event Avg wait Time (ms)
    ——————————————
    log file parallel write 37

    This fits with the array I/O times.”

    Since then, we dramatically upgraded the server (from 4 Sparc III to an 8 Sparc IV 2 cores), but still on the same array (though we switched from array based mirroring (srdf sync) to ASM mirroring (sync also, of course)). The log file parallel write is now 2-3 ms, the run queue is close to 0, the log file sync 5-6 ms.

    regards

    Christian

    Comment by christianbilien — November 3, 2008 @ 10:15 am

  13. Hello Christian,

    I am currently investigating performance problem to with log file sync and log file parallel write.

    What I see there is that I have more log file parallel write events per second than log file sync events.
    The number of log file parallel writes is about twice the number of log file syncs.

    Almost every source I read assumes that because of the piggy backing there is 1 log file parallel write for 1 or more log file syncs.

    Regards Hans-Peter

    Comment by Hans-Peter Sloot — November 6, 2008 @ 11:53 am

  14. [...] 30th Nov 2008:] Here’s another interesting article on a closely related topic from Christian Bilien – in this case from a 10g [...]

    Pingback by Log file write « Oracle Scratchpad — November 30, 2008 @ 10:26 pm

  15. Hi,

    I am also investigating this same issue on siebel database, we are seeing high log file sync wait while importing siebel repository import( not oracle imp/exp). where the commit is issued every 2 to 4 records, there are 297 tables to import with 0~million records in each table. in oracle terms its just purly insert.. sql statements. usually it used to take anywhere btw 2 to 4 hrs, but now its taking about 10 to 14 hrs. main wait event is log file sync. We are suspecting I/O related issues from storage, when i move this redo logs to /tmp [ on solaris] i.e RAM for testing purpose. then the process completed it 1.5 hrs and i don’t see any more log file sync wait.

    Now does it mean that its the issues with I/O ? which other areas i can look for more clues ?

    Thanks
    CSReddy

    Comment by CSReddy — February 8, 2009 @ 9:10 am

  16. Here are AWR report details while redo are on netapps file storage with Veritas file system.

    LOG FILE SYNC 87 AVG WAIT TIME(ms)
    Log file parallel write 35 avg wait time(ms)

    Statistic Total per Second per Trans
    —————————————————————————–
    redo write time 264,645 74.53 2.23
    user commits 118,617 33.40 1.00

    based on above method calculations..

    745/33 = 22.5ms redo write time

    87-22.5-35=29ms

    CSReddy

    Comment by CSReddy — February 8, 2009 @ 9:27 am

  17. Log file parallel write 35 avg wait time(ms) ==> This is very high indeed for a netapp file storage. You should get at worst a few milliseconds here. I’m sure you know you are likely to lose your database by putting your log files on /tmp if your machine reboots.

    Regards

    Christian

    Comment by christianbilien — February 8, 2009 @ 7:04 pm

  18. [...] Bilien – Log file sync wait GA_googleAddAttr("AdOpt", "1"); GA_googleAddAttr("Origin", "other"); [...]

    Pingback by Log File Write And Waits « Vishal Gupta's Blog — June 18, 2011 @ 8:02 pm


RSS feed for comments on this post. TrackBack URI

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: