Troubleshooting: 'Log file sync' Waits

Troubleshooting: 'Log file sync' Waits



TROUBLESHOOTING STEPS


What is a 'log file sync' wait?

When a user session commits, the session's redo information needs to be flushed from memory to the redo logfile to make it permanent.

At the time of commit, the user session will post the LGWR to write the log buffer (containing the current unwritten redo, including this session's redo information) to the redo log file. When the LGWR has finished writing, it will post the user session to notify it that this has completed. The user session waits on 'log file sync' while waiting for LGWR to post it back to confirm all redo changes have made it safely on to disk.

The time between the user session posting the LGWR and the LGWR  posting the user after the write has completed is the wait time for 'log file sync' that the user session will show.
Note that if a sync is ongoing, other sessions that want to commit (and thus flush log information) will also wait for the LGWR to sync and will also wait on 'log file sync'?

What should be collected for initial diagnosis of 'log file sync' waits ?

To initially analyse 'log file sync' waits the following information is helpful:
  • AWR report from a similar time frame and period where 'log file sync' waits are not experienced in order to use as a baseline for reasonable performance for comparison purposes
  • AWR report when 'log file sync' waits are occurring
    Note: The 2 reports should be for between 10-30 minutes each.
  • Lgwr trace file
    The lgwr trace file will show warning messages for periods when 'log file parallel wait' may be high

What causes high waits for 'log file sync'?

Waits for the 'log file sync' event can occur at any stage between a user process posting the LGWR to write redo information and the LGWR posting back the user process after the redo has been written from the log buffer to disk and the user process waking up.
For more information see:
Document:34592.1 WAITEVENT: "log file sync"

In terms of the most common causes, these are:
Details of these causes and how to troubleshoot them are outlined below:

Issues affecting LGWR's IO Performance

The primary question we are looking to answer here is "Is LGWR slow in writing to disk?". The following steps can assist determine whether this the case or not.

Compare the average wait time for 'log file sync' to the average wait time for 'log file parallel write'.

Wait event 'log file parallel' write is waited for by LGWR while the actual write operation to the redo is occurring. The duration of the event shows the time waited for the IO portion of the operation to occur. For more information on "log file parallel write" see: :
Document:34583.1 WAITEVENT: "log file parallel write" Reference Note

Looking at this event in conjunction with "log file sync" shows how much of the sync operation is spent on IO and also, by inference, how much processing time is spent on the CPU.



The example above shows high wait times for both 'log file sync' and 'log file parallel write'

If the proportion of the 'log file sync' time spent on 'log file parallel write' times is high, then most of the wait time is due to IO (waiting for the redo to be written). The performance of  LGWR in terms of IO should be examined. As a rule of thumb, an average time for 'log file parallel write' over 20 milliseconds suggests a problem with IO subsystem.
Recommendations
  • Work with the system administrator to examine the filesystems where the redologs are located with a view to improving the performance of IO.
  • Do not place redo logfiles on a RAID configuration which requires the calculation of parity, such as RAID-5 or RAID-6.
  • Do not put redo logs on Solid State Disk (SSD)
    Although generally, Solid State Disks write performance is good on average, they may endure write peaks which will highly increase waits on 'log file sync'.
    (Exception to this would be for Engineered Systems (Exadata, SuperCluster and Oracle Database Appliance) which have been optimized to use SSDs for REDO)
  • Look for other processes that may be writing to that same location and ensure that the disks have sufficient bandwidth to cope with the required capacity. If they don't then move the activity or the redo.
  • Ensure that the log_buffer is not too big. A very large log_buffer can have an adverse affect  as waits will be longer when flushes occur. When the buffer fills up, it has to write all the data into the redo log file and the LGWR will wait until the last I/O is completed.

Effect of Peaks of IO (or Bursts of IO) on 'log file sync' waits

Log writer (LGWR) tends to write in small bursts of activity as opposed to a large chunks of I/O (which is more normal). Most disks are not setup to work well with small bursts of activity and this might cause IO waits. But since, on average, the disks work fine (they are just not working well with the burst of activity) I/O vendors may report that there is no issue with the disks. This, along with other I/O on the system, may mean that rather than the I/O performance being consistent over time, there may be a burst of higher I/O followed by a period of lower I/O. When averaged out, the peak information is 'lost' in the mass of lower I/O numbers, but these peaks can cause a backup of log data that needs to be written and this manifests itself as high waits for 'log file sync' but where the average waits for 'log file parallel write' are well within normal I/O tolerences.
If you see a situation where you have high waits for 'log file sync' and the average waits for 'log file parallel write' are well within normal I/O tolerences, be aware that this could be because there are peaks in the I/O that the average is not showing you. You could use something like OSWatcher (see: Document 301137.1) to identify I/O peaks and then try to correlate these with peaks in 'log file sync' (which would occur slightly after).  If you identify peaks of I/O then work with your storage vendor to deal with the bursts of activity.

Check LGWR Traces

Even if the average wait for 'log file parallel write' may be in the normal range, there may be peaks where the write time is longer and will therefore influence waits on 'log file sync'. From 10.2.0.4, messages are written in the LGWR trace when a write to the log file takes more than 500 ms. This is quite a high threshold so a lack of messages does not necessarily mean there is no problem. The messages look similar to the following:
*** 2011-10-26 10:14:41.718
Warning: log write elapsed time 21130ms, size 1KB
(set event 10468 level 4 to disable this warning)

*** 2011-10-26 10:14:42.929
Warning: log write elapsed time 4916ms, size 1KB
(set event 10468 level 4 to disable this warning)

Note: Peaks like the following may not have a high influence on the 'log file parallel wait' if they are far between. However , if 100s of sessions are waiting for the 'log file parallel wait' to complete, the total wait for  'log file sync' can be high as the wait time will be multiplied for the  100s of sessions. Therefore it is worth investigating the reason for the high peaks in IO for the log writer.

See:
Document:601316.1 LGWR Is Generating Trace file with "Warning: Log Write Time 540ms, Size 5444kb" In 10.2.0.4 Database
Recommendations
  • Work with the system administrator to examine what else is happening at this time that may be causing the peaks in LGWR writing to disk
  • Truss of the LGWR process when the slow down is occurring may help identify where time is going

Note: These warnings can be  particularly useful for preempting potential issues. Even if a general problem in terms of the average wait time is not been seen, by highlighting extreme peaks of IO performance,  a dba can have a useful indicator that LGWR is encountering intermittent issues. These can then be resolved before they cause outages or similar.

Check to see if redo logs are large enough

A 'log file sync' operation is performed every time the redo logs switch to the next log to ensure that everything is written before the next log is started. Standard recommendations are that a log switch should occur at most once every 15 to 20 minutes. If switches occur more frequently than this, then more 'log file sync' operations will occur meaning more waiting for individual sessions.
  • Check the time between log file switches in alert.log

    Thu Jun 02 14:57:01 2011
    Thread 1 advanced to log sequence 2501 (LGWR switch)
    Current log# 5 seq# 2501 mem# 0: /opt/oracle/oradata/orcl/redo05a.log
    Current log# 5 seq# 2501 mem# 1: /opt/oracle/logs/orcl/redo05b.log
    Thu Nov 03 14:59:12 2011
    Thread 1 advanced to log sequence 2502 (LGWR switch)
    Current log# 6 seq# 2502 mem# 0: /opt/oracle/oradata/orcl/redo06a.log
    Current log# 6 seq# 2502 mem# 1: /opt/oracle/logs/orcl/redo06b.log
    Thu Nov 03 15:03:01 2011
    Thread 1 advanced to log sequence 2503 (LGWR switch)
    Current log# 4 seq# 2503 mem# 0: /opt/oracle/oradata/orcl/redo04a.log
    Current log# 4 seq# 2503 mem# 1: /opt/oracle/logs/orcl/redo04b.log

    In the above example we see log switches every 2 to 4 minutes which is at best 5 times more frequent than the recommendations.
  • You can also check the average time for log switch in the AWR report




    The example above shows that based on the information in AWR, there are 29.98 redo logs switches per hour: ~1 switch every 2 minutes. This is higher than the accepted value of 1 switch every 15-20 minutes and will have an affect on the time foreground process will need to wait for 'log file sync' waits to complete because the overhead of initiating the sync operation more than necessary.
Recommendations
Increase the size of the redo logs
Document:602066.1 How To Maintain and/or Add Redo Logs
Document:779306.1 How To Add/Increase The Size Of Redo Log Files In Rac Environment?

Excessive Application Commits

In this case the question to answer is "Is the Application Committing too Frequently?".
If it is , then the excessive commit activity can cause performance issues since commits flush redo from the log buffer to the redo logs which can cause waits for 'log file sync'.

To identify a potential high commit rate, if the average wait time for 'log file sync' is much higher than the average wait time for 'log file parallel write', then this means that most of the time waiting is not due to waiting for the redo to be written and thus slow IO is not the cause of the problem. The surplus time is CPU activity and is most commonly contention caused by over committing.

Additionally, if the average time waited on 'log file sync' is low, but the number of waits is high, then the application might be committing too frequently.

Compare the average user commits and user rollbacks to user calls

In the AWR or Statspack report, if  the  average user calls per commit/rollback  calculated as "user calls/(user commits+user rollbacks)"   is less than 30, then  commits are happening too frequently:



In the above example we see an average of 5.76 user calls  per commit which is considered high - about 5x higher that recommended.
Rule of thumb, we should expect at least 25 user calls / commit. This of course depends on the application.
Recommendations
  • If there are lots of short duration transactions, see if it is possible to group transactions together so there are fewer distinct COMMIT operations. Since it is mandatory for each commit to receive confirmation that the relevant REDO is on disk, additional commits can add significantly to the overhead. Although commits can be "piggybacked" by Oracle, reducing the overall number of commits by batching transactions can have a very beneficial effect.
  • See if any of the processing can use the COMMIT NOWAIT option (be sure to understand the semantics of this before using it).
  • See if any activity can safely be done with NOLOGGING / UNRECOVERABLE options.

Other Wait Events that may be relevant

Check the AWR report to see if there are other events related to the LGWR that are also shown to be taking a significant amount of time as this may give a clue as to what is causing the problem.
Both foreground and background events should be checked.
For example the following AWR shows high waits for other foreground and background wait events that may actually indicate a problem with the transfer of redo logs to a remote location and result in the foreground processes waiting on "log file sync".
Foreground Wait Events
Background Wait Events

Adaptive Log File Sync

Adaptive Log File sync was introduced in 11.2. The parameter controlling this feature, _use_adaptive_log_file_sync, is set to false by default in 11.2.0.1 and 11.2.0.2.
In 11.2.0.3 the default is now true. When enabled, Oracle can switch between the 2 methods:
  • Post/wait, traditional method for posting completion of writes to redo log
  • Polling, a new method where the foreground process checks if the LGWR has completed the write.
For more information on this feature see:
Document 1541136.1 Waits for "log file sync" with Adaptive Polling vs Post/Wait Choice Enabled

Redo Synch Time Overhead

The statistic, 'redo synch time overhead', included in 11.2.0.4 and 12c, records the difference between ideal and actual, 'log file sync' times.  If the difference is small, then large 'log file sync' waits may be attributed to outlier 'log file parallel write' times.
Known Issues
The following notes may help with known issues in specific environments:
Document 1548261.1 High Waits for 'Log File Sync': Known Issue Checklist for 11.2

Document 1229104.1 LOG FILE SYNC WAITS SPIKES DURING RMAN ARCHIVELOG BACKUPS
Document 1541136.1 Waits for "log file sync" with Adaptive Polling vs Post/Wait Choice Enabled

Sparc:
Document 1523164.1 SPARC: Reducing High Waits on 'log file sync' on Oracle Solaris SPARC by Increasing Priority of Log Writer
Document 1205673.1 'Log File Sync' problem on a Sun Server: A Typical Source for LOGFILE SYNC Performance Problems
Document 10318123.8 Bug 10318123 - Solaris: LGWR regularly stalls for 3 seconds at a time

AIX:
Document 1318709.1 AIX: Things To Check When Seeing Long Log File Sync Time in 11.2

Veritas/Symantec ODM:
Document 1278149.1 Intermittent Long 'log file sync' Waits on Veritas ODM, LGWR Posting Long Write Times, I/O Portion of Wait Minimal
Document 13551402.8 High "log file parallel write" and "log file sync" after upgrading 11.2 with Veritas/Symantec ODM

SUSE Linux:
Document 1942134.1 Higher Log File Synch After OS Upgrade from SUSE Linux Enterprise Server SLES11 SP1 to SLES11 SP3

Other diagnostics to help analyze 'log file sync' waits?

The following script looks at important parameters involved in 'log file sync' waits, 'log file sync' wait histogram data and related information.
Document:1064487.1 Script to Collect Log File Sync Diagnostic Information (lfsdiag.sql)

Data Guard and 'Log File Sync'

Additional time may be taken when Data Guard with synchronous (SYNC) transport and commit WAIT defaults are configured. In the case of Data Guard,  although the above tuning steps still apply, the time for the network write and the RFS/redo write to the standby redo logs need to be considered as well.  How 'Log File Sync' applies to Data Guard is explained in the following white paper:
Document:387174.1 MAA - Data Guard Redo Transport and Network Best Practices.
 Information on the setup of the Primary and Standby Databases can be gathered using the the scripts from the following notes:
Document:241374.1 Script to Collect Data Guard Primary Site Diagnostic Information
Document:241438.1 Script to Collect Data Guard Physical Standby Diagnostic Information

Troubleshooting Other Issues

For guidance troubleshooting other performance issues see:

Document:1377446.1 Troubleshooting Performance Issues

========================================================
Troubleshooting Assistant: Troubleshooting Performance Issues (Doc ID 1543445.2)

Note---> This informationmation taken from oracle metalink. all copy rights oracle only.

Comments