Recover database after disk loss (Including Online redo logs)

Recover database after disk loss (Doc ID 230829.1)

Loss due to Disk Failure
------------------------
What can we lose due to disk failure:
A) Control files
B) Redo log files
C) Archivelog files
D) Datafiles
E) Parameter file or SPFILE
F) Oracle software installation

Detecting disk failure
-----------------------
1) Run copy utilities like "dd" on unix
2) If using RAID mechanisms like RAID 5, parity information may mask 
    the disk failure and more vigorous check would be needed
3) As always, check the Operating system log files
4) Another obvious case would be when the disk could not be seen
    or mounted by the OS.
5) On the Oracle side, run dbv if the file affected is a datafile
6) The best way to detect disk failure is by running Hardware 
diagnostic tools and OS specific disk utilities.

Next Action
------------
Once the type of failure is identified, the next step is to rectify them.

Options could be:
(1) Replace the corrupted disk with a new one and mount them with 
     the same name (say /oracle or D:\)
(2) Replace the corrupted disk with a new one and mount them with 
     a different name (say /oracle1 as the new mount point)
(3) Decide to use another existing disk mounted with a different name
     (say /oracle2)

The most common methods are (1) AND (3).

Oracle Recovery
---------------
Once the disk problem is sorted, the next step is to perform recovery
at the Oracle level. This would depend on the type of files that is lost (see
"Loss due to Disk Failure" section) and also on the type of disk recovery done
as mentioned in the "Next Action" section above.

(A) Control Files
------------------
Normally, we have multiplexing of controlfiles and they are expected to be
placed in different disks.

If one or more controlfile is/are lost,mount will fail as shown below:
SQL> startup
Oracle Instance started
....
ORA-00205: error in identifying controlfile, check alert log for more info

You can verify the controlfile copies using:
SQL> select * from v$controlfile;

   **If atleast one copy of the controlfile is not affected by the disk failure, 
   When the database is shutdown cleanly:
   (a) Copy a good copy of the controlfile to the missing location
   (b) Start the database 

   Alternatively, remove the lost control file location specified in the
   init parameter control_files and start the database.

   **If all copies of the controlfile are lost due to the disk failure, then:
   Check for a backup controlfile. Backup controlfile is normally taken using 
   either of the following commands:
   (a) SQL> alter database backup controlfile to '/backup/control.ctl';
    -- This would have created a binary backup of the current controlfile --

    -->If the backup was done in binary format as mentioned above, restore the 
       file to the lost controlfile locations using OS copying utilities.
    --> SQL> startup mount;
    --> SQL> recover database using backup controlfile;
    --> SQL> alter database open;

   (b) SQL> alter database backup controlfile to trace;
    -- This would have created a readable trace file containing create controlfile
    script --

    --> Edit the trace file created (check user_dump_dest for the location) and
        retain the SQL commands alone. Save this to a file say cr_ctrl.sql
    --> Run the script
    
    SQL> @cr_ctrl

    This would create the controlfile, recover database and open the database.

    ** If no copy of the controlfile or backup is available, then create a controlfile
    creation script using the datafile and redo log file information. Ensure that the
    file names are listed in the correct order as in FILE$.
    Then the steps would be similar to the one followed with cr_ctrl.sql script.


Note that all controlfile related SQL maintenance operations are done in the 
database nomount state


(B) Redo logs
    ---------
In normal cases, we would not have backups of online redo log files. But the 
inactive logfile changes could already have been checkpointed on the datafiles
and even archive log files may be available.


Recovering After the Loss of All Members of an Online Redo Log Group

If a media failure damages all members of an online redo log group, then different scenarios can occur depending on the type of online redo log group affected by the failure and the archiving mode of the database.

If the damaged log group is active, then it is needed for crash recovery; otherwise, it is not.


If the group is . . . Then . . . And you should . . .

Inactive It is not needed for crash recovery Clear the archived or unarchived group.
Active It is needed for crash recovery Attempt to issue a checkpoint and clear the log; if impossible, then you must restore a backup and perform incomplete recovery up to the most recent available redo log.
Current It is the log that the database is currently writing to Attempt to clear the log; if impossible, then you must restore a backup and perform incomplete recovery up to the most recent available redo log.




SQL> startup mount
     Oracle Instance Started
     Database mounted
     ORA-00313: open failed for members of log group 1 of thread 1
     ORA-00312: online log 1 thread 1: '/ORACLE/ORADATA/H817/REDO01.LOG'
     ORA-27041: unable to open file
     OSD-04002: unable to open file
     O/S-Error: (OS 2) The system cannot find the file specified.

** Verify if the lost redolog file is Current or not.
     SQL> select * from v$log;
     SQL> select * from v$logfile; 

     --> If the lost redo log is an Inactive logfile, you can clear the logfile:

     SQL> alter database clear logfile '/ORACLE/ORADATA/H817/REDO01.LOG';

     Alternatively, you can drop the logfile if you have atleast two other   
     logfiles:
     SQL> alter database drop logfile group 1;

     
     --> If the logfile is the Current logfile, then do the following:
     SQL> recover database until cancel;
         
     Type Cancel when prompted

     SQL>alter database open resetlogs;

     
     The 'recover database until cancel' command can fail with the following 
     errors:
     ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error 
     below
     ORA-01194: file 1 needs more recovery to be consistent
     ORA-01110: data file 1: '/ORACLE/ORADATA/H817/SYSTEM01.DBF'

     In this case , restore an old backup of the database files and apply the
     archive logs to perform incomplete recovery.
     --> restore old backup
     SQL> startup mount
     SQL> recover database until cancel using backup controlfile;
     SQL> alter database open resetlogs;


If the database is in noarchivelog mode and if ORA-1547, ORA-1194 and ORA-1110 errors occur, then you would have restore from an old backup and start the database.


Note that all redo log maintenance operations r done in the database mount state


(C) Archive logs
-----------------
If the previous archive log files alone have been lost, then there is not much
to panic.
** Backup the current database files using hot or cold backup which would ensure
that you would not need the missing archive logs

(D) Datafiles
--------------
This obviously is the biggest loss.

(1) If only a few sectors are damaged, then you would get ora-1578 when 
accessing those blocks.
 --> Identify the object name and type whose block is corrupted by querying dba_extents
 --> Based on the object type, perform appropriate recovery
 --> Check metalink Note:28814.1 for resolving this error

(2) If the entire disk is lost, then one or more datafiles may need to be 
recovered . 
  SQL> startup
  ORACLE instance started.
  ...
  Database mounted.
  ORA-01157: cannot identify/lock data file 3 - see DBWR trace file
  ORA-01110: data file 3: '/ORACLE/ORADATA/H817/USERS01.DBF'

Other possible errors are ORA-00376 and ORA-1113

The views and queries to identify the datafiles would be:
   SQL> select file#,name,status from v$datafile;
   SQL> select file#,online,error from v$recover_file;


** If restoring to a replaced disk mounted with the same name, then :
  (1) Restore the affected datafile(s) using OS copy/restore commands from the 
      previous backup
  (2) Perform recovery based on the type of datafile affected namely SYSTEM, 
      ROLLBACK or UNDO, TEMP , DATA or INDEX.
  (3) The recover commands could be 'recover database', 'recover tablespace'
      or 'recover datafile' based on the loss and the database state

** If restoring to a different mount point, then :
  (1) Restore the files to the new location from a previous backup
  (2) SQL> STARTUP MOUNT
  (3) SQL> alter database rename file '/old path_name' to 'new path_name';     
      -- Do this renaming for all datafiles affected. --
  (4) Perform recovery based on the type of datafile affected namely SYSTEM, 
      ROLLBACK or UNDO, TEMP , DATA or INDEX.
  (5) The recover commands could be 'recover database', 'recover tablespace'
      or 'recover datafile' based on the loss and the database state

The detailed steps of recovery based on the datafile lost and the Oracle error 
are outlined in the articles referenced at the end of this note.


  NOARCHIVELOG DATABASE
  =====================
  The loss mentioned in (A),(B) and (D) would be different in this case
  wherever archive logs are involved. 

  We will discuss the datafile loss scenarios here:

  (a) If the datafile lost is a SYSTEM datafile, restore the complete
      database from the previous backup and start the database.
  (b) If the datafile lost is Rollback related datafile with active transactions,
      restore from the previous backup and start the database.
  (c) If the datafile contains rollback with no active rollback segments, you can
      offline the datafile (after commenting the rollback_segments parameter 
      assuming that they are private rollback segments) and open the database. 
  (d) If the datafile is temporary, offline the datafile and open the database. 
      Drop the tablespace and recreate the tablespace.
  (e) If the datafile is DATA or INDEX, 
      **Offline the tablespace and start the database.
      **If you have a previous backup, restore it to a separate location.
      **Then export the objects in the affected tablespace ( using User or 
        table level export).
      **Create the tablespace in the original database.
      **Import the objects exported above.

      If the database is 8i or above, you can also use Transportable tablespace
      feature.


(E) Parameter file
    ---------------
This is not a major loss and can be easily restored. Options are:
  (1) If there is a backup, restore the file
  (2) If there is no backup, copy sample file or create a new file and add the 
      required parameters. Ensure that the parameters db_name, control_files,
       db_block_size, compatible are set correctly
  (3) If the spfile is lost, you can create it from the init parameter file if it is available


(F) Oracle Software Installation
    ----------------------------
There are two ways to recover from this scenario:
  (1) If there is a backup of the Oracle home and Oracle Inventory, restore
      them to the respective directories. Note if you change the Oracle Home, 
      the inventory would not be aware of thid new path and you would not be
      able to apply patchsets. Also restore to the same OS user and group.

  (2) Perform a fresh Install, bringing it to the same patchset level


PRACTICAL SCENARIO
==================

In most cases, when a disk is lost, more than one type of file could be lost.
The recovery in this scenario would be:
  (1) A combination of each of these data loss recovery scenarios
  (2) Perform entire database restore from the more recent backup and apply
      archive logs to perform recovery. This is a highly preferred method 
      but could be time consuming.

If the group is . . .	Then . . .	And you should . . .
Inactive	It is not needed for crash recovery	Clear the archived or unarchived group.
Active	It is needed for crash recovery	Attempt to issue a checkpoint and clear the log; if impossible, then you must restore a backup and perform incomplete recovery up to the most recent available redo log.
Current	It is the log that the database is currently writing to	Attempt to clear the log; if impossible, then you must restore a backup and perform incomplete recovery up to the most recent available redo log.

===============================================================================


Recovering After the Loss of Online Redo Log Files: Scenarios

If a media failure has affected the online redo logs of a database, then the appropriate recovery procedure depends on the following:

The configuration of the online redo log: mirrored or non-mirrored
The type of media failure: temporary or permanent
The types of online redo log files affected by the media failure: current, active, unarchived, or inactive


Table 6-1 displays V$LOG status information that can be crucial in a recovery situation involving online redo logs.

Table 6-1 STATUS Column of V$LOG  

Status Description


UNUSED

The online redo log has never been written to.


CURRENT

The log is active, that is, needed for instance recovery, and it is the log to which Oracle is currently writing. The redo log can be open or closed.


ACTIVE

The log is active, that is, needed for instance recovery, but is not the log to which Oracle is currently writing.It may be in use for block recovery, and may or may not be archived.


CLEARING

The log is being re-created as an empty log after an ALTER DATABASE CLEAR LOGFILE statement. After the log is cleared, then the status changes to UNUSED.


CLEARING_CURRENT

The current log is being cleared of a closed thread. The log can stay in this status if there is some failure in the switch such as an I/O error writing the new log header.


INACTIVE

The log is no longer needed for instance recovery. It may be in use for media recovery, and may or may not be archived.



The following sections describe the appropriate recovery strategies for these situations:

Recovering After Losing a Member of a Multiplexed Online Redo Log Group
Recovering After the Loss of All Members of an Online Redo Log Group


Recovering After Losing a Member of a Multiplexed Online Redo Log Group

If the online redo log of a database is multiplexed, and if at least one member of each online redo log group is not affected by the media failure, then Oracle allows the database to continue functioning as normal. Oracle writes error messages to the LGWR trace file and the alert_SID.log of the database.

Solve the problem by taking one of the following actions:

If the hardware problem is temporary, then correct it. LGWR accesses the previously unavailable online redo log files as if the problem never existed.
If the hardware problem is permanent, then drop the damaged member and add a new member by using the following procedure.




Note:
The newly added member provides no redundancy until the log group is reused.







To replace a damaged member of a redo log group:

Locate the filename of the damaged member in V$LOGFILE. The status is INVALID if the file is inaccessible:SELECT GROUP#, STATUS, MEMBER 
FROM V$LOGFILE
WHERE STATUS='INVALID';

GROUP#    STATUS       MEMBER
-------   -----------  ---------------------
0002      INVALID       /oracle/dbs/log2b.f



Drop the damaged member. For example, to drop member log2b.f from group 2, issue:ALTER DATABASE DROP LOGFILE MEMBER '/oracle/dbs/log2b.f';



Add a new member to the group. For example, to add log2c.f to group 2, issue:ALTER DATABASE ADD LOGFILE MEMBER '/oracle/dbs/log2c.f' TO GROUP 2;



If the file you want to add already exists, then it must be the same size as the other group members, and you must specify REUSE. For example:
ALTER DATABASE ADD LOGFILE MEMBER '/oracle/dbs/log2b.f' REUSE TO GROUP 2;




Recovering After the Loss of All Members of an Online Redo Log Group

If a media failure damages all members of an online redo log group, then different scenarios can occur depending on the type of online redo log group affected by the failure and the archiving mode of the database.

If the damaged log group is inactive, then it is not needed for crash recovery; if it is active, then it is needed for crash recovery.

If the group is . . . Then . . . And you should . . .


Inactive

It is not needed for crash recovery

Clear the archived or unarchived group.


Active

It is needed for crash recovery

Attempt to issue a checkpoint and clear the log; if impossible, then you must restore a backup and perform incomplete recovery up to the most recent available log.


Current

It is the log that Oracle is currently writing to

Attempt to clear the log; if impossible, then you must restore a backup and perform incomplete recovery up to the most recent available log.



Your first task is to determine whether the damaged group is active or inactive.

To determine whether the damaged groups are active:

Locate the filename of the lost redo log in V$LOGFILE and then look for the group number corresponding to it. For example, enter:SELECT GROUP#, STATUS, MEMBER FROM V$LOGFILE;

GROUP#    STATUS       MEMBER
-------   -----------  ---------------------
0001                    /oracle/dbs/log1a.f
0001                    /oracle/dbs/log1b.f
0002      INVALID       /oracle/dbs/log2a.f
0002      INVALID       /oracle/dbs/log2b.f
0003                    /oracle/dbs/log3a.f
0003                    /oracle/dbs/log3b.f



Determine which groups are active. For example, enter:SELECT GROUP#, MEMBERS, STATUS, ARCHIVED FROM V$LOG;

GROUP#  MEMBERS           STATUS     ARCHIVED
------  -------           ---------  -----------
 0001   2                 INACTIVE   YES
 0002   2                 ACTIVE     NO
 0003   2                 CURRENT    NO



If the affected group is inactive, follow the procedure in "Losing an Inactive Online Redo Log Group". If the affected group is active (as in the preceding example), then follow the procedure in "Losing an Active Online Redo Log Group".


Losing an Inactive Online Redo Log Group

If all members of an online redo log group with INACTIVE status are damaged, then the procedure depends on whether you can fix the media problem that damaged the inactive redo log group.

If the failure is . . . Then . . .


Temporary

Fix the problem. LGWR can reuse the redo log group when required.


Permanent

The damaged inactive online redo log group eventually halts normal database operation. Reinitialize the damaged group manually by issuing the ALTER DATABASE CLEAR LOGFILE statement as described in this section.



You can clear an active redo log group when the database is open or closed. The procedure depends on whether the damaged group has been archived.

To clear an inactive, online redo log group that has been archived:

If the database is shut down, then start a new instance and mount the database:STARTUP MOUNT



Reinitialize the damaged log group. For example, to clear redo log group 2, issue the following statement:ALTER DATABASE CLEAR LOGFILE GROUP 2;





To clear an inactive, online redo log group that has not been archived:

Clearing an unarchived log allows it to be reused without archiving it. This action makes backups unusable if they were started before the last change in the log, unless the file was taken offline prior to the first change in the log. Hence, if you need the cleared log file for recovery of a backup, then you cannot recover that backup. Also, it prevents complete recovery from backups due to the missing log.

If the database is shut down, then start a new instance and mount the database:STARTUP MOUNT



Clear the log using the UNARCHIVED keyword. For example, to clear log group 2, issue:ALTER DATABASE CLEAR LOGFILE UNARCHIVED GROUP 2;



If there is an offline datafile that requires the cleared unarchived log to bring it online, then the keywords UNRECOVERABLE DATAFILE are required. The datafile and its entire tablespace have to be dropped because the redo necessary to bring it online is being cleared, and there is no copy of it. For example, enter:
ALTER DATABASE CLEAR LOGFILE UNARCHIVED GROUP 2 UNRECOVERABLE DATAFILE;



Immediately back up the database with an operating system utility as described in "Making User-Managed Backups of the Whole Database". Now you can use this backup for complete recovery without relying on the cleared log group. For example, enter:% cp /disk1/oracle/dbs/*.f /disk2/backup



Back up the database's control file using the ALTER DATABASE statement as described in "Backing Up the Control File to a Binary File". For example, enter:ALTER DATABASE BACKUP CONTROLFILE TO '/oracle/dbs/cf_backup.f';




Failure of CLEAR LOGFILE Operation

The ALTER DATABASE CLEAR LOGFILE statement can fail with an I/O error due to media failure when it is not possible to:

Relocate the redo log file onto alternative media by re-creating it under the currently configured redo log filename
Reuse the currently configured log filename to re-create the redo log file because the name itself is invalid or unusable (for example, due to media failure)


In these cases, the ALTER DATABASE CLEAR LOGFILE statement (before receiving the I/O error) would have successfully informed the control file that the log was being cleared and did not require archiving. The I/O error occurred at the step in which the CLEAR LOGFILE statement attempts to create the new redo log file and write zeros to it. This fact is reflected in V$LOG.CLEARING_CURRENT.

Losing an Active Online Redo Log Group

If the database is still running and the lost active log is not the current log, then issue the ALTER SYSTEM CHECKPOINT statement. If successful, then the active log is rendered inactive, and you can follow the procedure in"Losing an Inactive Online Redo Log Group". If unsuccessful, or if your database has halted, then perform one of procedures in this section, depending on the archiving mode.

Note that the current log is the one LGWR is currently writing to. If a LGWR I/O fails, then LGWR terminates and the instance crashes. In this case, you must restore a backup, perform incomplete recovery, and open the database with the RESETLOGS option.

To recover from loss of an active online redo log group in NOARCHIVELOG mode:

If the media failure is temporary, then correct the problem so that Oracle can reuse the group when required.
Restore the database from a consistent, whole database backup (datafiles and control files) as described in "Restoring Datafiles". For example, enter:% cp /disk2/backup/*.f /disk1/oracle/dbs



Mount the database:STARTUP MOUNT



Because online redo logs are not backed up, you cannot restore them with the datafiles and control files. In order to allow Oracle to reset the online redo logs, you must first mimic incomplete recovery:RECOVER DATABASE UNTIL CANCEL
CANCEL



Open the database using the RESETLOGS option:ALTER DATABASE OPEN RESETLOGS;



Shut down the database consistently. For example, enter:SHUTDOWN IMMEDIATE



 Make a whole database backup as described in "Making User-Managed Backups of the Whole Database". For example, enter:% cp /disk1/oracle/dbs/*.f /disk2/backup





To recover from loss of an active online redo log group in ARCHIVELOG mode:

If the media failure is temporary, then correct the problem so that Oracle can reuse the group when required. If the media failure is not temporary, then use the following procedure.

Begin incomplete media recovery. Use the procedure given in "Performing Incomplete User-Managed Media Recovery", recovering up through the log before the damaged log.
Ensure that the current name of the lost redo log can be used for a newly created file. If not, then rename the members of the damaged online redo log group to a new location. For example, enter:ALTER DATABASE RENAME FILE "/oracle/dbs/log_1.rdo" TO "/temp/log_1.rdo";
ALTER DATABASE RENAME FILE "/oracle/dbs/log_2.rdo" TO "/temp/log_2.rdo";



Open the database using the RESETLOGS option:ALTER DATABASE OPEN RESETLOGS;




Note:
All updates executed from the endpoint of the incomplete recovery to the present must be re-executed.







Loss of Multiple Redo Log Groups

If you have lost multiple groups of the online redo log, then use the recovery method for the most difficult log to recover. The order of difficulty, from most difficult to least difficult, follows:

The current online redo log
An active online redo log
An unarchived online redo log
An inactive online redo log


Recovering After the Loss of Archived Redo Log Files: Scenario

If the database is operating in ARCHIVELOG mode, and if the only copy of an archived redo log file is damaged, then the damaged file does not affect the present operation of the database. The following situations can arise, however, depending on when the redo log was written and when you backed up the datafile.

If you backed up . . . Then . . .


All datafiles after the filled online redo log group (which is now archived) was written

The archived version of the filled online redo log group is not required for complete media recovery operation.


A specific datafile before the filled online redo log group was written

If the corresponding datafile is damaged by a permanent media failure, use the most recent backup of the damaged datafile and perform incomplete recovery up to the damaged log.





Caution:
If you know that an archived redo log group has been damaged, immediately back up all datafiles so that you will have a whole database backup that does not require the damaged archived redo log.



==============================================================

Oracle DBA Information at one place

Search This Blog

Recover database after disk loss (Including Online redo logs)

Recovering After the Loss of All Members of an Online Redo Log Group

Recovering After the Loss of Online Redo Log Files: Scenarios

Table 6-1 STATUS Column of V$LOG

Recovering After Losing a Member of a Multiplexed Online Redo Log Group

Recovering After the Loss of All Members of an Online Redo Log Group

Losing an Inactive Online Redo Log Group

Failure of CLEAR LOGFILE Operation

Losing an Active Online Redo Log Group

Loss of Multiple Redo Log Groups

Recovering After the Loss of Archived Redo Log Files: Scenario

Comments

Post a Comment

Status	Description
`UNUSED`	The online redo log has never been written to.
`CURRENT`	The log is active, that is, needed for instance recovery, and it is the log to which Oracle is currently writing. The redo log can be open or closed.
`ACTIVE`	The log is active, that is, needed for instance recovery, but is not the log to which Oracle is currently writing.It may be in use for block recovery, and may or may not be archived.
`CLEARING`	The log is being re-created as an empty log after an `ALTER` `DATABASE` `CLEAR` `LOGFILE` statement. After the log is cleared, then the status changes to `UNUSED`.
`CLEARING_CURRENT`	The current log is being cleared of a closed thread. The log can stay in this status if there is some failure in the switch such as an I/O error writing the new log header.
`INACTIVE`	The log is no longer needed for instance recovery. It may be in use for media recovery, and may or may not be archived.

If the failure is . . .	Then . . .
Temporary	Fix the problem. LGWR can reuse the redo log group when required.
Permanent	The damaged inactive online redo log group eventually halts normal database operation. Reinitialize the damaged group manually by issuing the `ALTER` `DATABASE` `CLEAR` `LOGFILE` statement as described in this section.

If you backed up . . .	Then . . .
All datafiles after the filled online redo log group (which is now archived) was written	The archived version of the filled online redo log group is not required for complete media recovery operation.
A specific datafile before the filled online redo log group was written	If the corresponding datafile is damaged by a permanent media failure, use the most recent backup of the damaged datafile and perform incomplete recovery up to the damaged log.