Diagnosing Connection Problems with an enabled Data Guard Broker Configuration (Doc ID 745201.1)



APPLIES TO:

Oracle Database - Enterprise Edition - Version 9.2.0.1 to 12.1.0.1 [Release 9.2 to 12.1]
Information in this document applies to any platform.
Data Guard Broker 
Network


PURPOSE

Problems with the Data Guard Broker typically show up in one of three ways: the DMON-Processes of the Primary and Standby Databases cannot reach each other, Log Transport fails, or DGMGRL Commands such as 'enable configuration' fail. This Guide explains how the Data Guard Broker connects to the Databases and how to troubleshoot these Connections.


TROUBLESHOOTING STEPS

Data Guard Broker Connection Model

When you start the Data Guard Broker (DMON)-Process on your Primary and Standby Database, it starts and registers the following Services with the Local Listener:
<db_unique_name>_DGB.<db_domain>: This Service is used by the DMON-Processes to communicate with each other

<db_unique_name>_XPT.<db_domain>: This Service is used for Log Transport Services and FAL (the corresponding Initialization Parameters are set once a Configuration is enabled) - (Oracle 10g only)

Starting with Oracle 11.1.0.x you can use your own Service and/or TNS-Alias for Log Transport Services and FAL (the corresponding Initialization Parameters are set once a Configuration is enabled). This Service is specified by the Data Guard Broker Property 'DGConnectIdentifier'. If you specify a TNS-Alias here, you have to ensure this Alias is configured in every TNSNAMES.ORA in the Data Guard Configuration.
If you want to use DGMGRL for Operations that require a Restart of a Database (like Switchover), you also have to register an additional Static Service with the Local Listener:
<db_unique_name>_DGMGRL.<db_domain>: This Service is used by the Data Guard Broker to connect to a Database while it is shut down
Starting with Oracle 11.2.0.x you can also set up and use your own Static Listener Entry. You then have to set the Data Guard Broker Property 'StaticConnectIdentifier' to this Entry. The default Value still points to the _DGMGRL-Entry.
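
As a quick Reference, the Service Names described above can be derived from db_unique_name and db_domain. The following Python Sketch assumes the usual Naming Pattern <db_unique_name>_<SUFFIX>.<db_domain>; the Function Name and the Example Values 'chicago' and 'example.com' are hypothetical and not part of any Oracle Tool.

```python
def broker_service_names(db_unique_name, db_domain="", include_xpt=False):
    """Return the listener service names the Broker relies on.

    Assumes the pattern <db_unique_name>_<SUFFIX>[.<db_domain>].
    include_xpt=True adds the _XPT service, which applies to 10g only.
    """
    suffixes = ["DGB", "DGMGRL"]
    if include_xpt:
        suffixes.insert(1, "XPT")
    domain = f".{db_domain}" if db_domain else ""
    return {suffix: f"{db_unique_name}_{suffix}{domain}" for suffix in suffixes}


if __name__ == "__main__":
    # 'chicago' and 'example.com' are made-up example values.
    for purpose, name in broker_service_names("chicago", "example.com").items():
        print(f"{purpose:7} {name}")
```

The resulting Names are the ones to look for in the Listener Output when following the Checks in the Troubleshooting Section.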
Troubleshooting:

a) Problems with the Log Transport Services/FAL:
  • Review the ALERT.LOG and DRC.LOG-Files for the exact Error
  • Verify that <db_unique_name>_XPT.<db_domain> is registered with the Local Listener (10g only)
  • Verify that the Initialization Parameter 'local_listener' and the Data Guard Broker Property 'LocalListenerAddress' are set and point to the correct Listener (10g only)
  • Verify that you have not set up additional, duplicate Log Transport Services
  • Ensure the TNS-Alias, if one is specified in 'DGConnectIdentifier', is set up in every TNSNAMES.ORA in the Data Guard Environment (11g onward)
  • Verify that the Service specified in 'DGConnectIdentifier' exists and is registered with the corresponding Listener
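
The Check that a TNS-Alias exists in every TNSNAMES.ORA can be scripted. The following Python Sketch is a simplified Parser, not Oracle's own Name Resolution: it assumes an Alias Definition starts outside of any Brackets and it treats Parameters such as IFILE like an Alias. All Function Names, Host Labels and Aliases are hypothetical.

```python
def tns_aliases(tnsnames_text):
    """Collect alias names defined at the top level of a tnsnames.ora text."""
    aliases = set()
    depth = 0  # current bracket nesting level
    for raw in tnsnames_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        if depth == 0 and "=" in line and not line.startswith("("):
            # "ALIAS1, ALIAS2 = (DESCRIPTION=...)" defines both names.
            names = line.split("=", 1)[0]
            aliases.update(n.strip().upper() for n in names.split(",") if n.strip())
        depth += line.count("(") - line.count(")")
    return aliases


def files_missing_alias(alias, files):
    """Return the labels of all tnsnames.ora contents that lack the alias.

    files maps a label (for example a host name) to the file content.
    """
    wanted = alias.strip().upper()
    return [label for label, text in files.items() if wanted not in tns_aliases(text)]


if __name__ == "__main__":
    # Illustrative file contents; real files would be read from each host.
    files = {
        "host1": "BOSTON =\n  (DESCRIPTION =\n    (ADDRESS = (PROTOCOL = TCP)(HOST = host2)(PORT = 1521))\n"
                 "    (CONNECT_DATA = (SERVICE_NAME = boston))\n  )\n",
        "host2": "CHICAGO = (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=host1)(PORT=1521))"
                 "(CONNECT_DATA=(SERVICE_NAME=chicago)))\n",
    }
    print(files_missing_alias("boston", files))
```

The Comparison is case-insensitive, which matches how Aliases are normally used; a Host Label in the Result means the Alias has to be added to the TNSNAMES.ORA on that Host.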
b) DMON-Communication
  • Review the DRC.LOG from all Instances for any Issues
  • Verify that <db_unique_name>_DGB.<db_domain> is registered with the Local Listener
  • Verify that the Initialization Parameter 'local_listener' and the Data Guard Broker Property 'LocalListenerAddress' are set and point to the correct Listener (10g only)
  • If all of this is fine and you still cannot determine the Cause, set
event = '16634 trace name context forever, level 1'
and restart the Data Guard Broker Process. This records all Connect Strings used by the Data Guard Broker in the DRC.LOG-Files, so you can see exactly which Connect String the Data Guard Broker is actually using (10g)
  • Set the Data Guard Broker Configuration Property 'TraceLevel' to 'SUPPORT' to gather DMON Tracing (11.2.0.3 onward)
  -> It replaces the Event 16634 and can be set dynamically for the Configuration. Since it is Configuration-specific, it gathers Tracing Information from all Databases. The default Value is 'USER'. The Tracing Information is still written to the DRC.LOG.
To enable Tracing, run:
DGMGRL> edit configuration set property 'TraceLevel' = 'SUPPORT';
   -> Tracing is enabled immediately; no Restart of the Data Guard Broker Processes is required.
To disable Tracing again once the required Information has been gathered, reset the Property to its default Value:
DGMGRL> edit configuration set property 'TraceLevel' = 'USER';
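
The Listener Registration Checks from a) and b) can be run against the saved Output of 'lsnrctl status'. The following Python Sketch assumes the Output contains Lines of the Form Service "<name>" has <n> instance(s). as printed in the Services Summary; the Sample Output and the Service Names are illustrative only.

```python
import re

# Matches the per-service header line of the listener's services summary.
SERVICE_LINE = re.compile(r'^Service "([^"]+)" has \d+ instance\(s\)\.', re.MULTILINE)


def registered_services(lsnrctl_output):
    """Return the lower-cased service names found in the listener output."""
    return {name.lower() for name in SERVICE_LINE.findall(lsnrctl_output)}


def missing_services(lsnrctl_output, expected):
    """Return the expected service names that the listener does not report."""
    have = registered_services(lsnrctl_output)
    return [name for name in expected if name.lower() not in have]


if __name__ == "__main__":
    # Illustrative output for a hypothetical database 'chicago'.
    sample = (
        'Services Summary...\n'
        'Service "chicago" has 1 instance(s).\n'
        '  Instance "chicago", status READY, has 1 handler(s) for this service...\n'
        'Service "chicago_DGB" has 1 instance(s).\n'
        '  Instance "chicago", status READY, has 1 handler(s) for this service...\n'
    )
    print(missing_services(sample, ["chicago_DGB", "chicago_DGMGRL"]))
```

A Name in the Result is not known to that Listener; for a Static Service such as the _DGMGRL-Entry this usually points to a missing Entry in the Listener Configuration.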



======================================================


Troubleshooting - Heartbeat failed to connect to standby (Doc ID 1432367.1)


APPLIES TO:

Oracle Database - Enterprise Edition - Version 9.2.0.1 and later
Information in this document applies to any platform.
***Checked for relevance on 20-Aug-2013*** 

PURPOSE

The Purpose of this Document is to troubleshoot the generic Error

Heartbeat failed to connect to standby

in the Primary ALERT.LOG. It lists the most common Causes together with their Solutions.

TROUBLESHOOTING STEPS

Introduction

Once Log Transport Services from a Primary to a Standby Database are set up correctly and the Archive Destination is enabled and active, there is a Heartbeat-Ping between the Primary and the Standby Database. This Ping is performed by a dedicated ARCn-Process on the Primary Database against an associated RFS-Process on the Standby. You can identify this Process by searching the ALERT.LOG of the Primary Database for an Entry like this:
ARC1 started with pid=21, OS id=6585
...
ARC1: Becoming the heartbeat ARCH
-> In this Case we can find that ARC1 with pid 21 (OS-pid 6585) is the current Heartbeat ARCn-Process.

This Process tries to ping an associated RFS-Process on the Standby Database. If you look into the Standby ALERT.LOG you can find Entries like this:
RFS[2]: Assigned to RFS process 6621
RFS[2]: Identified database type as 'physical standby': Client is ARCH pid 6585
-> So here RFS-Process with OS-pid 6621 is the corresponding RFS-Process for the Heartbeat Ping

Note that this can be quite dynamic since RFS-Processes are created and terminated as needed.
If the Primary ARCn-Process is not able to reach a corresponding RFS-Process, this Error is raised in the Primary ALERT.LOG together with the corresponding Reason.
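
Finding the current Heartbeat ARCn-Process can be automated by scanning the Primary ALERT.LOG for the two Entries shown above. The following Python Sketch relies only on those two Line Formats; the Function Name is hypothetical.

```python
import re

STARTED = re.compile(r"^(ARC\w+) started with pid=(\d+), OS id=(\d+)")
HEARTBEAT = re.compile(r"^(ARC\w+): Becoming the heartbeat ARCH")


def heartbeat_arch(alert_log_lines):
    """Return (name, pid, os_pid) of the most recent heartbeat ARCn process.

    Returns None if no heartbeat entry is found. pid and os_pid are None
    when the matching 'started' entry is not part of the given lines.
    """
    started = {}
    current = None
    for raw in alert_log_lines:
        line = raw.strip()
        match = STARTED.match(line)
        if match:
            started[match.group(1)] = (int(match.group(2)), int(match.group(3)))
            continue
        match = HEARTBEAT.match(line)
        if match:
            name = match.group(1)
            pid, os_pid = started.get(name, (None, None))
            current = (name, pid, os_pid)  # later entries override earlier ones
    return current


if __name__ == "__main__":
    lines = [
        "ARC1 started with pid=21, OS id=6585",
        "...",
        "ARC1: Becoming the heartbeat ARCH",
    ]
    print(heartbeat_arch(lines))  # ('ARC1', 21, 6585)
```

The OS-pid returned here is the one to look for as 'Client is ARCH pid' in the Standby ALERT.LOG.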


Diagnosing Problems:

If the Heartbeat Ping fails, you typically get an Error Code along with this Message. The following Section discusses common Errors and how to solve them.

The Errors in general look like this:

PING[ARCn]: Heartbeat failed to connect to standby ''. Error is xxxxx
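
To collect these Messages from a Primary ALERT.LOG, the Line can be parsed with a Regular Expression. The following Python Sketch assumes the Message Format shown above with a numeric Error Code; the Standby Name 'boston' in the Example is made up.

```python
import re

PING = re.compile(
    r"PING\[(?P<process>ARC\w+)\]: Heartbeat failed to connect to standby "
    r"'(?P<standby>[^']*)'\. Error is (?P<code>\d+)"
)


def parse_heartbeat_failure(line):
    """Extract process, standby and error code from a heartbeat failure line.

    Returns None if the line is not a heartbeat failure message.
    """
    match = PING.search(line)
    if not match:
        return None
    code = int(match.group("code"))
    return {
        "process": match.group("process"),
        "standby": match.group("standby"),
        "code": code,
        "ora": f"ORA-{code:05d}",  # zero-padded form, e.g. ORA-03135
    }


if __name__ == "__main__":
    sample = "PING[ARC1]: Heartbeat failed to connect to standby 'boston'. Error is 12541."
    print(parse_heartbeat_failure(sample))
```

The numeric Code is the Key for looking up the matching Entry in the List of common Errors.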

General Points to verify first:
  • The Standby Database must at least be mounted to create an RFS-Process and make the Heartbeat Ping happen. So first of all, always verify that the Standby Database is at least in 'mount'-Status and has registered its Services with the correct Listener on the Standby Site.
  • Verify that you are able to connect to the Standby Database from the Primary Site using the same Connect Descriptor or TNS-Alias used in the corresponding log_archive_dest_n:
SQL> connect sys/<password>@<standby> as sysdba
-> If this succeeds, the Problem is specific to the Database; if you get the same Error here, there is a general Problem (mostly with Setup or Configuration)
  • ARCn- and other Database Processes read the TNSNAMES.ORA only when they are started or when Log Transport Services are restarted. So if you modify the Alias for the Standby Database in the TNSNAMES.ORA, the ARCn-Processes do not become aware of this Change unless they are restarted or you stop and restart Log Transport Services, i.e. set the corresponding log_archive_dest_state_n to 'defer' and back to 'enable'.
  • As an Alternative, you can put the full Connect Descriptor directly into log_archive_dest_n. This avoids having to restart the ARCn-Processes and having to make sure the correct TNSNAMES.ORA is used

Here are common Errors and Solutions:
  • ORA-12514
The Standby Database is down or the Service you want to connect to is not registered with the Listener on the Standby Site.
- Verify the Standby Database is at least in mount-Status
- Ensure the Service used to connect by Log Transport Services is registered with the correct Listener
- Review the TNSNAMES.ORA and ensure the Connection Details (Host, Protocol and Port) are correct
  • ORA-12541
The Log Transport Services cannot detect a Listener on the Destination
- Ensure the Listener is running
- Review the TNSNAMES.ORA and ensure the Connection Details (Host, Protocol and Port) are correct
- Verify in the /etc/hosts-File that the local Node where the Listener is running is defined with its real IP-Address rather than the localhost Address (127.0.0.1) and that this matches the Listener IP-Address or Hostname Resolution
  • ORA-12154
The given Connect Descriptor used for Log Transport Services cannot be resolved
- Ensure the TNS-Alias used in log_archive_dest_n exists in the TNSNAMES.ORA and is valid (Spelling, Brackets,...)
- Try to manually connect to the Standby Database using the same TNS-Alias
- Check whether you modified the TNSNAMES.ORA after the ARCn- and LNS-Processes of the Database were started; they may not be aware of the Change. In this Case you may have to kill those Processes so that they get respawned
  • ORA-3135
The Communication between the ARCn and the RFS-Process died unexpectedly.
- Typical Causes are active Firewalls or Routers in the Network Connection between the Primary and Standby Database. Ensure there are no Features enabled on this Equipment which are able to modify TCP-Packets
- The Network is overloaded; ensure there is always sufficient Bandwidth available to cope with the current Redo Generation Rate. For the Calculation also see
How To Calculate The Required Network Bandwidth Transfer Of Archivelogs In Dataguard Environments (Note 736755.1)
  • ORA-16191/ORA-1017
The Log Transport Services cannot authenticate on the Standby Database
- Ensure the Passwordfile has been transferred to the Standby Site correctly
- Ensure REMOTE_LOGIN_PASSWORDFILE is set up correctly
- For further Troubleshooting review
Troubleshooting ORA-16191 and ORA-1017 in Data Guard Log Transport Services (Note 1368170.1)
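
The List of common Errors above can be kept as a small Lookup Table, for Example to annotate Heartbeat Failures found in the ALERT.LOG. The following Python Sketch only restates the Checks from this Note in shortened Form; the Names CHECKS and checks_for are hypothetical.

```python
AUTH_CHECKS = [
    "Ensure the Passwordfile has been transferred to the Standby Site correctly",
    "Verify the REMOTE_LOGIN_PASSWORDFILE setting",
    "Review Note 1368170.1",
]

CHECKS = {
    12514: [
        "Verify the Standby Database is at least in mount-Status",
        "Ensure the Service is registered with the correct Listener",
        "Review Host, Protocol and Port in the TNSNAMES.ORA",
    ],
    12541: [
        "Ensure the Listener is running",
        "Review Host, Protocol and Port in the TNSNAMES.ORA",
        "Verify /etc/hosts uses the real IP-Address, not 127.0.0.1",
    ],
    12154: [
        "Ensure the TNS-Alias exists in the TNSNAMES.ORA and is valid",
        "Try to connect manually using the same TNS-Alias",
        "Check whether the TNSNAMES.ORA changed after the ARCn/LNS-Processes started",
    ],
    3135: [
        "Check Firewalls and Routers that may modify TCP-Packets",
        "Ensure sufficient Network Bandwidth (Note 736755.1)",
    ],
    16191: AUTH_CHECKS,
    1017: AUTH_CHECKS,
}


def checks_for(error):
    """Return the checks for an error given as 12541, '12541' or 'ORA-12541'."""
    code = int(str(error).upper().replace("ORA-", ""))
    return CHECKS.get(code, ["Not covered here; review the ALERT.LOG for Details"])


if __name__ == "__main__":
    for step in checks_for("ORA-03135"):
        print("-", step)
```

Error Codes that are not Part of this Note fall back to a generic Hint instead of raising an Exception.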


REFERENCES

NOTE:1269749.1 - RFS-process on physical standby database fails with Ora-00600:[Kcrrrfswda.9], [4], [368], [], [], [], [], []
NOTE:736755.1 - How To Calculate The Required Network Bandwidth Transfer Of Redo In Data Guard Environments
NOTE:1368170.1 - Troubleshooting ORA-16191 and ORA-1017/ORA-1031 in Data Guard Log Transport Services or Data Guard Broker
NOTE:799353.1 - How to Resolve Error in Remote Archiving
NOTE:1130523.1 - Logs are not shipped to the physical standby database

