Top 4 Reasons for Node Reboot or Node Eviction in Real Application Cluster (RAC) Environment:
Whenever, Database Administrator face Node Reboot issue, First thing to look at should be/var/log/message and OS Watcher logs of the Database Node which was rebooted.
var/log/messages will give you an actual picture of reboot:- Exact time of restart, status of resourcelike swap and RAM etc.
So, Every time you see a node eviction start investigation with /var/log/messages and Analyze OS Watcher logs. Below is a situation when a Database Node was reboot due to high load.
/var/log/messages output from a Database Node just before Node eviction:
Here is a scenario for voting disk not reachable:
A. Connection to the voting disk is interrupted.
B. if only one voting disk is in use and version is less than 11.2.0.3.4, hitting known bug 13869978.
How to Solve Voting Disk Outage ?
There could be many reasons for voting disk is not reachable, Here are few general approach for DBA to follow.
3. Missed Network Connection between Nodes: In technical term this is called asMissed Network Heartbeat (NHB). Whenever there is communication gap or no communication between nodes on private network (interconnect) due to network outage or some other reason. A node abort itself to avoid "split brain" situation. The most common (but not exclusive) cause of missed NHB is network problems communicating over the private interconnect.
Suggestion to troubleshoot Missed Network Heartbeat.
The Cluster Ready Services Daemon (crsd) Log Files
CRS home/log/hostname/crsd
Event Manager (EVM) Log Files
Event Manager (EVM) information generated by
evmd
is recorded in log files in the following locations:CRS Home/log/hostname/evmd
RACG Log Files
The Oracle RAC high availability trace files are located in the following two locations:
CRS home/log/hostname/racg
$ORACLE_HOME/log/hostname/racg
- $ORA_CRS_HOME/crs/log Contains trace files for the CRS resources.
- $ORA_CRS_HOME/crs/init Contains trace files of the CRS daemon during startup. Good place to start with any CRS login problems.
- $ORA_CRS_HOME/css/log The Cluster Synchronization (CSS) logs indicate all actions such as reconfigurations, missed check-ins, connects, and disconnects from the client CSS listener. In some cases, the logger logs messages with the category of auth.crit for the reboots done by Oracle. This could be used for checking the exact time when the reboot occurred.
- $ORA_CRS_HOME/css/init Contains core dumps from the Oracle Cluster Synchronization Service daemon (OCSSd) and the process ID (PID) for the CSS daemon whose death is treated as fatal. If abnormal restarts for CSS exist, the core files will have the format of core.
. - $ORA_CRS_HOME/evm/log Log files for the Event Volume Manager (EVM) and evmlogger daemons. Not used as often for debugging as the CRS and CSS directories.
- $ORA_CRS_HOME/evm/init PID and lock files for EVM. Core files for EVM should also be written here.
- $ORA_CRS_HOME/srvm/log Log files for Oracle Cluster Registry (OCR), which contains the details at the Oracle cluster level.
- $ORA_CRS_HOME//log Log files for Oracle Clusterware (known as the cluster alert log), which contains diagnostic messages at the Oracle cluster level. This is available from Oracle database 10g R2
var/log/messages will give you an actual picture of reboot:- Exact time of restart, status of resourcelike swap and RAM etc.
1. High Load on Database Server: Out of 100 Issues, I have seen 70 to 80 time High load on the system was reason for Node Evictions, One common scenario is due to high load RAM and SWAP space of DB node got exhaust and system stops working and finally reboot.
So, Every time you see a node eviction start investigation with /var/log/messages and Analyze OS Watcher logs. Below is a situation when a Database Node was reboot due to high load.
/var/log/messages output from a Database Node just before Node eviction:
Apr 23 08:15:04 remotedb06 kernel: Node 0 DMA: 2*4kB 1*8kB 0*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15792kB Apr 23 08:15:04 remotedb06 kernel: Node 0 DMA32: 150*4kB 277*8kB 229*16kB 186*32kB 51*64kB 65*128kB 82*256kB 13*512kB 3*1024kB 3*2048kB 78*4096kB = 380368kB Apr 23 08:15:04 remotedb06 kernel: Node 0 Normal: 12362*4kB 58*8kB 0*16kB 0*32kB 0*64kB 0*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 52984kB Apr 23 08:15:09 remotedb06 kernel: 83907 total pagecache pages Apr 23 08:15:11 remotedb06 kernel: 39826 pages in swap cache Apr 23 08:15:11 remotedb06 kernel: Swap cache stats: add 30820220, delete 30780387, find 18044378/19694662 Apr 23 08:15:14 remotedb06 kernel: Free swap = 4kB Apr 23 08:15:15 remotedb06 kernel: Total swap = 25165816kB Apr 23 08:15:16 remotedb06 kernel: 25165808 pages RAM Apr 23 08:15:28 remotedb06 kernel: 400673 pages reserved Apr 23 08:15:30 remotedb06 kernel: 77691135 pages shared Apr 23 08:15:31 remotedb06 kernel: 9226743 pages non-shared Apr 23 08:15:33 remotedb06 kernel: osysmond.bin: page allocation failure. order:4, mode:0xd0
From above message, we can see that this system has only 4kB free swap out of 24G swap space. This means system neither has RAM not SWAP for processing, which case a reboot. This picture is also clear from OS Watcher of system.
08:15:29 CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
08:15:29 all 0.67 0.00 82.60 16.60 0.00 0.06 0.00 0.07 42521.98
08:15:35 all 0.84 0.00 23.40 73.19 0.00 0.11 0.00 2.47 35859.59
08:15:39 all 1.22 0.00 85.13 13.47 0.00 0.13 0.00 0.04 40569.31
08:15:45 all 1.57 0.00 98.22 0.13 0.00 0.08 0.00 0.00 36584.31
08:15:50 all 1.41 0.00 98.48 0.04 0.00 0.07 0.00 0.00 36643.10
08:15:54 all 0.84 0.00 99.09 0.03 0.00 0.05 0.00 0.00 36257.02
08:16:06 all 0.95 0.00 98.88 0.09 0.00 0.08 0.00 0.00 39113.15
08:16:11 all 0.87 0.00 99.00 0.07 0.00 0.06 0.00 0.00 37490.22
08:16:16 all 0.89 0.00 98.97 0.07 0.00 0.07 0.00 0.00 37681.04
08:16:22 all 0.78 0.00 99.12 0.05 0.00 0.05 0.00 0.00 36963.75
08:16:38 all 0.79 0.00 98.86 0.28 0.00 0.08 0.00 0.00 36639.21
08:16:43 all 0.78 0.00 98.79 0.34 0.00 0.08 0.00 0.01 37405.99
08:16:54 all 1.06 0.00 98.71 0.12 0.00 0.11 0.00 0.00 38102.37
08:17:08 all 1.69 0.00 67.02 30.93 0.00 0.06 0.00 0.29 37316.55
How to avoid Node Reboot due to High Load ?
The simple and best way to avoid this is use Oracle Database Resource Manager (DBRM). DBRM help to resolve this by allowing the database to have more control over how hardware resources and their allocation. DBA should setup Resource consumer group and Resource plan and should use them as per requirements. In Exadata system Exadata DBA can use IORM to setup resourceallocation among multiple database instances.
2. Voting Disk not Reachable: One of the another reason for Node Reboot is clusterware is not able to access a minimum number of the voting files.When the node aborts for this reason, the node alert log will show CRS-1606 error.
Here is a scenario for voting disk not reachable:
2013-01-26 10:15:47.177 [cssd(3743)]CRS-1606:The number of voting files available, 1, is less than the minimum number of voting files required, 2, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /u01/app/11.2.0/grid/log/apdbc76n1/cssd/ocssd.logThere could be two reasons for this issue:
A. Connection to the voting disk is interrupted.
B. if only one voting disk is in use and version is less than 11.2.0.3.4, hitting known bug 13869978.
How to Solve Voting Disk Outage ?
There could be many reasons for voting disk is not reachable, Here are few general approach for DBA to follow.
1. Use command "crsctl query css votedisk" on a node where clusterware is up to get a list of all the voting files.
2. Check that each node can access the devices underlying each voting file.
3. Check for permissions to each voting file/disk have not been changed.
4. Check OS, SAN, and storage logs for any errors from the time of the incident.
5. Apply fix for 13869978 if only one voting disk is in use. This is fixed in 11.2.0.3.4 patch set and above, and 11.2.0.4 and above
If any voting files or underlying devices are not currently accessible from any node, work with storage administrator and/or system administrator to resolve it at storage and/or OS level.
Suggestion to troubleshoot Missed Network Heartbeat.
1. Check OS statistics from the evicted node from the time of the eviction. DBA can use OS Watcher to look at OS Stats at time of issue, check oswnetstat and oswprvtnet for network related issues.
2. Validate the interconnect network setup with the Help of Network administrator.
3. Check communication over the private network.
4. Check that the OS network settings are correct by running the RACcheck tool.
4. Database Or ASM Instance Hang: Sometimes Database or ASM instance hang can cause Node reboot. In these case Database instance is hang and is terminated afterwards, which cause either reboot cluster or Node eviction. DBA should check alert log of Database and ASM instance for any hang situation which might cause this issue.
Database Alert log file entry for Database Hang Situation:
Wed Jun 05 03:25:01 2013 Sweep [inc][122970]: completed System State dumped to trace file /u01/app/oracle/diag/rdbms/remotedb/remotedb2/incident/incdir_122970/remotedb2_m005_118200_i122970_a.trc Wed Jun 05 03:25:33 2013 DSKM process appears to be hung. Initiating system state dump. Wed Jun 05 03:25:33 2013 System state dump requested by (instance=2, osid=9757 (GEN0)), summary=[system state dump request (ksz_check_ds)]. System State dumped to trace file /u01/app/oracle/diag/rdbms/remotedb/remotedb2/trace/remotedb2_diag_9759.trc Wed Jun 05 03:27:04 2013 Errors in file /u01/app/oracle/diag/rdbms/remotedb/remotedb2/trace/remotedb2_pmon_9707.trc (incident=122897): ORA-00600: internal error code, arguments: [ksz_cln_proc1], [0x143642090], [3], [], [], [], [], [], [], [], [], [] Incident details in: /u01/app/oracle/diag/rdbms/remotedb/remotedb2/incident/incdir_122897/remotedb2_pmon_9707_i122897.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Errors in file /u01/app/oracle/diag/rdbms/remotedb/remotedb2/trace/remotedb2_pmon_9707.trc: ORA-00600: internal error code, arguments: [ksz_cln_proc1], [0x143642090], [3], [], [], [], [], [], [], [], [], [] PMON (ospid: 9707): terminating the instance due to error 472 Wed Jun 05 03:27:05 2013 System state dump requested by (instance=2, osid=9707 (PMON)), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/oracle/diag/rdbms/remotedb/remotedb2/trace/remotedb2_diag_9759.trc Instance terminated by PMON, pid = 9707 Wed Jun 05 03:45:58 2013 Starting ORACLE instance (normal)
At the same time resources at Cluster level start failing and node was evicted by itself. Real Application Cluster Log files
[/u01/app/11.2.0.3/grid/bin/oraagent.bin(96437)]CRS-5011:Check of resource "PCAB1DE" failed: details at "(:CLSN00007:)" in "/u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log" 2013-06-05 03:27:06.649 [/u01/app/11.2.0.3/grid/bin/oraagent.bin(96437)]CRS-5011:Check of resource "P1PANIE" failed: details at "(:CLSN00007:)" in "/u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log" 2013-06-05 03:32:41.360 [/u01/app/11.2.0.3/grid/bin/oraagent.bin(96437)]CRS-5011:Check of resource "P1ESXPE" failed: details at "(:CLSN00007:)" in "/u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log" 2013-06-05 03:36:25.675 [cssd(67762)]CRS-1662:Member kill requested by node exs2db03 for member number 1, group DBPCAB1DE 2013-06-05 03:36:58.775 [/u01/app/11.2.0.3/grid/bin/oraagent.bin(96437)]CRS-5818:Aborted command 'clean' for resource 'ora.pcab1de.db'. Details at (:CRSAGF00113:) {0:6:28} in /u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log. 2013-06-05 03:37:06.784 [/u01/app/11.2.0.3/grid/bin/oraagent.bin(96437)]CRS-5818:Aborted command 'clean' for resource 'ora.p1panie.db'. Details at (:CRSAGF00113:) {0:6:29} in /u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log. 2013-06-05 03:37:26.307 [cssd(67762)]CRS-1608:This node was evicted by node 3, exs2db03; details at (:CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log. 2013-06-05 03:37:26.318 [cssd(67762)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log 2013-06-05 03:37:26.318 [cssd(67762)]CRS-1652:Starting clean up of CRSD resources. 2013-06-05 03:37:26.340 [cssd(67762)]CRS-1608:This node was evicted by node 3, exs2db03; details at (:CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log. 2013-06-05 03:37:26.450 [cssd(67762)]CRS-1608:This node was evicted by node 3, exs2db03; details at (:CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log. 2013-06-05 03:37:26.647 [cssd(67762)]CRS-1608:This node was evicted by node 3, exs2db03; details at (:CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log. 2013-06-05 03:37:26.869 [cssd(67762)]CRS-1608:This node was evicted by node 3, exs2db03; details at (:CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log. 2013-06-05 03:37:26.921The main reason for this issue was "DSKM process appears to be hung.", but definately this was not because of high load Since OS Watcher at this time doesn't show any load.
So, I believe this could be due to some bug in database.
In few of the cases, bugs could be the reason for node reboot, bug may be at Database level, ASM level or at Real Application Cluster level. Here, after initial investigation from Database Administrator side, DBA should open an SR with Oracle Support.
Denizli
ReplyDeleteKonya
Denizli
ısparta
Bayburt
E0D
ankara parça eşya taşıma
ReplyDeletetakipçi satın al
antalya rent a car
antalya rent a car
ankara parça eşya taşıma
SMWYC
A1990
ReplyDeleteOrdu Evden Eve Nakliyat
Bayburt Evden Eve Nakliyat
Tunceli Evden Eve Nakliyat
Sakarya Evden Eve Nakliyat
Silivri Duşa Kabin Tamiri
FF0F8
ReplyDeleteSamsun Lojistik
Bitfinex Güvenilir mi
Antalya Rent A Car
Balıkesir Şehirler Arası Nakliyat
Silivri Parke Ustası
Muğla Evden Eve Nakliyat
Muş Evden Eve Nakliyat
Silivri Çatı Ustası
Amasya Lojistik
6664F
ReplyDeleteKarapürçek Fayans Ustası
Paribu Güvenilir mi
Iğdır Parça Eşya Taşıma
Çorum Parça Eşya Taşıma
Samsun Lojistik
Bitget Güvenilir mi
Nevşehir Lojistik
Çerkezköy Sineklik
Ünye Çatı Ustası
8C776
ReplyDeleteBolu Şehir İçi Nakliyat
Edirne Parça Eşya Taşıma
Şırnak Şehir İçi Nakliyat
Sincan Fayans Ustası
Bolu Şehirler Arası Nakliyat
Erzurum Şehir İçi Nakliyat
Iğdır Şehirler Arası Nakliyat
Elazığ Parça Eşya Taşıma
Erzurum Parça Eşya Taşıma
7C867
ReplyDeleteGümüşhane Şehirler Arası Nakliyat
Çerkezköy Motor Ustası
Çerkezköy Mutfak Dolabı
Kayseri Evden Eve Nakliyat
Bayburt Parça Eşya Taşıma
Kırşehir Parça Eşya Taşıma
Sakarya Şehirler Arası Nakliyat
Altındağ Fayans Ustası
Coin Nedir
0639A
ReplyDeleteMuğla Lojistik
Erzincan Şehir İçi Nakliyat
Kars Şehirler Arası Nakliyat
Binance Referans Kodu
Bitranium Coin Hangi Borsada
Samsun Lojistik
İzmir Şehir İçi Nakliyat
Karapürçek Boya Ustası
Rize Şehirler Arası Nakliyat
001AE
ReplyDeleteAlya Coin Hangi Borsada
Bilecik Evden Eve Nakliyat
Tokat Şehir İçi Nakliyat
Bursa Şehir İçi Nakliyat
Erzincan Evden Eve Nakliyat
Giresun Lojistik
Isparta Evden Eve Nakliyat
Bayburt Şehirler Arası Nakliyat
Sinop Şehir İçi Nakliyat
D9EBF
ReplyDeleteKırşehir Lojistik
Sinop Evden Eve Nakliyat
Arg Coin Hangi Borsada
Denizli Parça Eşya Taşıma
Kilis Lojistik
Yenimahalle Parke Ustası
Tokat Evden Eve Nakliyat
Konya Şehir İçi Nakliyat
Mexc Güvenilir mi
9DE0D
ReplyDeleteMuğla Parça Eşya Taşıma
Çanakkale Evden Eve Nakliyat
Kayseri Şehir İçi Nakliyat
Artvin Şehir İçi Nakliyat
Amasya Şehirler Arası Nakliyat
Muş Parça Eşya Taşıma
Iğdır Şehirler Arası Nakliyat
Aydın Şehir İçi Nakliyat
Kilis Şehir İçi Nakliyat
659FC
ReplyDeleteBitlis Evden Eve Nakliyat
Elazığ Evden Eve Nakliyat
Ünye Marangoz
Denizli Şehir İçi Nakliyat
Yozgat Parça Eşya Taşıma
Yalova Parça Eşya Taşıma
Çerkezköy Cam Balkon
Yenimahalle Parke Ustası
Erzurum Evden Eve Nakliyat
D152E
ReplyDeletesamsun canlı sohbet et
nanytoo sohbet
bursa en iyi ücretsiz görüntülü sohbet siteleri
şırnak görüntülü sohbet sitesi
rize sesli sohbet mobil
hakkari mobil sohbet bedava
mersin bedava görüntülü sohbet sitesi
malatya görüntülü sohbet kadınlarla
ısparta canlı görüntülü sohbet siteleri
9EE81
ReplyDeleteBinance Madenciliği Nedir
Btcturk Borsası Güvenilir mi
Arg Coin Hangi Borsada
Tiktok Takipçi Hilesi
Dlive Takipçi Hilesi
Binance Referans Kodu
Periscope Beğeni Satın Al
Twitter Retweet Hilesi
Bitcoin Madenciliği Siteleri
91B73
ReplyDeleteTiktok Beğeni Hilesi
Threads İzlenme Hilesi
Görüntülü Sohbet Parasız
Mexc Borsası Güvenilir mi
Coin Nasıl Çıkarılır
Sui Coin Hangi Borsada
Coin Madenciliği Siteleri
Periscope Beğeni Hilesi
Tiktok Takipçi Satın Al
شركة عزل اسطح بالهفوف zi7MmQef25
ReplyDelete