ohasd.bin Core dump, Unable To Start HAS After Kernel Panic (Doc ID 1233023.1)



APPLIES TO:

Oracle Server - Enterprise Edition - Version 11.2.0.1 and later [Release 11.2 and later]
Information in this document applies to any platform.

SYMPTOMS

In an Oracle Restart (single-instance) configuration with Grid Infrastructure and ASM 11.2.0.1 on Solaris SPARC 5.10, the server crashed due to a kernel panic. After the server restarted, the HAS stack does not start and ohasd.bin keeps dumping core. As a result, ASM and the database cannot be started.

ohasdOUT.log shows:

2010-09-09 12:06:16
Changing directory to /u01/app/grid/product/11.2.0/grid/log//ohasd
OHASD starting
...
2010-09-09 12:06:16
OHASD reboot
2010-09-09 12:06:20
OHASD handling signal 6
Dumping OHASD state
2010-09-09 12:06:20
Dumping OHASD stack trace
2010-09-09 12:06:20

----- Call Stack Trace ----- (simplified version)

sskgds_getcall: WARNING! *** STACK TRACE TRUNCATED ***
sskgds_getcall: WARNING! *** UNREADABLE FRAME FOUND ***
sclssutl_sigdump()+ 628
sclssutl_signalhandler()+104
__sighndlr()+12
call_user_handler()+992
__lwp_kill()+8
raise()+16
abort()+208
__1cH__CimplRdefault_terminate6F_v_()+4
__1cH__CimplMex_terminate6F_v_()+36
_ex_throw_body()+144
__1cG__CrunIex_throw6Fpvpkn0AQstatic_type_info_pF1_v_v_()+116
__1cDCAAKOwnerEntry2t5B6MrknDstdMbasic_string4Ccn0CLchar_traits4Cc__n0CJallocator4Cc_____v_()+6+660
__1cDCAADAclIaddEntry6MnDstdMbasic_string4Ccn0CLchar_traits4Cc__n0CJallocator4Cc_____v_()+88
__1cDCAADAcl2t5B6MrknDstdMbasic_string4Ccn0CLchar_traits4Cc__n0CJallocator4Cc____rk3r6b_v_()+1076
......

A core dump is generated in the /var/core directory, for example:
/var/core/core__ohasd.bin_1506_1500_1284075685_22100

ohasd.log shows:

2010-09-09 12:06:16.447: [ default][1] OHASD Daemon Starting. Command string :reboot
2010-09-09 12:06:16.535: [ default][1] Initializing OLR
...
2010-09-09 12:06:17.269: [ CRSPE][34] PE MASTER NAME:
2010-09-09 12:06:17.269: [ CRSPE][34] Starting to read configuration
2010-09-09 12:06:17.289: [ CRSPE][34] Reading (1) servers
2010-09-09 12:06:17.349: [ CRSPE][34] DM: set global config version to: 58
2010-09-09 12:06:17.350: [ CRSPE][34] DM: set pool freeze timeout to: 60000
2010-09-09 12:06:17.350: [ CRSPE][34] DM: Set event seq number to: 200000
2010-09-09 12:06:17.350: [ CRSPE][34] DM: Set threshold event seq number to: 280000
2010-09-09 12:06:17.350: [ CRSPE][34] Sent request to write event sequence number 300000 to repository
2010-09-09 12:06:17.671: [ CRSPE][34] Wrote new event sequence to repository
2010-09-09 12:06:17.823: [ CRSPE][34] Reading (12) types
2010-09-09 12:06:17.851: [ CRSPE][34] Reading (1) server pools
2010-09-09 12:06:17.918: [ CRSPE][34] Reading (17) resources
2010-09-09 12:06:20.039: [ CRSPE][34] Finished reading configuration. Parsing...
2010-09-09 12:06:20.039: [ CRSPE][34] Parsing resource types...
2010-09-09 12:06:20.192: [ default][34] Dump State Starting ...
2010-09-09 12:06:20.193: [ CRSPE][34] Dumping PE Data Model...:DM has [0 resources][0 types][0 servers][0 spools]
------------- RESOURCES:

------------- TYPES:

------------- SERVERS:

------------- SERVER POOLS:

2010-09-09 12:06:20.193: [ CRSPE][34] Dumping ICE contents...:ICE operation count: 0
2010-09-09 12:06:20.193: [ default][34] Dump State Done.

CHANGES

The node rebooted due to a kernel panic.

CAUSE

The OLR is corrupted due to the kernel panic.
The core dump occurs during resource type parsing. If the OLR were intact, ohasd.log would show messages like:

2010-07-22 15:45:33.673: [ CRSPE][34] Parsing resource types...
2010-07-22 15:45:33.779: [ CRSPE][34] Resource Types parsed 
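
Whether ohasd got past resource type parsing can be checked for mechanically using the two log messages quoted above. The sketch below is a hypothetical helper, not part of this note; the function name `check_ohasd_log` is an invention, and the ohasd.log path suggested in the usage example follows the usual 11.2 layout but should be verified on your system.

```shell
#!/bin/sh
# Hypothetical helper: classify an ohasd.log based on the two messages
# quoted above. "Parsing resource types..." with no subsequent
# "Resource Types parsed" matches the failure described in this note.
check_ohasd_log() {
    log="$1"
    if grep -q "Resource Types parsed" "$log" 2>/dev/null; then
        echo "OK: resource type parsing completed"
    elif grep -q "Parsing resource types" "$log" 2>/dev/null; then
        echo "SUSPECT: parsing started but never completed (possible OLR corruption)"
    else
        echo "INCONCLUSIVE: no parsing messages found"
    fi
}
```

Typical usage (path is an assumption): check_ohasd_log $GRID_HOME/log/$(hostname)/ohasd/ohasd.log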

SOLUTION

Restore the OLR from the automatic backup as the grid user:

1. Locate automatic OLR backup:
ls -l $GRID_HOME/cdata//

2. Rename the current OLR file
mv $GRID_HOME/cdata/localhost/.olr $GRID_HOME/cdata/localhost/.olr.old

3. Touch the new OLR file:
touch $GRID_HOME/cdata/localhost/.olr

4. Restore the OLR:
ocrconfig -local -restore $GRID_HOME/cdata//backup_20100722_153230.olr

5. Start HAS stack:
crsctl start has

6. Check resource status:
  crsctl stat res -t
Check whether ora.cssd and ora.diskmon are ONLINE; if not, start them manually:
  crsctl start res ora.cssd

7. Depending on which resources (ASM/listener/database) are registered in the OLR, either start them or register them in the OLR via the srvctl command.
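
Steps 2 through 5 above can be sketched as one script. This is a hypothetical wrapper, not an Oracle-supplied tool: it assumes GRID_HOME is set, that the OLR file under $GRID_HOME/cdata/localhost/ is named after the hostname (verify the actual file name on your system), and that the backup file located in step 1 is passed as an argument. With DRY_RUN=1 it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical sketch of steps 2-5; run as the grid user.
# Assumptions: GRID_HOME is set, the OLR file is
# $GRID_HOME/cdata/localhost/<hostname>.olr (verify on your system),
# and the backup file found in step 1 is passed as $1.

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

restore_olr() {
    backup="$1"
    olr="$GRID_HOME/cdata/localhost/$(hostname).olr"
    run mv "$olr" "$olr.old"                  # step 2: move the corrupted OLR aside
    run touch "$olr"                          # step 3: create an empty OLR file
    run ocrconfig -local -restore "$backup"   # step 4: restore from automatic backup
    run crsctl start has                      # step 5: start the HAS stack
}
```

Run it once with DRY_RUN=1 to review the commands against your environment, then rerun without DRY_RUN and continue with steps 6 and 7.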

