A Case Study of Recovering Oracle RAC Cluster OCR Registry Information

数据库运维之道
2025-01-10
Foreword: I recently helped a customer work through two incidents of abnormal cluster OCR/voting-disk data, and this post takes the opportunity to walk through the handling process. The production database in this case is version 11.2.0.4 in a two-node deployment.
I. How It Works
The Oracle Cluster Registry (OCR) stores all the configuration information for the RAC cluster database. Because this configuration is common to, and shared by, multiple nodes, the OCR resides on shared disk.

Since the OCR lives on shared storage, every node in the cluster accesses it through its local CRSD process, which maintains a cached copy of the OCR in memory. To protect OCR integrity, Oracle does not allow every node to operate on the OCR directly: only one process, known as the OCR Master, may modify it. The OCR Master is responsible for refreshing its own local OCR cache as well as the OCR caches of the other nodes in the cluster. Put another way, OCR clients on each node always query the local OCR cache; when a node needs to make changes to the OCR, its local CRSD process communicates with the OCR Master process to apply them.

II. Common Management Commands

1. Using ocrdump

The OCR stores ordinary key-value pairs, organized as a tree with three branches: SYSTEM, DATABASE and CRS. Use the ocrdump command to export the whole registry, or pass parameters to export just one branch. Run with no arguments, it generates an OCRDUMPFILE in the current directory; the options below tailor the output as needed.

stdout    print the contents to the screen without generating a file
filename  write the contents to a specified file (e.g. ocrdump.txt)
keyname   print only the specified key and its subkeys
xml       output in XML format

Example: ocrdump -keyname SYSTEM -xml ocrdump.txt

Command: ocrdump
Command: cat OCRDUMPFILE

Check the cluster version: crsctl query crs activeversion

2. OCR Health Check

ocrcheck verifies the consistency of the OCR contents; run it directly with no parameters. Running the command also generates a log file, for example:

Command: ocrcheck
Command: find /u01 -name ocrcheck_*
/u01/app/11.2.0/grid/log/nhzhbg-db1/client/ocrcheck_88595.log

# more ocrcheck_88595.log
Oracle Database 11g Clusterware Release 11.2.0.4.0 - Production Copyright 1996, 2011 Oracle. All rights reserved.
2025-01-09 22:06:04.415: [OCRCHECK][1104815936]ocrcheck starts...
2025-01-09 22:06:06.092: [OCRCHECK][1104815936]protchcheck: OCR status : total = [262120], used = [2956], avail = [259164]
2025-01-09 22:06:27.855: [OCRCHECK][1104815936]Exiting [status=success]...

3. Manually Backing Up the OCR

OCR configuration is done with ocrconfig, which must be run as root.

Command: ocrconfig -manualbackup
2025/01/09 22:14:00     /u01/app/11.2.0/grid/cdata/nhzhbg-cluster1/backup_20250109_221400.ocr

Check the OCR backups. Normally an automatic backup is taken every 4 hours.
Command: ocrconfig -showbackup
2025/01/07 04:23:03     /u01/app/11.2.0/grid/cdata/nhzhbg-cluster1/backup00.ocr
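The retention scheme behind ocrconfig -showbackup can be sketched as follows. This is a hedged illustration, not Oracle's actual code: the three newest 4-hourly backups rotate through the fixed names backup00.ocr, backup01.ocr and backup02.ocr (alongside day.ocr and week.ocr copies), which is the naming you see in the cdata directory. Exercised here on plain files in a temp directory.

```shell
# Hypothetical sketch of the 4-hourly OCR backup rotation; file names
# mirror the cdata directory, the rotation logic itself is an assumption.
dir=$(mktemp -d)
rotate() {
    if [ -f "$dir/backup01.ocr" ]; then mv "$dir/backup01.ocr" "$dir/backup02.ocr"; fi
    if [ -f "$dir/backup00.ocr" ]; then mv "$dir/backup00.ocr" "$dir/backup01.ocr"; fi
    printf '%s' "$1" > "$dir/backup00.ocr"   # newest backup always lands in backup00
}
rotate "04:00"; rotate "08:00"; rotate "12:00"; rotate "16:00"
ls "$dir"                 # backup00.ocr backup01.ocr backup02.ocr
cat "$dir/backup02.ocr"   # oldest surviving 4-hourly backup: 08:00
```

With only three rotating slots, anything older than roughly 12 hours survives only as the day.ocr or week.ocr copy, which is why a manual backup before risky maintenance is still worthwhile.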

4. OCR Repair

"Amnesia" occurs when one node updates part of the OCR, for example after adding a new disk, while another node is down for a reboot or some other failure and never receives the update in time; the failed node therefore does not know the OCR has changed. This situation is called amnesia.

ocrconfig -repair
# repair the primary OCR
ocrconfig -repair ocr device_name
# repair the mirror OCR
ocrconfig -repair ocrmirror device_name

5. OCR Replacement

ocrconfig -replace
ocrconfig -replace ocr
ocrconfig -replace ocrmirror

III. Case Intake

1. The monitoring platform detected that both database servers rebooted abnormally at the same time. Logging in to the operating system, we found the cluster already in an abnormal state: the OHASD, CRS, CSS and EVM services could no longer communicate.

Command: crsctl check crs
Output:
CRS-4638: Cannot communicate with Oracle High Availability Services
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
Command: ps -ef | grep d.bin and ps -ef | grep ohasd
Command: crsctl stat res -t
Output:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

2. We promptly checked the clusterware alert log (alert<hostname>.log) and saw that the CSS cluster synchronization services daemon had hit a fatal error and terminated.

2025-01-07 11:16:04.793
[cssd(2738)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/<node1-hostname>/cssd/ocssd.log
2025-01-07 11:16:04.910
[cssd(2738)]CRS-1652:Starting clean up of CRSD resources.
2025-01-07 11:16:05.409
[cssd(2738)]CRS-1604:CSSD voting file is offline: /dev/oracleasm/disks/VOTE02; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/<node1-hostname>/cssd/ocssd.log.
2025-01-07 11:16:05.409
[cssd(2738)]CRS-1604:CSSD voting file is offline: /dev/oracleasm/disks/VOTE03; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/<node1-hostname>/cssd/ocssd.log.

3. Looking further into the cssd log, the error stack shows the cluster failed right after the call to clssnmvReadDskHeartbeat, the disk-heartbeat check routine. When the disk heartbeat check fails, the corresponding protection mechanism fires: the node is evicted from the cluster and the server reboots. This mechanism is essential for preserving cluster stability and data integrity.

OS log:
Jan  7 11:16:08 abrt-hook-ccpp: Process 2738 (ocssd.bin) of user 54322 killed by SIGABRT - dumping core

CSSD log:
2025-01-07 11:16:04.793: [    CSSD][3252147968]clssnmvReadDskHeartbeat: node 3 record written by node 2
2025-01-07 11:16:04.793: [    CSSD][3252147968]clssnmRemoveNodeInTerm: node 1, <node1-hostname> terminated due to Normal Shutdown. Removing from member and connected bitmaps
2025-01-07 11:16:04.793: [    CSSD][3252147968]###################################
2025-01-07 11:16:04.793: [    CSSD][3252147968]clssscExit: CSSD aborting from thread clssnmvWorkerThread
2025-01-07 11:16:04.793: [    CSSD][3252147968]###################################
2025-01-07 11:16:04.793: [    CSSD][3252147968](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

Call stack trace:
clone -> start_thread -> clssscthrdmain -> clssnmvWorkerThread -> clssnmvReadDskHeartbeat -> clssscExit
4. Further down in cssd.log, the three OCR quorum disks (voting disks) showed inconsistent data; two of them were taken offline, which no longer satisfied the cluster's redundancy requirement.
2025-01-07 11:16:05.408: [    CSSD][3236308736]clssnmRemoveNodeInTerm: node 1, nhzhbg-db1 terminated due to Normal Shutdown. Removing from member and connected bitmaps
2025-01-07 11:16:05.408: [    CSSD][3236308736]clssscExit: abort already set 1
2025-01-07 11:16:05.409: [    CSSD][3241056000](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 251143204 ms for voting file /dev/oracleasm/disks/VOTE02)
2025-01-07 11:16:05.409: [    CSSD][3241056000]clssnmvDiskAvailabilityChange: voting file /dev/oracleasm/disks/VOTE02 now offline
2025-01-07 11:16:05.409: [    CSSD][3241056000](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 1643987470 ms for voting file /dev/oracleasm/disks/VOTE03)
2025-01-07 11:16:05.409: [    CSSD][3241056000]clssnmvDiskAvailabilityChange: voting file /dev/oracleasm/disks/VOTE03 now offline
2025-01-07 11:16:05.409: [    CSSD][3241056000](:CSSNM00018:)clssnmvDiskCheck: Aborting. 1 of 3 configured voting disks available, need 2
2025-01-07 11:16:05.409: [    CSSD][3241056000]clssnmRemoveNodeInTerm: node 1, nhzhbg-db1 terminated due to Normal Shutdown. Removing from member and connected bitmaps
2025-01-07 11:16:05.409: [    CSSD][3241056000]clssscExit: abort already set 1
2025-01-07 11:16:05.409: [   SKGFD][3250554624]Lib :UFS:: closing handle 0x7fc698039dc0 for disk :/dev/oracleasm/disks/VOTE02:
2025-01-07 11:16:05.409: [   SKGFD][3247400704]Lib :UFS:: closing handle 0x7fc68808e1f0 for disk :/dev/oracleasm/disks/VOTE02:
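The abort above follows the CSS voting-disk majority rule: with V configured voting files, CSSD must see a strict majority, V/2 + 1, online, which is exactly the "1 of 3 configured voting disks available, need 2" message in the log. A minimal sketch of the arithmetic (the function name is ours, for illustration only):

```shell
# Majority rule for CSS voting disks: need floor(V/2) + 1 of V online.
# votes_needed is a hypothetical helper, not an Oracle utility.
votes_needed() {
    echo $(( $1 / 2 + 1 ))
}
votes_needed 1   # -> 1
votes_needed 3   # -> 2
votes_needed 5   # -> 3
```

This is also why voting disks are always configured in odd numbers: 3 disks tolerate 1 failure, 5 tolerate 2, while an even count buys no extra tolerance.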
IV. Case Handling

After several attempts to restart the clusterware on a single node, the cluster still would not come up, so the OCR had to be rebuilt.

1. Check the OCR Backup Information

[root@node1 ~]# ocrconfig -showbackup
nhzhbg-db2     2025/01/07 04:23:03     /u01/app/11.2.0/grid/cdata/nhzhbg-cluster1/backup00.ocr
nhzhbg-db2     2025/01/07 00:23:01     /u01/app/11.2.0/grid/cdata/nhzhbg-cluster1/backup01.ocr
nhzhbg-db2     2025/01/06 20:23:00     /u01/app/11.2.0/grid/cdata/nhzhbg-cluster1/backup02.ocr
nhzhbg-db2     2025/01/06 00:22:59     /u01/app/11.2.0/grid/cdata/nhzhbg-cluster1/day.ocr
nhzhbg-db2     2024/12/26 04:22:24     /u01/app/11.2.0/grid/cdata/nhzhbg-cluster1/week.ocr
PROT-25: Manual backups for the Oracle Cluster Registry are not available

2. Remove or Wipe the Three OCR ASM Disks

Check the disks currently attached to the server:
[root@ ~]# lsblk
NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
fd0               2:0    1    4K  0 disk
sda               8:0    0   10G  0 disk
└─sda1            8:1    0   10G  0 part
sdb               8:16   0   10G  0 disk
└─sdb1            8:17   0   10G  0 part
sdc               8:32   0   10G  0 disk
└─sdc1            8:33   0   10G  0 part
sdd               8:48   0  500G  0 disk
└─sdd1            8:49   0  500G  0 part
sde               8:64   0  200G  0 disk
└─sde1            8:65   0  200G  0 part
sr0              11:0    1 1024M  0 rom
vda             252:0    0   80G  0 disk
├─vda1          252:1    0    1G  0 part /boot
└─vda2          252:2    0   78G  0 part
  ├─centos-root 253:0    0   62G  0 lvm  /
  └─centos-swap 253:1    0   16G  0 lvm  [SWAP]
Wipe with dd (in production, confirm the devices before running this, or data will be lost):
dd if=/dev/zero of=/dev/sda1 bs=1024k count=1024
[root@ ~]# dd if=/dev/zero of=/dev/sda1 bs=1024k count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 1.32494 s, 810 MB/s
[root@ ~]# dd if=/dev/zero of=/dev/sdb1 bs=1024k count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 1.33162 s, 806 MB/s
[root@ ~]# dd if=/dev/zero of=/dev/sdc1 bs=1024k count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 1.39061 s, 772 MB/s
oracleasm deletedisk OCR1
oracleasm deletedisk OCR2
oracleasm deletedisk OCR3
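What the dd wipe above does can be demonstrated safely on a scratch file instead of a real partition (on /dev/sdX1 the overwrite is irreversible, so the production command must only run after the device paths are double-checked):

```shell
# Sketch: zeroing the leading bytes destroys the ASM disk header, which is
# all oracleasm needs gone before the partition can be re-provisioned.
# Exercised on a temp file; the header string below is fake.
f=$(mktemp)
printf 'FAKE-ASM-DISK-HEADER' > "$f"
# conv=notrunc overwrites in place, mirroring how dd behaves on a block device
dd if=/dev/zero of="$f" bs=1024 count=4 conv=notrunc 2>/dev/null
od -An -tx1 -N4 "$f" | tr -d ' '   # leading header bytes are now 00000000
```

Wiping only the first gigabyte, as in the case above, is enough because the ASM metadata sits at the front of the partition; the rest of the disk need not be touched.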

3. Create New ASM Disks

# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Cleaning disk "VOTE01"
Cleaning disk "VOTE02"
Cleaning disk "VOTE03"
Scanning system for ASM disks...
# oracleasm createdisk OCR1 /dev/sda1
Writing disk header: done
Instantiating disk: done

# oracleasm createdisk OCR2 /dev/sdb1
Writing disk header: done
Instantiating disk: done

# oracleasm createdisk OCR3 /dev/sdc1
Writing disk header: done
Instantiating disk: done
# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...

# oracleasm listdisks
ARCH
DATA
OCR1
OCR2
OCR3

4. Start the Clusterware on a Single Node in Exclusive Mode

# crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.mdnsd' on 'node1'
CRS-2676: Start of 'ora.mdnsd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'node1'
CRS-2676: Start of 'ora.gpnpd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
CRS-2672: Attempting to start 'ora.gipcd' on 'node1'
CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2676: Start of 'ora.gipcd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'node1'
CRS-2672: Attempting to start 'ora.diskmon' on 'node1'
CRS-2676: Start of 'ora.diskmon' on 'node1' succeeded
CRS-2676: Start of 'ora.cssd' on 'node1' succeeded
CRS-2679: Attempting to clean 'ora.cluster_interconnect.haip' on 'node1'
CRS-2672: Attempting to start 'ora.ctssd' on 'node1'
CRS-2681: Clean of 'ora.cluster_interconnect.haip' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'node1'
CRS-2676: Start of 'ora.ctssd' on 'node1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'node1'
CRS-2676: Start of 'ora.asm' on 'node1' succeeded

# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager

# crsctl status res -init

5. Recreate the disk group, keeping the same name as the old OCR disk group. An error came up while running the command.

SQL> create diskgroup VOTE normal redundancy disk '/dev/oracleasm/disks/OCR1','/dev/oracleasm/disks/OCR2','/dev/oracleasm/disks/OCR3' attribute 'compatible.rdbms' = '11.2', 'compatible.asm' = '11.2';
*
ERROR at line 1:
ORA-15018: diskgroup cannot be created
ORA-15020: discovered duplicate ASM disk "OCR_0002"

col path for a60
set linesize 1000
select group_number,DISK_NUMBER,HEADER_STATUS,NAME,PATH,TOTAL_MB from v$asm_disk;

GROUP_NUMBER DISK_NUMBER HEADER_STATU NAME  PATH                         TOTAL_MB
------------ ----------- ------------ ----- ---------------------------- --------
           0           0 MEMBER             ORCL:ARCH                           0
           0           1 MEMBER             ORCL:DATA                           0
           0           2 PROVISIONED        ORCL:OCR1                           0
           0           3 PROVISIONED        ORCL:OCR2                           0
           0           9 MEMBER             /dev/oracleasm/disks/DATA           0
           0           5 PROVISIONED        /dev/oracleasm/disks/OCR3           0
           0           6 PROVISIONED        /dev/oracleasm/disks/OCR2           0
           0           7 PROVISIONED        /dev/oracleasm/disks/OCR1           0
           0           8 MEMBER             /dev/oracleasm/disks/ARCH           0
           0           4 PROVISIONED        ORCL:OCR3                           0

SQL> show parameter asm_diskstring
NAME            TYPE    VALUE
--------------- ------- ------------------------------
asm_diskstring  string

SQL> alter system set asm_diskstring='/dev/oracleasm/disks';
System altered.

GROUP_NUMBER DISK_NUMBER HEADER_STATU NAME      PATH                       TOTAL_MB
------------ ----------- ------------ --------- -------------------------- --------
           3           1 MEMBER       VOTE_0001 /dev/oracleasm/disks/OCR2     10239
           2           0 MEMBER       DATA_0000 /dev/oracleasm/disks/DATA    511999
           3           2 MEMBER       VOTE_0002 /dev/oracleasm/disks/OCR3     10239
           1           0 MEMBER       ARCH_0000 /dev/oracleasm/disks/ARCH    204799
           3           0 MEMBER       VOTE_0000 /dev/oracleasm/disks/OCR1     10239

SQL> create diskgroup VOTE normal redundancy disk '/dev/oracleasm/disks/OCR1','/dev/oracleasm/disks/OCR2','/dev/oracleasm/disks/OCR3' attribute 'compatible.asm'='11.2.0.4.0', 'compatible.rdbms'='11.2.0.4.0';
Diskgroup created.

6. Restore the OCR with the ocrconfig command.

# ocrconfig -restore /u01/app/11.2.0/grid/cdata/rac-cluster/backup00.ocr
# ocrcheck

# crsctl query css votedisk
Located 0 voting disk(s).

# crsctl replace votedisk +VOTE
Successful addition of voting disk 9f86cdc3ee0d4fbdbfdb0ff029c22746.
Successful addition of voting disk 0022be1f0fe34ff8bf0948096e5e8e8c.
Successful addition of voting disk 7f07b3bd08a64f3abfd3102b232b066c.
Successfully replaced voting disk group with +VOTE.
CRS-4266: Voting file(s) successfully replaced

# crsctl query css votedisk
##  STATE    File Universal Id                File Name                   Disk group
--  -----    -----------------                ---------                   ----------
 1. ONLINE   9f86cdc3ee0d4fbdbfdb0ff029c22746 (/dev/oracleasm/disks/OCR1) [VOTE]
 2. ONLINE   0022be1f0fe34ff8bf0948096e5e8e8c (/dev/oracleasm/disks/OCR2) [VOTE]
 3. ONLINE   7f07b3bd08a64f3abfd3102b232b066c (/dev/oracleasm/disks/OCR3) [VOTE]
Located 3 voting disk(s).

7. First create a pfile by hand, then restore the ASM instance's spfile from it. An error came up while running the commands.

# su - grid
Last login: Fri Nov 22 14:05:46 CST 2024

$ cat pfile.ora
+ASM1.asm_diskgroups='DATA','ARCH'#Manual Mount
+ASM2.asm_diskgroups='DATA','ARCH'#Manual Mount
*.asm_diskstring='/dev/oracleasm/disks/*'
*.asm_power_limit=1
*.diagnostic_dest='/u01/app/grid'
*.instance_type='asm'
*.large_pool_size=12M
*.memory_max_target=4294967296
*.memory_target=4294967296
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_max_size=4294967296

SQL> show parameter spfile
NAME    TYPE    VALUE
------- ------- ------------------------------
spfile  string

SQL> create spfile from pfile='/home/grid/pfile.ora';
create spfile from pfile='/home/grid/pfile.ora'
*
ERROR at line 1:
ORA-17502: ksfdcre:4 Failed to create file +VOTE/nyzhbg-cluster1/asmparameterfile/registry.253.1181765237
ORA-15177: cannot operate on system aliases

SQL> create spfile from memory;
File created.

8. Restart the database clusterware; the cluster services now start normally.

# crsctl stop crs -f
# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
# crsctl status res -t -init

9. Because the business had already been switched to the ADG standby and the standby IP had been changed to the SCAN IP, we added a step here to modify the cluster's SCAN IP.

# srvctl disable scan_listener
# srvctl stop scan_listener
# srvctl config scan
# srvctl modify scan -n nhzhbg-cluster1-scan

# vi /etc/hosts
# cat /etc/hosts
xx.xx.xx.xx nhzhbg-cluster1-scan

# srvctl enable scan_listener
# srvctl start scan_listener

10. Bring up the clusterware on the other node.

# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...
Instantiating disk "OCR1"
Instantiating disk "OCR2"
Instantiating disk "OCR3"
# crsctl start crs
11. At this point the database cluster services were fully restored. Rebuilding the cluster voting disks brought the cluster back, which further confirms that inconsistent disk data caused the cluster failure.

That's the end of this post. I hope it helped you as you read it; if you found it useful, share it with the friends and colleagues you care about so we can learn and improve together.

[Disclaimer] Content sourced from the internet.
数据库运维之道: an original technical account in the database field, focused on Oracle, MySQL, TDSQL, HotDB, TiDB, Dameng (DM) and other databases, covering database internals, distributed databases, open-source databases, domestic Chinese databases, and emerging database technology.