In an Oracle cluster prior to version 12.1.0.2c, when a split brain problem occurs, the node with the lowest node number survives. However, starting from Oracle Database 12.1.0.2c, the node with the higher weight survives during split brain resolution.
In this article I will explore this new feature using one of the possible factors contributing to the node weight, i.e. the number of database services executing on a node.
What is Split Brain?
In a cluster, a private interconnect is used by cluster nodes to monitor each node’s status and communicate with each other. When two or more nodes fail to ping or connect to each other via this private interconnect, the cluster gets partitioned into two or more smaller sub-clusters each of which cannot talk to others over the interconnect. Oblivious of the existence of other cluster fragments, each sub-cluster continues to operate independently of the others. This is called “Split Brain”. In such a scenario, integrity of the cluster and its data might be compromised due to uncoordinated writes to shared data by independently operating nodes. Hence, to protect the integrity of the cluster and its data, the split-brain must be resolved.
How does the Oracle Grid Infrastructure Clusterware resolve a “split brain” situation?
The voting disk is used by the Oracle Cluster Synchronization Services daemon (ocssd) on each node to mark its own attendance and to record the nodes it can communicate with. In a “split brain” situation, the voting disk is used to determine which node(s) will survive and which node(s) will be evicted.
Prior to Oracle Database 12.1.0.2c, the algorithm to determine the node(s) to be retained / evicted is as follows:
- If the sub-clusters are of different sizes, the clusterware identifies the largest sub-cluster and aborts all the nodes that do not belong to that sub-cluster.
- If all the sub-clusters are of the same size, the sub-cluster containing the lowest numbered node survives, so that, in a 2-node cluster, the node with the lowest node number survives.
However, starting from 12.1.0.2c, the node eviction algorithm used in case of split brain has been improved. In order to keep the largest number of resources available to the users, a node weight is computed for each node based on the number of resources executing on it, and the sub-cluster with the higher weight survives.
Starting in Oracle Database 12.1.0.2c, the new algorithm to determine the node(s) to be retained / evicted is as follows:
- If the sub-clusters are of different sizes, the functionality is the same as earlier, i.e. the clusterware identifies the largest sub-cluster and aborts all the nodes that do NOT belong to that sub-cluster.
- If all the sub-clusters are of the same size, the functionality has been modified as follows:
  - If the sub-clusters have equal node weights, the sub-cluster containing the lowest numbered node survives, so that, in a 2-node cluster, the node with the lowest node number survives.
  - If the sub-clusters have unequal node weights, the sub-cluster with the higher weight survives, so that, in a 2-node cluster, the node with the lowest node number might be evicted if it has a lower weight.
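The decision order in the list above can be sketched as a small shell function. This is only an illustration of the order in which the criteria are applied, not Oracle's implementation: `pick_survivor` is a hypothetical helper, and representing the weight of a sub-cluster as a single integer is an assumption made for the sketch.

```shell
# Hypothetical helper (not an Oracle tool): choose the surviving sub-cluster.
# Input on stdin, one sub-cluster per line:
#   label  size  weight  lowest_node_number
# Assumption: the weight of a sub-cluster is a single integer.
pick_survivor() {
    awk '
        NR == 1 ||
        $2 + 0 > size ||
        ($2 + 0 == size && $3 + 0 > weight) ||
        ($2 + 0 == size && $3 + 0 == weight && $4 + 0 < low) {
            # Current line beats the best candidate so far:
            # larger size first, then higher weight, then lowest node number.
            label = $1; size = $2 + 0; weight = $3 + 0; low = $4 + 0
        }
        END { print label }
    '
}

# Example: two single-node sub-clusters, the second one with a higher weight.
#   printf '%s\n' 'subA 1 0 1' 'subB 1 1 2' | pick_survivor
```

With equal sizes and equal weights the function falls back to the lowest node number, which matches the pre-12.1.0.2c behaviour.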
Now I will demonstrate this new feature in an Oracle 12.1.0.2c standard 3-node cluster with a RAC database called admindb, using one of the possible factors contributing to the node weight, i.e. the number of database services executing on a node. Since I will only explore the scenarios for which the functionality has been modified, i.e. sub-clusters of equal size, I have shut down one of the nodes so that there are only 2 active nodes in the cluster.
Current scenario:
Name of the cluster: Cluster01.example.com
Number of nodes: 3 (host01, host02, host03)
Name of RAC database: admindb
Instances of RAC database: admindb1 on host01
admindb2 on host02
Overview
- Check that only two nodes (host01 and host02) are active and that host01 has the lower node number
- Create two singleton services for the RAC database admindb
  - Serv1: Preferred instance admindb1
  - Serv2: Preferred instance admindb2
Case-I: Equal number of database services executing on both the nodes
- Start both the services for database admindb so that an equal number of database services execute on both the nodes.
- Simulate loss of connectivity between the two nodes.
- Verify that:
  - host01 is retained as it has the lower node number.
  - host02 is evicted.
Case-II: Unequal number of database services executing on both the nodes
- Stop the service serv1 so that host01 is not hosting any service while service serv2 executes on host02. As a result, an unequal number of database services execute on the two nodes.
- Simulate loss of connectivity between the two nodes.
- Verify that:
  - host02 is retained as it has the higher number of database services executing.
  - host01 is evicted although it has the lower node number.
Demonstration
- Check that only two nodes (host01 and host02) are active and that host01 has the lower node number:
[root@host02 ~]# olsnodes -s -n
host01 1 Active
host02 2 Active
host03 3 Inactive
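For a scripted version of this check, the `olsnodes -s -n` output can be filtered down to the active nodes. The function below is a minimal sketch; `active_nodes` is a hypothetical helper name, and it assumes the three-column layout (name, number, status) shown in the output above.

```shell
# Hypothetical helper: print "name number" for every node reported as Active.
# Reads the output of `olsnodes -s -n` on stdin; assumes the column layout
# name / node number / status seen in this article.
active_nodes() {
    awk '$3 == "Active" { print $1, $2 }'
}

# Usage on a cluster node (as root or the grid owner):
#   olsnodes -s -n | active_nodes
```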
- Create two singleton services for the RAC database admindb:
  - Serv1: Preferred instance admindb1
  - Serv2: Preferred instance admindb2
[oracle@host02 root]$ srvctl add service -s serv1 -d admindb -preferred admindb1
[oracle@host02 root]$ srvctl add service -s serv2 -d admindb -preferred admindb2
Case-I: Equal number of database services executing on both the nodes
We will verify that when an equal number of database services are running on both nodes, the node with the lower node number (host01) survives.
- Verify that admindb is the only database in the cluster having its instances executing on host01 and host02.
[root@host02 ~]# crsctl stat res -n host01 |grep NAME | grep .db
NAME=ora.admindb.db
[root@host02 ~]# crsctl stat res -n host02 |grep NAME | grep .db
NAME=ora.admindb.db
- Start both the services for database admindb so that serv1 executes on host01 and serv2 executes on host02. As a result, an equal number of database services execute on both the nodes.
[oracle@host02 root]$ srvctl start service -d admindb
[oracle@host02 root]$ srvctl status service -d admindb
Service serv1 is running on instance(s) admindb1
Service serv2 is running on instance(s) admindb2
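Since the number of running services per node is the factor under test, it can help to count them from the status output instead of reading it by eye. The following is a sketch only: `services_per_instance` is a hypothetical helper, and it relies on the "is running on instance(s)" wording shown in the output above.

```shell
# Hypothetical helper: count running services per instance.
# Reads the output of `srvctl status service -d <db>` on stdin and relies on
# the "Service X is running on instance(s) A,B" wording seen in this article.
services_per_instance() {
    awk '/ is running on instance\(s\) / {
             n = split($NF, inst, ",")          # last field: instance list
             for (i = 1; i <= n; i++) count[inst[i]]++
         }
         END { for (name in count) print name, count[name] }' | sort
}

# Usage:
#   srvctl status service -d admindb | services_per_instance
```

Lines such as "Service serv1 is not running." do not match the pattern, so a stopped service simply does not contribute to any instance's count.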
- Find out the name of the private network interface:
[root@host01 ~]# oifcfg getif
eth0 192.9.201.0 global public
eth1 10.0.0.0 global cluster_interconnect
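If the interface name is needed in a script, it can be extracted from the `oifcfg getif` output. This is a minimal sketch; `interconnect_if` is a hypothetical helper, and it assumes the interface name is the first column and the interface type the last, as in the output above.

```shell
# Hypothetical helper: print the interface(s) registered as cluster interconnect.
# Reads the output of `oifcfg getif` on stdin; assumes the interface name is
# the first column and the interface type is the last column.
interconnect_if() {
    awk '$NF ~ /cluster_interconnect/ { print $1 }'
}

# Usage:
#   oifcfg getif | interconnect_if
```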
- To simulate loss of connectivity between the two nodes, bring down the private network interface on one of the nodes:
[root@host01 ~]# ifdown eth1
- Verify that host01 is retained as it has a lower node number and host02 is evicted:
--Ocssd log of host01
[root@host01 ~]# vi /u01/app/grid/diag/crs/host01/crs/trace/ocssd.trc
2015-12-29 15:20:44.374229 : CSSD:1126267200: clssnmrCheckSplit: Waiting for node weights, stamp(346867999)
2015-12-29 15:20:44.499676 : CSSD:1079707968: clssnmvDiskPing: Writing with status 0x3, timestamp 1451382644/4294773370
2015-12-29 15:20:44.502647 : CSSD:1076521280: clssnmvDiskPing: Writing with status 0x3, timestamp 1451382644/4294773370
2015-12-29 15:20:44.502702 : CSSD:1116805440: clssnmvDiskPing: Writing with status 0x3, timestamp 1451382644/4294773370
2015-12-29 15:20:44.868385 : CSSD:1121536320: clssnmvDHBValidateNCopy: node 2, host02, has a disk HB, but no network HB, DHB has rcfg 346868000, wrtcnt, 499999, LATS 4294773740, lastSeqNo 499995, uniqueness 1451382179, timestamp 1451382644/4294615110
2015-12-29 15:20:44.868456 : CSSD:1126267200: clssnmCheckSplit: nodenum 3 curts_ms -193556 readts_ms -193556
2015-12-29 15:20:44.868462 : CSSD:1126267200: clssnmCheckSplit: Node 3, host03 removed
2015-12-29 15:20:44.868496 : CSSD:1126267200: clssnmrCheckNodeWeight: node(1) has weight stamp(346867999), pebble(2)
2015-12-29 15:20:44.868499 : CSSD:1126267200: clssnmrCheckNodeWeight: node(2) has weight stamp(346867999), pebble(0)
2015-12-29 15:20:44.868501 : CSSD:1126267200: clssnmrCheckNodeWeight: stamp(346867999), completed(2/2)
2015-12-29 15:20:44.868517 : CSSD:1126267200: clssnmCheckDskInfo: My cohort: 1
2015-12-29 15:20:44.868520 : CSSD:1126267200: clssnmRemove: Start
2015-12-29 15:20:44.868525 : CSSD:1126267200: (:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, host02, from the cluster in incarnation 346868000, node birth incarnation 346867999, death incarnation 346868000, stateflags 0x224000 uniqueness value 1451382179
[root@host01 ~]# olsnodes -s -n
host01 1 Active
host02 2 Inactive
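Rather than paging through the whole trace in vi, the lines relevant to split brain resolution can be filtered out. The sketch below is an assumption-laden convenience, not an Oracle utility: `split_brain_events` is a hypothetical helper, and the function names it searches for are taken only from the trace excerpts in this article, so other versions may log different names.

```shell
# Hypothetical helper: show only the trace lines about node weights and
# evictions. The patterns come from the ocssd.trc excerpts in this article
# and may differ in other Grid Infrastructure versions.
split_brain_events() {
    grep -E 'clssnmrCheckSplit|clssnmrCheckNodeWeight|clssnmrRemoveNode' "$1"
}

# Usage:
#   split_brain_events /u01/app/grid/diag/crs/host01/crs/trace/ocssd.trc
```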
Hence, we observed that when an equal number of database services were running on both nodes, the node with the lower node number (host01) survived.
Case-II: Unequal number of database services executing on both the nodes
We will verify that when an unequal number of database services are running on the two nodes, the node hosting the higher number of database services survives even if it has a higher node number.
- Stop the service serv1 so that host01 is not hosting any service while service serv2 executes on host02. As a result, an unequal number of database services execute on the two nodes.
[oracle@host02 root]$ srvctl stop service -s serv1 -d admindb
[oracle@host02 root]$ srvctl status service -d admindb
Service serv1 is not running.
Service serv2 is running on instance(s) admindb2
- To simulate loss of connectivity between the two nodes, bring down the private network interface on one of the nodes:
[root@host01 ~]# ifdown eth1
- Verify that host02 is retained as it has the higher number of database services executing and host01 is evicted although it has the lower node number:
[root@host02 ~]# olsnodes -s -n
host01 1 Inactive
host02 2 Active
- OCSSD Log of host02:
[root@host02 ~]# vi /u01/app/grid/diag/crs/host02/crs/trace/ocssd.trc
2015-11-30 15:39:39.666779 : CSSD:1124809024: clssnmrCheckSplit: Waiting for node weights, stamp(344360122)
2015-11-30 15:39:40.058235 : CSSD:1124809024: clssnmrCheckNodeWeight: node(1) has weight stamp(0), pebble(0)
2015-11-30 15:39:40.058243 : CSSD:1124809024: clssnmrCheckNodeWeight: node(2) has weight stamp(344360122), pebble(1)
2015-11-30 15:39:40.058245 : CSSD:1124809024: clssnmrCheckNodeWeight: stamp(344360122), completed(1/2)
2015-11-30 15:39:40.058247 : CSSD:1124809024: clssnmrCheckSplit: Waiting for node weights, stamp(344360122)
2015-11-30 15:39:40.077691 : CSSD:1090804032: clssnmvDiskKillCheck: not evicted, file ORCL:ASMDISK03 flags 0x00000000, kill block unique 0, my unique 1448874242
2015-11-30 15:39:40.077791 : CSSD:1116924224: clssnmvDiskKillCheck: not evicted, file ORCL:ASMDISK01 flags 0x00000000, kill block unique 0, my unique 1448874242
2015-11-30 15:39:40.077791 : CSSD:1116924224: clssnmvDiskKillCheck: not evicted, file ORCL:ASMDISK01 flags 0x00000000, kill block unique 0, my unique 1448874242
2015-11-30 15:39:40.077923 : CSSD:1095665984: clssnmvDiskKillCheck: not evicted, file ORCL:ASMDISK02 flags 0x00000000, kill block unique 0, my unique 1448874242
2015-11-30 15:39:40.092015 : CSSD:1083697472: clssnmvDiskPing: Writing with status 0x3, timestamp 1448878180/3431154
2015-11-30 15:39:40.113021 : CSSD:1098819904: clssnmvDiskPing: Writing with status 0x3, timestamp 1448878180/3431174
2015-11-30 15:39:40.114578 : CSSD:1088051520: clssnmvDiskPing: Writing with status 0x3, timestamp 1448878180/3431174
2015-11-30 15:39:40.117006 : CSSD:1118501184: clssnmvDHBValidateNCopy: node 1, host01, has a disk HB, but no network HB, DHB has rcfg 344360123, wrtcnt, 743780, LATS 3431184, lastSeqNo 743777, uniqueness 1448877474, timestamp 1448878179/3142874
2015-11-30 15:39:40.117016 : CSSD:1118501184: clssnmvReadDskHeartbeat: manual shutdown of nodename host01, nodenum 1 epoch 1448878179 msec 3142874
2015-11-30 15:39:40.117357 : CSSD:1124809024: clssnmrCheckNodeWeight: node(2) has weight stamp(344360122), pebble(1)
2015-11-30 15:39:40.117361 : CSSD:1124809024: clssnmrCheckNodeWeight: stamp(344360122), completed(1/1)
2015-11-30 15:39:40.117376 : CSSD:1124809024: clssnmCheckDskInfo: My cohort: 2
2015-11-30 15:39:40.117379 : CSSD:1124809024: clssnmRemove: Start
2015-11-30 15:39:40.117383 : CSSD:1124809024: (:CSSNM00007:)clssnmrRemoveNode: Evicting node 1, host01, from the cluster in incarnation 344360123, node birth incarnation 344360122, death incarnation 344360123, stateflags 0x225000 uniqueness value 1448877474
Thus, we observed that when an unequal number of database services were running on the two nodes, the node with the higher number of database services survived even though it had a higher node number.
Summary:
Starting from 12.1.0.2, the algorithm used during split brain resolution to decide which nodes are evicted or retained is as follows:
- If the sub-clusters are of different sizes, the functionality is the same as earlier, i.e. the clusterware identifies the largest sub-cluster and aborts all the nodes that do not belong to that sub-cluster.
- If all the sub-clusters are of the same size, the functionality has been modified as follows:
  - If the sub-clusters have equal node weights, the sub-cluster containing the lowest numbered node survives, so that, in a 2-node cluster, the node with the lowest node number survives.
  - If the sub-clusters have unequal node weights, the sub-cluster with the higher weight survives, so that, in a 2-node cluster, the node with the lowest node number might be evicted if it has a lower weight.