High availability of ES cluster

Keywords: ElasticSearch network kafka Nginx

Preface: the ES cluster configuration was covered in the earlier article "ELK+Kafka+Filebeat analysis of Nginx logs". In later experiments I found that, although the cluster starts and processes data normally, as soon as the es1 node goes down the whole cluster stops working and no data can be processed, so the setup is not actually highly available. Let's solve this problem.

Let's look at our previous configuration:
ES1 node:

[root@es1 ~]# cat /opt/soft/elasticsearch-7.6.0/elasticsearch.yml
#Cluster name
cluster.name: my-app
#Node name
node.name: es1
#Whether this node is eligible to act as master
node.master: true
#Whether this node stores data
node.data: true
#Data storage path
path.data: /var/es/data
#Log storage path
path.logs: /var/es/logs
#IP of current node
network.host: 10.1.1.7
#HTTP port this node listens on
http.port: 9200
#Transport port used for node-to-node communication
transport.tcp.port: 9300
#Node(s) allowed to bootstrap the cluster on first startup; the values must match node.name
cluster.initial_master_nodes: ["es1"]
#Addresses (IP:transport port) of the nodes to contact during discovery
discovery.zen.ping.unicast.hosts: ["10.1.1.7:9300","10.1.1.8:9300", "10.1.1.9:9300"]
#Minimum number of master-eligible nodes that must be reachable to form a cluster, i.e. (N/2)+1, which prevents split-brain (multiple masters)
discovery.zen.minimum_master_nodes: 2
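
Once a node with this configuration is up, a quick sanity check (assuming curl is available on the host) is to query the HTTP port defined above:

curl http://10.1.1.7:9200/
# the JSON response should report "name" : "es1" and "cluster_name" : "my-app"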

ES2 configuration:

cluster.name: my-app
node.name: es2
node.master: true
node.data: true
path.data: /var/es/data
path.logs: /var/es/logs
network.host: 10.1.1.8
http.port: 9200
transport.tcp.port: 9300
cluster.initial_master_nodes: ["es1"]
discovery.zen.ping.unicast.hosts: ["10.1.1.7:9300","10.1.1.8:9300", "10.1.1.9:9300"]
discovery.zen.minimum_master_nodes: 2 

ES3 configuration:

cluster.name: my-app
node.name: es3
node.master: false
node.data: true
path.data: /var/es/data
path.logs: /var/es/logs
network.host: 10.1.1.9
http.port: 9200
transport.tcp.port: 9300
cluster.initial_master_nodes: ["es1"]
discovery.zen.ping.unicast.hosts: ["10.1.1.7:9300","10.1.1.8:9300", "10.1.1.9:9300"]
discovery.zen.minimum_master_nodes: 2

After cluster startup:

ip          heap.percent ram.percent cpu  load_1m load_5m load_15m node.role master name
10.1.1.7           14          94     0    0.06    0.03     0.05   dilm       *      es1
10.1.1.8           12          97     3    0.14    0.09     0.11   dilm       -      es2
10.1.1.9           12          93     0    0.32    0.08     0.07   dil        -      es3
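
This output comes from the _cat/nodes API; assuming curl is available, it can be requested from any node, for example:

curl http://10.1.1.7:9200/_cat/nodes?v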

The current problem is that as soon as the 10.1.1.7 node goes down, the whole cluster fails.

What's the problem?
First of all, we need enough master-eligible nodes. For a proper production cluster, at least three nodes should be master-eligible, that is, node.master: true must be set in at least three node configuration files.
With three master-eligible nodes, discovery.zen.minimum_master_nodes can be set to (3/2)+1 = 2, meaning at least two master-eligible nodes must take part in an election before a new master can be chosen. In the cluster above there are only three nodes in total, only two of them are master-eligible, and minimum_master_nodes is set to 2 in the configuration files; the moment one master-eligible node is stopped, only one master-eligible node remains, the requirement of two voters cannot be met, and the cluster goes down.
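A minimal shell sketch of that (N/2)+1 quorum arithmetic (node counts purely illustrative):

# quorum = (number of master-eligible nodes / 2) + 1, using integer division
echo $(( 3 / 2 + 1 ))   # 3 master-eligible nodes -> quorum of 2, survives losing one node
echo $(( 2 / 2 + 1 ))   # 2 master-eligible nodes -> quorum of 2, losing one breaks the cluster
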
OK, now we know the basic requirement: at least three master-eligible nodes, of which at least two must participate in the election. Now let's build a new cluster:
ES1:

[root@es1 ~]# cat /etc/elasticsearch/elasticsearch.yml 
cluster.name: my-app
node.name: es1
node.master: true
node.data: true
path.data: /var/es/data
path.logs: /var/es/logs
network.host: 192.168.1.8
http.port: 9200
transport.tcp.port: 9300
cluster.initial_master_nodes: ["es1"]
discovery.zen.ping.unicast.hosts: ["192.168.1.8:9300","192.168.1.9:9300", "192.168.1.10:9300","192.168.1.12:9300"]
discovery.zen.minimum_master_nodes: 2 
discovery.zen.ping_timeout: 100s

ES2:

[root@es2 ~]# cat /etc/elasticsearch/elasticsearch.yml 
cluster.name: my-app
node.name: es2
node.master: true
node.data: true
path.data: /var/es/data
path.logs: /var/es/logs
network.host: 192.168.1.9
http.port: 9200
transport.tcp.port: 9300
cluster.initial_master_nodes: ["es1"]
discovery.zen.ping.unicast.hosts: ["192.168.1.8:9300","192.168.1.9:9300", "192.168.1.10:9300","192.168.1.12:9300"]
discovery.zen.minimum_master_nodes: 2 
discovery.zen.ping_timeout: 100s

ES3:

[root@es3 ~]# cat /etc/elasticsearch/elasticsearch.yml 
cluster.name: my-app
node.name: es3
node.master: true
node.data: true
path.data: /var/es/data
path.logs: /var/es/logs
network.host: 192.168.1.10
http.port: 9200
transport.tcp.port: 9300
cluster.initial_master_nodes: ["es1"]
discovery.zen.ping.unicast.hosts: ["192.168.1.8:9300","192.168.1.9:9300", "192.168.1.10:9300","192.168.1.12:9300"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s

ES4:

[root@es4 ~]# cat /etc/elasticsearch/elasticsearch.yml 
cluster.name: my-app
node.name: es4
node.master: false
node.data: true
path.data: /var/es/data
path.logs: /var/es/logs
network.host: 192.168.1.12
http.port: 9200
transport.tcp.port: 9300
cluster.initial_master_nodes: ["es1"]
discovery.zen.ping.unicast.hosts: ["192.168.1.8:9300","192.168.1.9:9300", "192.168.1.10:9300","192.168.1.12:9300"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
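
The configs now live under /etc/elasticsearch, which suggests a package (RPM/DEB) install; here is a minimal sketch of bringing the nodes up, assuming the service is managed by systemd and using the log path from the config above:

# run on each node (es1 .. es4) in turn
systemctl start elasticsearch
# watch the node join the cluster (the log file name follows the cluster name my-app)
tail -f /var/es/logs/my-app.log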

After the four nodes are started one by one, you can see the following in the es1 node's log:

[2020-03-16T13:11:53,469][INFO ][o.e.c.s.MasterService    ] [es1] node-join[{es2}{CgceuyGlQ2GTcCGqegSz3Q}{oxQ-79VaQxulg07SejH0iQ}{192.168.1.9}{192.168.1.9:9300}{dilm}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 1, version: 20, delta: added {{es2}{CgceuyGlQ2GTcCGqegSz3Q}{oxQ-79VaQxulg07SejH0iQ}{192.168.1.9}{192.168.1.9:9300}{dilm}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true}}
[2020-03-16T13:11:55,058][INFO ][o.e.c.s.ClusterApplierService] [es1] added {{es2}{CgceuyGlQ2GTcCGqegSz3Q}{oxQ-79VaQxulg07SejH0iQ}{192.168.1.9}{192.168.1.9:9300}{dilm}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true}}, term: 1, version: 20, reason: Publication{term=1, version=20}
[2020-03-16T13:11:55,084][INFO ][o.e.c.s.MasterService    ] [es1] node-join[{es3}{i-wCmfCESsCNDr6Vw50Aew}{rxtD1oqtQniz62voo7MPUg}{192.168.1.10}{192.168.1.10:9300}{dilm}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true} join existing leader, {es4}{L221OFajR8-FlaIIYg37Qw}{zvHrSewbQvq3OO14zKHCpg}{192.168.1.12}{192.168.1.12:9300}{dil}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 1, version: 21, delta: added {{es4}{L221OFajR8-FlaIIYg37Qw}{zvHrSewbQvq3OO14zKHCpg}{192.168.1.12}{192.168.1.12:9300}{dil}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true},{es3}{i-wCmfCESsCNDr6Vw50Aew}{rxtD1oqtQniz62voo7MPUg}{192.168.1.10}{192.168.1.10:9300}{dilm}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true}}
[2020-03-16T13:11:56,347][INFO ][o.e.c.s.ClusterApplierService] [es1] added {{es4}{L221OFajR8-FlaIIYg37Qw}{zvHrSewbQvq3OO14zKHCpg}{192.168.1.12}{192.168.1.12:9300}{dil}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true},{es3}{i-wCmfCESsCNDr6Vw50Aew}{rxtD1oqtQniz62voo7MPUg}{192.168.1.10}{192.168.1.10:9300}{dilm}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true}}, term: 1, version: 21, reason: Publication{term=1, version=21}

As the log shows, the other three nodes have joined the cluster bootstrapped by es1. Now you can view the status of each node in the cluster by visiting http://192.168.1.8:9200/_cat/nodes?v:

ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.1.10            8          93   0    0.00    0.13     0.25 dilm      -      es3
192.168.1.8             8          92   0    0.00    0.12     0.24 dilm      *      es1
192.168.1.12           11          93   0    0.00    0.12     0.22 dil       -      es4
192.168.1.9             7          92   0    0.00    0.12     0.24 dilm      -      es2
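
Besides _cat/nodes, the overall cluster health can be checked with the standard _cluster/health API; a minimal example, again assuming curl on the host:

curl http://192.168.1.8:9200/_cluster/health?pretty
# expect "status" : "green" and "number_of_nodes" : 4 once all four nodes have joined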

Now we stop the ES1 node:

ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.1.10            9          93   0    0.00    0.08     0.22 dilm      *      es3
192.168.1.12            9          93   0    0.04    0.09     0.20 dil       -      es4
192.168.1.9            11          92   0    0.00    0.08     0.21 dilm      -      es2

You can see that the master role has been transferred to es3.
With that, a basic highly available ES cluster is in place.
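
For completeness, a minimal sketch of this failover test, assuming systemd-managed installs and curl; _cat/master reports the currently elected master:

# on es1: take the node down
systemctl stop elasticsearch

# on any surviving node: check who is master now
curl http://192.168.1.9:9200/_cat/master?v
curl http://192.168.1.9:9200/_cat/nodes?v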
