Installing Java, Elasticsearch 5.4.x with Chinese word segmentation and synonyms, and Logstash 5.4.x log collection on an Ubuntu 16.x server

Keywords: ElasticSearch, Hadoop, Java

Environment
Ubuntu 16.x server
Memory: minimum 8 GB
LANMPS environment suite (http://www.lanmps.com)
PHP version: 5.6
MySQL version: 5.6
NGINX version: latest
Elasticsearch version: 5.4
Logstash version: 5.4

Java installation

Method 1

The Java version used here is 1.8.0_131.
Install Java by downloading the corresponding version and following the tutorial (Method 3: source installation):
http://blog.csdn.net/fenglailea/article/details/26006647#t6

If the Java installation above is unsuccessful, use the following method instead.

Method 2

If installing Java with Method 1 fails, it reports the following errors:

     Error: could not find libjava.so
     Error: Could not find Java SE Runtime Environment.

After trying many suggested fixes, this could not be resolved.
First delete the Java environment variable configuration file created in Method 1, and then proceed with the following setup.

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo update-java-alternatives -s java-8-oracle

sudo apt-get install oracle-java8-set-default  # Set the environment variables

Check the Java version

java -version
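
If the installation succeeded, the output should look roughly like the following (the exact build suffix depends on the installed update):

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)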

This method comes from http://blog.csdn.net/blueheart20/article/details/50121691

Server User Creation

Create the hadoop user
Run the following commands one at a time:

useradd -m hadoop -s /bin/bash     # Create the hadoop user
passwd hadoop          # Set the password; you will be prompted to enter it twice
usermod -G root hadoop    # Add the user to the root group for administrator privileges

Set administrator or user group permissions
Run the command:

visudo

Add a line for hadoop below the root line, as shown below:

root    ALL=(ALL)       ALL
hadoop    ALL=(ALL)       ALL

Apply the settings

Method 1: Log out of the current user, log in as hadoop, and use su - when root privileges are needed.
Method 2: Restart the system.

All of the following configuration is done as the hadoop user.

The root user cannot start Elasticsearch.

Elasticsearch installation and configuration

http://kibana.logstash.es/content/elasticsearch/
https://es.xiaoleilu.com/

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html

Server notes

If installed on a server:
A minimum of 8 GB of memory is required.
If you have less memory, lower the heap settings in the config/jvm.options file, for example as shown below.
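
A minimal sketch of the heap settings in config/jvm.options, assuming you want to reduce the default 2 GB heap to 1 GB on a smaller machine:

# config/jvm.options (excerpt)
# Set the initial and maximum heap size; keep the two values equal.
-Xms1g
-Xmx1g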

Do not use the root user.
Do not use the root user.
Do not use the root user.

Use the hadoop user created above.

Elasticsearch Download Address

https://www.elastic.co/downloads/elasticsearch

The latest version at the time of writing is 5.4.

As the hadoop user:

cd ~
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.4.0.tar.gz

Extract the archive

tar zxvf elasticsearch-5.4.0.tar.gz

Configure Elasticsearch

Edit config/elasticsearch.yml

cd elasticsearch-5.4.0
vim config/elasticsearch.yml

Change the following settings:

network.host: 0.0.0.0
cluster.name: es   

cluster.name is optional.
... the other settings are left unchanged and do not need to be modified.
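
Note: once network.host is not a loopback address, Elasticsearch 5.x enforces its bootstrap checks at startup. If startup fails complaining about vm.max_map_count or the open-file limit, the usual host-level fixes on Ubuntu look like the sketch below (262144 and 65536 are the minimums Elasticsearch 5.x expects; run these as a sudo-capable user and log in again afterwards):

sudo sysctl -w vm.max_map_count=262144                                  # apply immediately
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf           # persist across reboots

# Raise the open-file limit for the hadoop user
echo "hadoop soft nofile 65536" | sudo tee -a /etc/security/limits.conf
echo "hadoop hard nofile 65536" | sudo tee -a /etc/security/limits.conf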

Environment variable settings

sudo vim /etc/profile.d/elasticsearch.sh

Add:

export ES_HOME=/home/hadoop/elasticsearch-5.4.0
export PATH=$ES_HOME/bin:$PATH

Apply the changes:

. /etc/profile
. /etc/bashrc

Start

cd elasticsearch-5.4.0
bin/elasticsearch     # run in the foreground
# or
bin/elasticsearch -d  # run in the background
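
To confirm Elasticsearch is running, query it from the same machine (9200 is the default HTTP port):

curl http://localhost:9200/
# Should return a small JSON document with the node name, the cluster name ("es") and version "5.4.0".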

Stop

Find process ID

ps -ef |grep elasticsearch

Find the process ID and kill it:

kill -9  id

Chinese word segmentation plugin: analysis-ik

https://github.com/medcl/elasticsearch-analysis-ik/releases

Version: 5.4.0
As the hadoop user:

cd ~
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.4.0/elasticsearch-analysis-ik-5.4.0.zip
unzip elasticsearch-analysis-ik-5.4.0.zip -d elasticsearch-analysis-ik-5.4.0

Move it into the plugins directory

mv elasticsearch-analysis-ik-5.4.0 elasticsearch-5.4.0/plugins/analysis-ik

At this point Elasticsearch must be restarted for the plugin to take effect (you can also wait until the dictionary is set up and restart then).
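
After restarting, you can verify that the plugin was picked up; a quick check is to list the installed plugins, which should include analysis-ik:

cd ~/elasticsearch-5.4.0
bin/elasticsearch-plugin list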

Word segmentation dictionary settings

Go to the Elasticsearch installation directory
Edit the dictionary configuration file

cd ~/elasticsearch-5.4.0
vim plugins/analysis-ik/config/IKAnalyzer.cfg.xml 

Modify the ext_dict line so that it reads as follows:

<entry key="ext_dict">custom/sougou.dic;custom/mydict.dic;custom/single_word_low_freq.dic;custom/product.dic</entry>

custom/product.dic is my own dictionary; its contents are not shared here.
At this point Elasticsearch must be restarted for the changes to take effect.

Hot updating of the segmentation dictionary

If you want hot updates, configure the following (the file at the URL below does not actually exist; it is only an example):

<!-- Users can configure a remote extended dictionary here -->
<entry key="remote_ext_dict">http://www.foxwho.com/thesaurus/word.txt</entry>

The file must be UTF-8 encoded, with one word per line and \n as the line separator.

Official description

https://github.com/medcl/elasticsearch-analysis-ik
The plugin currently supports hot updating of the IK dictionary through the following entries in the IK configuration file mentioned above:

    <!-- Users can configure a remote extended dictionary here -->
    <entry key="remote_ext_dict">location</entry>
    <!-- Users can configure a remote extended stop word dictionary here -->
    <entry key="remote_ext_stopwords">location</entry>

Here location is a URL, such as http://yoursite.com/getCustomDict. The request only needs to satisfy the following two points for the dictionary to hot-update:

The HTTP response must include two headers, Last-Modified and ETag, both strings. When either one changes, the plugin fetches the new word list and updates the dictionary.
The HTTP response body must contain one word per line, with \n as the newline character.
Once these two requirements are met, the dictionary hot-updates without restarting the ES instance.

Hot words that need automatic updating can be placed in a UTF-8 encoded .txt file served by nginx or another simple HTTP server. When the .txt file is modified, the HTTP server automatically returns the corresponding Last-Modified and ETag when clients request the file. A separate tool can then extract the relevant vocabulary from the business system and update this .txt file.
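
As a rough sketch: a static file served by nginx already gets Last-Modified and ETag headers by default (ETag has been on by default since nginx 1.3.3), so a minimal configuration along these lines is usually enough. The server name matches the example URL above; the root path is an assumption for illustration.

server {
    listen 80;
    server_name www.foxwho.com;

    location /thesaurus/ {
        root /data/www;              # serves /data/www/thesaurus/word.txt
        default_type text/plain;
        charset utf-8;
    }
}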

Word segmentation test

curl -XPUT "http://localhost:9200/index"

Test the segmentation effect:
Open the following URL in a browser

http://localhost:9200/index/_analyze?analyzer=ik_max_word&text=中华人民共和国

Result

{
    "tokens": [
        {
            "token": "The People's Republic of China",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "The Chinese people",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "The Chinese people",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "Chinese",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "People's Republic",
            "start_offset": 2,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "the people",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "Republic",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "Republic",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "country",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 8
        }
    ]
}
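
The same check can be run from the command line with curl instead of a browser; a minimal sketch passing the same parameters:

curl -G "http://localhost:9200/index/_analyze" \
     --data-urlencode "analyzer=ik_max_word" \
     --data-urlencode "text=中华人民共和国"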

Logstash 5.X Log Collection and Processing

Download address
https://www.elastic.co/downloads/logstash
Currently the latest version 5.4.0

Version: 5.4.0
As the hadoop user:

Here the tar.gz archive installation is used (i.e. Method 1).

cd ~
wget https://artifacts.elastic.co/downloads/logstash/logstash-5.4.0.tar.gz
tar -zxvf logstash-5.4.0.tar.gz

Test that the installation succeeded:

~/logstash-5.4.0/bin/logstash -e 'input { stdin { } } output { stdout {}}'

If the output looks like the following, the installation succeeded:

The stdin plugin is now waiting for input:
[2017-05-16T21:48:15,233][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600} 

Logstash 5.X configuration

Create a configuration directory
First go to the Logstash root directory

cd ~/logstash-5.4.0
mkdir -p etc
vim etc/www.lanmps.com.conf

Content of the etc/www.lanmps.com.conf file:

input {
  file {
    type => "nginx-access"
    path => ["/www/wwwLogs/www.lanmps.com/*.log"]
    start_position => "beginning"
  }
}

filter {
    grok {
        match => { "message" => "%{IPORHOST:client_ip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] \"(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:http_version})?|-)\" (%{HOSTNAME:domain}|-) %{NUMBER:response} (?:%{NUMBER:bytes}|-) (%{QS:referrer}) %{QS:agent} \"(%{WORD:x_forword}|-)\" (%{URIHOST:upstream_host}|-) (%{NUMBER:upstream_response}|-) (%{WORD:upstream_cache_status}|-) %{QS:upstream_content_type} (%{USERNAME:upstream_response_time}) > (%{USERNAME:response_time})" }
    # The pattern is matched against the "message" field, i.e. each log line read from the file. IPORHOST, HTTPDATE, WORD, NOTSPACE and NUMBER are predefined patterns from patterns/grok-patterns; compare them with the nginx log format below. (?:...|-) expresses a conditional match, similar to a ternary operation. Double quotes and square brackets inside the pattern must be escaped with a backslash.
    }
    kv {
                source => "request"
                field_split => "&?"
                value_split => "="
        }
  # The kv plugin then splits the captured request field into key-value pairs: "&?" is the field separator and "=" is the key-value separator, so the query parameters and their values are collected automatically.
    urldecode {
        all_fields => true
    }
  # urldecode all fields (so Chinese characters display correctly)
}

output {
  elasticsearch {
        hosts => ["10.1.5.66:9200"]
        index => "logstash-%{type}-%{+YYYY.MM.dd}"
        document_type => "%{type}"
  }
}

Configuration description
http://kibana.logstash.es/content/logstash/plugins/input/file.html

Nginx Log Format Definition

log_format access '$remote_addr - $remote_user [$time_local] "$request" $http_host $status $body_bytes_sent "$http_referer" "$http_user_agent" "$http_x_forwarded_for" $upstream_addr $upstream_status $upstream_cache_status "$upstream_http_content_type" $upstream_response_time > $request_time';
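
For reference, this log_format directive belongs in the http block of nginx.conf, and a server block then writes its access log into the directory that the Logstash file input above is watching. A minimal sketch (the log file name is an assumption):

http {
    log_format access '$remote_addr - $remote_user [$time_local] "$request" $http_host $status $body_bytes_sent "$http_referer" "$http_user_agent" "$http_x_forwarded_for" $upstream_addr $upstream_status $upstream_cache_status "$upstream_http_content_type" $upstream_response_time > $request_time';

    server {
        server_name www.lanmps.com;
        access_log /www/wwwLogs/www.lanmps.com/access.log access;
    }
}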

Logstash 5.X Start and Stop

Test command

cd ~/logstash-5.4.0/
bin/logstash -e 'input { stdin { } } output { stdout {codec=>rubydebug} }'

The terminal will then wait for your input. Type Hello World and press Enter, and the event is echoed back, for example:

2017-02-23T08:34:25.661Z c-101 Hello World

Test the configuration file for correctness

cd ~/logstash-5.4.0/
bin/logstash -t -f etc/

Start

This loads all *.conf files in the etc directory and concatenates them in memory into one complete configuration.

cd ~/logstash-5.4.0/
bin/logstash -f etc/

Run in the background

cd ~/logstash-5.4.0/ && nohup bin/logstash -f etc/ &
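
To check that Logstash started, either look for the process as shown in the next step, or query its monitoring API, which listens on port 9600 by default (see the startup message above):

curl http://localhost:9600/
# Returns a small JSON document with the host, version and http_address when Logstash is running.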

Stop

Find process ID

ps -ef |grep logstash

Find the process ID and kill it:

kill -9  id
