Full-text search engine Elasticsearch: basic tutorial

Keywords: Elasticsearch, search engine, Lucene

  • Full-text search is one of the most common requirements, and the open source Elasticsearch is the go-to full-text search engine. It can store, search and analyze massive amounts of data quickly. Wikipedia, Stack Overflow and GitHub all use it.

  • Elasticsearch is built on top of the open source library Lucene. However, you cannot use Lucene directly; you have to write your own code to call its interfaces. Elasticsearch wraps Lucene and exposes a REST API that can be used out of the box.

1 installation

  • This tutorial uses v7.9.1 as an example; the latest version at the time of writing is v7.15.1.

1.1 manual installation

  • Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/targz.html
# Take Linux system as an example

# download
$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz

# Integrity check
$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz.sha512
$ shasum -a 512 -c elasticsearch-7.9.1-linux-x86_64.tar.gz.sha512 

# Extract the archive
$ tar -xzf elasticsearch-7.9.1-linux-x86_64.tar.gz

# Enter the extracted directory
$ cd elasticsearch-7.9.1

# Start Elasticsearch
$ ./bin/elasticsearch
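
To keep Elasticsearch running after you close the terminal, it can also be started as a daemon; the -d and -p flags below are standard startup options.

# Start as a daemon and record the process ID in the file "pid"
$ ./bin/elasticsearch -d -p pid

# Stop the daemon later
$ pkill -F pid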

1.2 k8s deployment

  • Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/7.15/docker.html
# deployment.yaml

# &PROJECT_NAME and &MOUNT_PATH are template placeholders to be replaced before applying
apiVersion: apps/v1
kind: Deployment
metadata:
  name: &PROJECT_NAME-elasticsearch
  namespace: default
spec:
  selector:
    matchLabels:
      name: &PROJECT_NAME-elasticsearch
  replicas: 1
  strategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        author: yunfu
        name: &PROJECT_NAME-elasticsearch
    spec:
      # Pin the pod to one node so the hostPath volumes stay on that node
      nodeName: s03
      containers:
        - image: elasticsearch:7.9.1
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          name: &PROJECT_NAME-elasticsearch
          volumeMounts:
            - mountPath: /usr/share/elasticsearch/data
              name: data
            - mountPath: /usr/share/elasticsearch/plugins
              name: plugins
            - mountPath: /usr/share/elasticsearch/config/analysis
              name: analysis
          env:
            - name: PROJECT_NAME
              value: &PROJECT_NAME
            - name: discovery.type
              value: single-node
            - name: indices.query.bool.max_clause_count
              value: '40960'
          ports:
            - containerPort: 9200
      volumes:
        - name: data
          hostPath:
            path: &MOUNT_PATH/elasticsearch/data
        - name: plugins
          hostPath:
            path: &MOUNT_PATH/elasticsearch/plugins
        - name: analysis
          hostPath:
            path: &MOUNT_PATH/elasticsearch/analysis

---
apiVersion: v1
kind: Service
metadata:
  name: &PROJECT_NAME-elasticsearch
  namespace: default
spec:
  type: NodePort
  selector:
    name: &PROJECT_NAME-elasticsearch
  ports:
    - port: 9200
      targetPort: 9200
      nodePort: 9200
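
Once the &PROJECT_NAME and &MOUNT_PATH placeholders have been replaced with real values, the manifest can be applied with kubectl, for example as below. Note that 9200 is outside the default NodePort range (30000-32767), so either the apiserver must allow that range or the nodePort should be changed.

# Apply the Deployment and Service
$ kubectl apply -f deployment.yaml

# Check that the Elasticsearch pod is running (<project> stands for the substituted placeholder value)
$ kubectl get pods -l name=<project>-elasticsearch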

1.3 verification

If everything is OK, Elasticsearch will be listening on port 9200 by default. Open another command-line window, send a request to that port, and you will get back a description of the node.

GET http://192.168.2.251:9200

# Returns a JSON object containing the current node, cluster, version and other information
{
  "name" : "yftool-db-elasticsearch-78f7896866-6qmsn",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "mb34onKKSr2InWHVNo1aAg",
  "version" : {
    "number" : "7.9.1",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "083627f112ba94dffc1232e8b42b73492789ef91",
    "build_date" : "2020-09-01T21:22:21.964974Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
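
The same check can be done from the command line with curl (assuming the service is reachable at the address used above):

$ curl http://192.168.2.251:9200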

2 basic concepts

2.1 Node and Cluster

  • Elasticsearch is essentially a distributed database that allows multiple servers to work together. Each server can run multiple Elasticsearch instances.

  • A single Elasticsearch instance is called a node.

  • A group of nodes form a cluster.
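
  • The nodes of the current cluster can be listed with the _cat/nodes API, for example:

    $ curl 'http://192.168.2.251:9200/_cat/nodes?v'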

2.2 Index

  • Elasticsearch indexes all fields and, after processing, writes them into an inverted index. When searching for data, it looks up this index directly.

  • Therefore, the top-level unit of data management in Elasticsearch is called an Index. It is the equivalent of a single database.

  • The name of each Index (i.e. database) must be lowercase.

# View all indexes of the current node
GET 'http://192.168.2.251:9200/_cat/indices?v'

health status index           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   news            Du-H_btHQ6e3t3MxJEaAWg   1   1      78565            0    262.8mb        262.8mb
yellow open   nginx           wFbtCIBwRaGEM7TsfBnPdA   1   1     335843            0     10.2mb         10.2mb
yellow open   due_search_news yXeJv7e8T6iyTKSho1Y_8w   1   1       1696            0    374.9kb        374.9kb
yellow open   nginx2          yux1FaBcRCSDpcuGyWGw2Q   1   1      64435            0      2.1mb          2.1mb

2.3 Document

  • A single record in the Index is called a Document. Many documents form an Index.

  • Document is expressed in JSON format. Here is an example.

    {
      "user": "Zhang San",
      "title": "engineer",
      "desc": "Database management"
    }
    
  • Documents in the same Index are not required to have the same schema, but keeping them consistent is better, because it helps search efficiency.
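
  • Elasticsearch infers a mapping (i.e. a schema) for each Index from the documents written to it. You can inspect it with the _mapping API, for example against the news index from the listing above:

    $ curl 'http://192.168.2.251:9200/news/_mapping?pretty'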

2.4 Type

  • Documents can be grouped. For example, in a weather Index they can be grouped by city (Beijing and Shanghai) or by climate (sunny and rainy days). This grouping is called a Type; it is a virtual logical grouping used to filter Documents.

  • Different Types should have similar schemas. For example, the id field cannot be a string in one group and a number in another. This is one difference from tables in a relational database.

  • Data with completely different properties (such as products and logs) should be stored as two Indexes instead of two Types in one Index (although the latter is possible).

  • Elasticsearch 6.x only allows each Index to contain a single Type, and 7.x removes Types entirely (the default _doc is used instead).

3 create and delete Index

  • To create a new Index, send a PUT request directly to the Elasticsearch server. The following example creates a new Index named weather.

    PUT http://192.168.2.251:9200/weather
    
  • The server returns a JSON object in which the acknowledged field indicates that the operation succeeded.

    {
        "acknowledged": true,
        "shards_acknowledged": true,
        "index": "weather"
    }
    
  • Then, we issue a DELETE request to delete the Index.

    DELETE http://192.168.2.251:9200/weather
    
    {
        "acknowledged": true
    }
    

4 Chinese word segmentation settings

  • First, install a Chinese word-segmentation plugin. The ik plugin is used here; you can also consider other plugins (such as smartcn).

    $ ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.9.1/elasticsearch-analysis-ik-7.9.1.zip
    
  • The above command installs version 7.9.1 of the plugin, which matches Elasticsearch 7.9.1.

  • Then restart Elasticsearch, and the newly installed plugin will be loaded automatically.

  • Then create a new Index and specify the fields to be segmented. This step depends on your data structure; the command below is specific to this article. Basically, every Chinese field that will be searched should be configured this way.

    PUT http://192.168.2.251:9200/accounts
    
    {
        "mappings":{
            "properties":{
                "user":{
                    "type":"text",
                    "analyzer":"ik_max_word",
                    "search_analyzer":"ik_max_word"
                },
                "title":{
                    "type":"text",
                    "analyzer":"ik_max_word",
                    "search_analyzer":"ik_max_word"
                },
                "desc":{
                    "type":"text",
                    "analyzer":"ik_max_word",
                    "search_analyzer":"ik_max_word"
                }
            }
        }
    }
    
  • The above request creates an Index named accounts with three fields: user, title and desc.

  • These three fields all hold Chinese content of type text, so a Chinese analyzer must be specified; the default analyzer, which is designed for English, cannot be used.

  • In Elasticsearch the component that splits text is called an analyzer. We specify an analyzer for each field.

"user": {
  "type": "text",
  "analyzer": "ik_max_word",
  "search_analyzer": "ik_max_word"
}
  • In the above code, analyzer is the analyzer applied to the field text when indexing, and search_analyzer is the analyzer applied to search terms. The ik_max_word analyzer is provided by the ik plugin and splits the text into as many words as possible.
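
  • You can check that the plugin is installed and see how ik_max_word segments a piece of text with the _analyze API. A quick sketch (the sample text means "database management"):

    $ ./bin/elasticsearch-plugin list

    $ curl -X POST 'http://192.168.2.251:9200/_analyze?pretty' \
      -H 'Content-Type: application/json' \
      -d '{ "analyzer": "ik_max_word", "text": "数据库管理" }'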

5 data operation

5.1 new records

  • To add a record to an Index, send a PUT request to /Index/_doc/Id. For example, sending a request to /accounts/_doc/1 adds a personnel record.

    PUT http://192.168.2.251:9200/accounts/_doc/1
    {
      "user": "Zhang San",
      "title": "engineer",
      "desc": "Database management"
    }
    
  • The JSON object returned by the server will give information such as Index, Type, Id, Version, etc.

    {
        "_index": "accounts",
        "_type": "_doc",
        "_id": "1",
        "_version": 1,
        "result": "created",
        "_shards": {
            "total": 2,
            "successful": 1,
            "failed": 0
        },
        "_seq_no": 0,
        "_primary_term": 1
    }
    
  • If you look carefully, you will see that the request path is /accounts/_doc/1, where the trailing 1 is the Id of the record. It does not have to be a number; it can be any string (such as abc).

  • When adding a record, you can also omit the Id; in that case, use a POST request instead.

    POST http://192.168.2.251:9200/accounts/_doc
    {
      "user": "Li Si",
      "title": "engineer",
      "desc": "system management"
    }
    
  • The above code sends a POST request to /accounts/_doc to add a record. In this case, the _id field in the JSON object returned by the server is a randomly generated string.

    {
        "_index": "accounts",
        "_type": "_doc",
        "_id": "Ds25knwBPdZtIfHNf2y9",
        "_version": 1,
        "result": "created",
        "_shards": {
            "total": 2,
            "successful": 1,
            "failed": 0
        },
        "_seq_no": 1,
        "_primary_term": 1
    }
    

5.2 viewing records

  • You can view a record by issuing a GET request to /Index/_doc/Id.

    GET http://192.168.2.251:9200/accounts/_doc/1?pretty=true
    
  • The above code requests the record /accounts/_doc/1. The URL parameter pretty=true asks for the response in an easy-to-read format.

  • In the returned data, the found field indicates that the query is successful, and the _source field returns the original record.

    {
        "_index": "accounts",
        "_type": "_doc",
        "_id": "1",
        "_version": 1,
        "_seq_no": 0,
        "_primary_term": 1,
        "found": true,
        "_source": {
            "user": "Zhang San",
            "title": "engineer",
            "desc": "Database management"
        }
    }
    
  • If the Id is incorrect, the data cannot be found, and the found field is false.

    GET http://192.168.2.251:9200/accounts/_doc/-1?pretty=true
    
    # response
    {
        "_index": "accounts",
        "_type": "_doc",
        "_id": "-1",
        "found": false
    }
    

5.3 deleting records

  • To delete a record, issue a DELETE request.

    DELETE http://192.168.2.251:9200/accounts/_doc/1
    

5.4 update records

  • To update a record, resend the data with a PUT request to the same Id.

    PUT http://192.168.2.251:9200/accounts/_doc/1
    {
        "user" : "Zhang San",
        "title" : "engineer",
        "desc" : "Database management, software development"
    }
    
    {
        "_index": "accounts",
        "_type": "_doc",
        "_id": "1",
        "_version": 2,
        "result": "updated",
        "_shards": {
            "total": 2,
            "successful": 1,
            "failed": 0
        },
        "_seq_no": 2,
        "_primary_term": 1
    }
    
  • In the above code, we changed the original data from "database management" to "database management, software development". In the returned results, several fields have changed.

    "_version" : 2,
    "result" : "updated"
    
  • You can see that the Id of the record has not changed, but the version has changed from 1 to 2, and the operation type (result) has changed from created to updated. This time, it is not a new record.
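
  • Note that a PUT to an existing Id replaces the whole document. If you only want to change some fields, Elasticsearch also provides the _update endpoint for partial updates; a minimal sketch with curl:

    $ curl -X POST 'http://192.168.2.251:9200/accounts/_update/1' \
      -H 'Content-Type: application/json' \
      -d '{ "doc": { "desc": "Database management, software development" } }'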

6 data query

6.1 return all records

  • Using the GET method, directly request /Index/_doc/_search, and all records will be returned.

    GET http://192.168.2.251:9200/accounts/_doc/_search
    
    # response
    {
        "took": 1,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": {
                "value": 2,
                "relation": "eq"
            },
            "max_score": 1.0,
            "hits": [
                {
                    "_index": "accounts",
                    "_type": "_doc",
                    "_id": "Ds25knwBPdZtIfHNf2y9",
                    "_score": 1.0,
                    "_source": {
                        "user": "Li Si",
                        "title": "engineer",
                        "desc": "system management"
                    }
                },
                {
                    "_index": "accounts",
                    "_type": "_doc",
                    "_id": "1",
                    "_score": 1.0,
                    "_source": {
                        "user": "Zhang San",
                        "title": "engineer",
                        "desc": "Database management, software development"
                    }
                }
            ]
        }
    }
    
  • In the result above, the took field indicates how long the operation took (in milliseconds), the timed_out field indicates whether the request timed out, and the hits field contains the matching records. The meanings of its sub-fields are as follows:

    • total: returns the number of records. In this example, there are 2 records.
    • max_score: the highest matching degree. In this example, it is 1.0.
    • hits: an array of returned records.
  • Each returned record has a _score field indicating how well it matches the query. By default, results are sorted in descending order of this field.

6.2 full text search

  • Elasticsearch queries are a little unusual: they use Elasticsearch's own query syntax and require a GET request with a data body.

    GET http://192.168.2.251:9200/accounts/_doc/_search
    {
        "query":{
            "match":{
                "desc":"Software"
            }
        }
    }
    
  • The above code uses a Match query; the matching condition is that the desc field contains the word "software". The returned result is as follows:

    {
        "took": 70,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": {
                "value": 1,
                "relation": "eq"
            },
            "max_score": 0.6235748,
            "hits": [
                {
                    "_index": "accounts",
                    "_type": "_doc",
                    "_id": "1",
                    "_score": 0.6235748,
                    "_source": {
                        "user": "Zhang San",
                        "title": "engineer",
                        "desc": "Database management, software development"
                    }
                }
            ]
        }
    }
    
  • Elasticsearch returns 10 results at a time by default. You can change this with the size field.

    GET http://192.168.2.251:9200/accounts/_doc/_search
    {
        "query":{
            "match":{
                "desc":"Software"
            }
        },
        "size": 1
    }
    
  • The above code specifies that only one result is returned at a time.

  • You can also specify the displacement through the from field.

    GET http://192.168.2.251:9200/accounts/_doc/_search
    {
        "query":{
            "match":{
                "desc":"Software"
            }
        },
        "from": 1,
        "size": 1
    }
    
  • The above code specifies that only one result is returned from position 1 (the default is position 0).

6.3 logic operation

  • If there are multiple search keywords, Elasticsearch treats them as an or relationship.

    GET http://192.168.2.251:9200/accounts/_doc/_search
    {
        "query":{
            "match":{
                "desc":"Software system"
            }
        }
    }
    
  • The code above searches for software or system.

  • If you want to perform an and search for multiple keywords, you must use a Boolean query.

    GET http://192.168.2.251:9200/accounts/_doc/_search
    {
      "query": {
        "bool": {
          "must": [
            { "match": { "desc": "Software" } },
            { "match": { "desc": "system" } }
          ]
        }
      }
    }
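
  • The or relationship can also be written explicitly as a bool query with should clauses; a minimal sketch with curl ("minimum_should_match": 1 means at least one clause must match):

    $ curl -X GET 'http://192.168.2.251:9200/accounts/_search?pretty' \
      -H 'Content-Type: application/json' \
      -d '{
        "query": {
          "bool": {
            "should": [
              { "match": { "desc": "Software" } },
              { "match": { "desc": "system" } }
            ],
            "minimum_should_match": 1
          }
        }
      }'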
    
