[Go Web learning notes] Chapter 12 Go and regular expressions

Keywords: Go

Hello, everyone. These are my notes from studying teacher Han Ru's tutorial, with some content removed or modified. I recommend reading the original author's material to learn the subject properly. This article serves only as personal study notes and will be revised over time. Before learning Go Web, you should be familiar with the basic syntax of the Go language and with the fundamentals of computer networking.
Learning links: https://www.chaindesk.cn/witbook/17/253
Reference book: Go Web programming Xie Mengjun

Chapter 12 regular expressions

A regular expression is a complex but powerful tool for pattern matching and text manipulation. Regular expressions are slower than plain text matching, but far more flexible: a pattern built according to the syntax rules can pick out almost any combination of characters from the original text. If you need to obtain data from a text source in Web development, you can extract the meaningful information simply by constructing the right pattern string.

The Go language provides official support for regular expressions through the regexp standard package. If you have used the regular expression functions of other programming languages, the Go version should not feel too unfamiliar, although there are some small differences, because Go implements the RE2 syntax, except for \C. For a detailed syntax description, refer to: http://code.google.com/p/re2/wiki/Syntax

In fact, for string processing we can use the strings package to search (Contains, Index), replace, and parse (Split, Join), but these are simple string operations. Their searches are case sensitive and work only on fixed strings, so they cannot handle text that varies. Of course, if the strings package can solve your problem, use it: its functions are simple, and they perform better and read more clearly than regular expressions.
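A minimal sketch of that trade-off follows. The helper names hasGoLiteral and hasGoAnyCase are made up for illustration; the fixed, case-sensitive search uses strings.Contains, while the case-insensitive one needs a pattern with the (?i) flag.

```go
package main

import (
    "fmt"
    "regexp"
    "strings"
)

// goAnyCase matches "go" in any letter case; (?i) is the case-insensitive flag.
var goAnyCase = regexp.MustCompile(`(?i)go`)

// hasGoLiteral (hypothetical helper) does a fixed, case-sensitive search.
func hasGoLiteral(s string) bool {
    return strings.Contains(s, "Go")
}

// hasGoAnyCase (hypothetical helper) needs a pattern because the case varies.
func hasGoAnyCase(s string) bool {
    return goAnyCase.MatchString(s)
}

func main() {
    fmt.Println(hasGoLiteral("learning Go")) // true
    fmt.Println(hasGoLiteral("learning GO")) // false
    fmt.Println(hasGoAnyCase("learning GO")) // true
    fmt.Println(strings.Split("a,b,c", ",")) // [a b c]
}
```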

If you remember, in the earlier section on form validation we already touched on regular expressions, using them to verify whether input met certain preset conditions. One thing to note is that all characters are UTF-8 encoded. Next, let's learn more about the regexp package of the Go language.

1. Matching with regular expressions

The regexp package contains three functions that report whether a pattern matches the input. They return true on a match and false otherwise.

func Match(pattern string, b []byte) (matched bool, err error)
func MatchReader(pattern string, r io.RuneReader) (matched bool, err error)
func MatchString(pattern string, s string) (matched bool, err error)

The three functions above do the same thing: they report whether the pattern matches the input source, returning true if it does. If the pattern itself fails to parse, they return an error. The input sources of the three functions are a byte slice, a RuneReader and a string, respectively.
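As a small sketch of how the three functions relate, the hypothetical helper matchThreeWays below feeds the same pattern and input to each of them and also surfaces the error returned for an invalid pattern. strings.NewReader is used here only because it satisfies io.RuneReader.

```go
package main

import (
    "fmt"
    "regexp"
    "strings"
)

// matchThreeWays (hypothetical helper) runs the same pattern through all three
// package-level functions and reports whether every one of them matched.
func matchThreeWays(pattern, input string) (bool, error) {
    m1, err := regexp.MatchString(pattern, input)
    if err != nil {
        return false, err
    }
    m2, err := regexp.Match(pattern, []byte(input))
    if err != nil {
        return false, err
    }
    m3, err := regexp.MatchReader(pattern, strings.NewReader(input))
    if err != nil {
        return false, err
    }
    return m1 && m2 && m3, nil
}

func main() {
    ok, err := matchThreeWays(`^h.llo$`, "hello")
    fmt.Println(ok, err) // true <nil>

    // An unbalanced parenthesis makes the pattern invalid, so err is non-nil.
    ok, err = matchThreeWays(`a(b`, "ab")
    fmt.Println(ok, err != nil) // false true
}
```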

Suppose you want to verify whether an input looks like an IP address. See the implementation below:

func IsIP(ip string) (b bool) {
    // The dots are escaped so that they match a literal "." instead of any character
    if m, _ := regexp.MatchString("^[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}$", ip); !m {
        return false
    }
    return true
}
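A pattern of this kind checks only the shape of the input, so a value such as 999.999.999.999 still passes. As a hedged sketch, the hypothetical helper isIPv4 below combines the shape check (with the dots escaped as literal dots) with a numeric range check on each captured group; in real code the standard library's net.ParseIP may be the simpler choice.

```go
package main

import (
    "fmt"
    "regexp"
    "strconv"
)

// ipv4Shape captures four groups of one to three digits separated by literal dots.
var ipv4Shape = regexp.MustCompile(`^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$`)

// isIPv4 (hypothetical helper) checks the shape first, then the 0-255 range.
func isIPv4(s string) bool {
    m := ipv4Shape.FindStringSubmatch(s)
    if m == nil {
        return false
    }
    // m[0] is the whole match; m[1:] holds the four captured groups.
    for _, part := range m[1:] {
        n, err := strconv.Atoi(part)
        if err != nil || n > 255 {
            return false
        }
    }
    return true
}

func main() {
    fmt.Println(isIPv4("192.168.1.1")) // true
    fmt.Println(isIPv4("999.1.1.1"))   // false
    fmt.Println(isIPv4("1a2b3c4"))     // false
}
```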

As you can see, the patterns used by regexp look just like the regular expressions you already know from elsewhere.

Let's take another example: when a user enters a string, we want to know whether it is a legal input:

func IsNum(num string) (b bool) {
    if m, _ := regexp.MatchString("^[0-9]+$", num); m {
        return true
    } else {
        return false
    }
}

In the above two small examples, we use Match(Reader|String) to judge whether some strings meet our description requirements. They are very convenient to use.

2. Extracting content with regular expressions

The Match functions can only report whether a string matches; they cannot extract part of a string, filter it, or collect a batch of matching substrings. For those tasks you need the fuller API of compiled regular expressions.

Crawler programs are a common need, so let's use one as an example to show how regular expressions can filter or extract the fetched data:

package main

import (
    "io/ioutil"
    "regexp"
    "strings"
    "net/http"
    "fmt"
)

func main() {
    resp, err := http.Get("http://www.baidu.com")
    if err != nil {
        fmt.Println("http get error.")
        // Return here: resp is nil on error, so resp.Body below would panic
        return
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        fmt.Println("http read error")
        return
    }
    src := string(body)

    fmt.Println(src)
    fmt.Println("--------------------------------------------------------")
    //Convert all HTML tags to lowercase
    re, _ := regexp.Compile("\\<[\\S\\s]+?\\>")
    src = re.ReplaceAllStringFunc(src, strings.ToLower)
    //Remove STYLE
    re, _ = regexp.Compile("\\<style[\\S\\s]+?\\</style\\>")
    src = re.ReplaceAllString(src, "")
    //Remove SCRIPT
    re, _ = regexp.Compile("\\<script[\\S\\s]+?\\</script\\>")
    src = re.ReplaceAllString(src, "")
    //Remove all HTML code in angle brackets and replace with line breaks
    re, _ = regexp.Compile("\\<[\\S\\s]+?\\>")
    src = re.ReplaceAllString(src, "\n")
    //Remove continuous line breaks
    re, _ = regexp.Compile("\\s{2,}")
    src = re.ReplaceAllString(src, "\n")
    fmt.Println(strings.TrimSpace(src))
}

As this example shows, the first step in using the fuller regexp API is Compile, which checks whether the regular expression is valid. If it is, Compile returns a *Regexp, which you can then use to perform the required operations on any string.

There are several functions for compiling regular expressions:

func Compile(expr string) (*Regexp, error)
func CompilePOSIX(expr string) (*Regexp, error)
func MustCompile(str string) *Regexp
func MustCompilePOSIX(str string) *Regexp

CompilePOSIX differs from Compile in two ways: it restricts the pattern to POSIX ERE (egrep) syntax, and it uses leftmost-longest matching, whereas Compile uses leftmost-first matching. The difference shows up with alternation: for the pattern a|ab applied to the text "ab", Compile returns a, while CompilePOSIX returns ab. For a pattern such as [a-z]{2,4} applied to the text "aa09aaa88aaaa", both return aa, because the leftmost match is the same under either rule. The functions with the Must prefix panic if the pattern string is not valid, whereas those without it only return an error.
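To see the two matching rules side by side, here is a small sketch. The pattern a|ab and the helper names firstMatches and compilePanics are chosen for illustration and do not come from the original tutorial.

```go
package main

import (
    "fmt"
    "regexp"
)

// firstMatches (hypothetical helper) returns the first match of expr in s under
// leftmost-first (Compile) and leftmost-longest (CompilePOSIX) matching.
func firstMatches(expr, s string) (first, longest string, err error) {
    re, err := regexp.Compile(expr)
    if err != nil {
        return "", "", err
    }
    rePOSIX, err := regexp.CompilePOSIX(expr)
    if err != nil {
        return "", "", err
    }
    return re.FindString(s), rePOSIX.FindString(s), nil
}

// compilePanics (hypothetical helper) reports whether MustCompile panics on expr.
func compilePanics(expr string) (panicked bool) {
    defer func() {
        if recover() != nil {
            panicked = true
        }
    }()
    regexp.MustCompile(expr)
    return false
}

func main() {
    first, longest, _ := firstMatches(`a|ab`, "ab")
    fmt.Println(first, longest) // a ab

    // Without alternation the leftmost match is the same under both rules.
    first, longest, _ = firstMatches(`[a-z]{2,4}`, "aa09aaa88aaaa")
    fmt.Println(first, longest) // aa aa

    fmt.Println(compilePanics(`a(b`)) // true
}
```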

After knowing how to create a Regexp, let's take a look at the methods provided by this struct to assist us in operating strings. First, let's look at the following functions for searching:

func (re *Regexp) Find(b []byte) []byte
func (re *Regexp) FindAll(b []byte, n int) [][]byte
func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
func (re *Regexp) FindAllString(s string, n int) []string
func (re *Regexp) FindAllStringIndex(s string, n int) [][]int
func (re *Regexp) FindAllStringSubmatch(s string, n int) [][]string
func (re *Regexp) FindAllStringSubmatchIndex(s string, n int) [][]int
func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
func (re *Regexp) FindIndex(b []byte) (loc []int)
func (re *Regexp) FindReaderIndex(r io.RuneReader) (loc []int)
func (re *Regexp) FindReaderSubmatchIndex(r io.RuneReader) []int
func (re *Regexp) FindString(s string) string
func (re *Regexp) FindStringIndex(s string) (loc []int)
func (re *Regexp) FindStringSubmatch(s string) []string
func (re *Regexp) FindStringSubmatchIndex(s string) []int
func (re *Regexp) FindSubmatch(b []byte) [][]byte
func (re *Regexp) FindSubmatchIndex(b []byte) []int

The 18 methods above differ mainly in their input source (byte slice, string, or io.RuneReader), so they reduce to the following eight. The others behave the same way apart from the input type:

func (re *Regexp) Find(b []byte) []byte
func (re *Regexp) FindAll(b []byte, n int) [][]byte
func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
func (re *Regexp) FindIndex(b []byte) (loc []int)
func (re *Regexp) FindSubmatch(b []byte) [][]byte
func (re *Regexp) FindSubmatchIndex(b []byte) []int

For the use of these functions, let's take the following example:

package main

import (
    "regexp"
    "fmt"
)

func main() {
    a := "I am learning Go language"
    re, _ := regexp.Compile("[a-z]{2,4}")
    // Find returns the first match
    one := re.Find([]byte(a))
    fmt.Println("Find:", string(one))
    // FindAll returns every match as a slice; if n is negative all matches are returned, otherwise at most n
    all := re.FindAll([]byte(a), -1)
    fmt.Println("FindAll", all)
    // FindIndex returns the start and end positions of the first match
    index := re.FindIndex([]byte(a))
    fmt.Println("FindIndex", index)
    // FindAllIndex returns the positions of every match; n works as above
    allindex := re.FindAllIndex([]byte(a), -1)
    fmt.Println("FindAllIndex", allindex)
    re2, _ := regexp.Compile("am(.*)lang(.*)")

    // FindSubmatch returns a slice: element 0 is the whole match, element 1 is the first () group, element 2 is the second () group
    // In the output below, the first element is "am learning Go language"
    // The second element is " learning Go " (note the leading and trailing spaces)
    // The third element is "uage"
    submatch := re2.FindSubmatch([]byte(a))
    fmt.Println("FindSubmatch", submatch)
    for _, v := range submatch {
        fmt.Println(string(v))
    }
    // FindSubmatchIndex is like FindIndex above, but returns index pairs for the whole match and for each group
    submatchindex := re2.FindSubmatchIndex([]byte(a))
    fmt.Println(submatchindex)
    // FindAllSubmatch returns the submatches of every match
    submatchall := re2.FindAllSubmatch([]byte(a), -1)
    fmt.Println(submatchall)
    // FindAllSubmatchIndex returns the submatch index pairs of every match
    submatchallindex := re2.FindAllSubmatchIndex([]byte(a), -1)
    fmt.Println(submatchallindex)
}

Operation results:

Find: am
FindAll [[97 109] [108 101 97 114] [110 105 110 103] [108 97 110 103] [117 97 103 101]]
FindIndex [2 4]
FindAllIndex [[2 4] [5 9] [9 13] [17 21] [21 25]]
FindSubmatch [[97 109 32 108 101 97 114 110 105 110 103 32 71 111 32 108 97 110 103 117 97 103 101] [32 108 101 97 114 110 105 110 103 32 71 111 32] [117 97 103 101]]
am learning Go language
 learning Go 
uage
[2 25 4 17 21 25]
[[[97 109 32 108 101 97 114 110 105 110 103 32 71 111 32 108 97 110 103 117 97 103 101] [32 108 101 97 114 110 105 110 103 32 71 111 32] [117 97 103 101]]]
[[2 25 4 17 21 25]]
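The string variants behave the same way and save the conversions between []byte and string. A brief sketch on the same sentence, where the helper names firstN and groups are hypothetical:

```go
package main

import (
    "fmt"
    "regexp"
)

var (
    word  = regexp.MustCompile(`[a-z]{2,4}`)
    parts = regexp.MustCompile(`am(.*)lang(.*)`)
)

// firstN (hypothetical helper) returns at most n matches of word; n < 0 means all.
func firstN(s string, n int) []string {
    return word.FindAllString(s, n)
}

// groups (hypothetical helper) returns the whole match followed by each () group.
func groups(s string) []string {
    return parts.FindStringSubmatch(s)
}

func main() {
    a := "I am learning Go language"
    fmt.Println(word.FindString(a))      // am
    fmt.Println(firstN(a, 2))            // [am lear]
    fmt.Println(firstN(a, -1))           // [am lear ning lang uage]
    fmt.Println(word.FindStringIndex(a)) // [2 4]
    fmt.Printf("%q\n", groups(a))        // ["am learning Go language" " learning Go " "uage"]
}
```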

As mentioned earlier, Regexp also defines three methods that behave exactly like the package-level functions of the same name. In fact, the package-level functions are implemented by calling these three methods of Regexp.

func (re *Regexp) Match(b []byte) bool

func (re *Regexp) MatchReader(r io.RuneReader) bool

func (re *Regexp) MatchString(s string) bool
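Because the package-level functions compile the pattern on every call, a pattern that is used repeatedly is often compiled once and then matched through these methods. A minimal sketch, assuming a package-level variable named digits and a hypothetical helper isNum:

```go
package main

import (
    "fmt"
    "regexp"
    "strings"
)

// digits is compiled once at start-up and reused by every call to isNum.
var digits = regexp.MustCompile(`^[0-9]+$`)

// isNum (hypothetical helper) uses the method form, so the pattern is not
// compiled again on each call.
func isNum(s string) bool {
    return digits.MatchString(s)
}

func main() {
    fmt.Println(isNum("12345"))                               // true
    fmt.Println(isNum("12a45"))                               // false
    fmt.Println(digits.Match([]byte("42")))                   // true
    fmt.Println(digits.MatchReader(strings.NewReader("4x2"))) // false
}
```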

Next, let's look at how the replacement functions work:

func (re *Regexp) ReplaceAll(src, repl []byte) []byte

func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte

func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte

func (re *Regexp) ReplaceAllLiteralString(src, repl string) string

func (re *Regexp) ReplaceAllString(src, repl string) string

func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string

The web page fetching example above already shows these replacement functions in use.
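As a compact illustration of how the variants differ, the sketch below runs three of them against the same input. The pattern and the helper names swap, literal and upper are assumptions made for this example: ReplaceAllString expands $1 and $2 to the captured groups, ReplaceAllLiteralString inserts the replacement text verbatim, and ReplaceAllStringFunc passes each match through a function.

```go
package main

import (
    "fmt"
    "regexp"
    "strings"
)

// pair captures the text before and after an "@" sign.
var pair = regexp.MustCompile(`(\w+)@(\w+)`)

// swap (hypothetical helper) expands $2 and $1 to the captured groups.
func swap(s string) string {
    return pair.ReplaceAllString(s, "$2:$1")
}

// literal (hypothetical helper) inserts the replacement text as-is.
func literal(s string) string {
    return pair.ReplaceAllLiteralString(s, "$2:$1")
}

// upper (hypothetical helper) transforms each whole match with a function.
func upper(s string) string {
    return pair.ReplaceAllStringFunc(s, strings.ToUpper)
}

func main() {
    s := "alice@example, bob@test"
    fmt.Println(swap(s))    // example:alice, test:bob
    fmt.Println(literal(s)) // $2:$1, $2:$1
    fmt.Println(upper(s))   // ALICE@EXAMPLE, BOB@TEST
}
```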

Next, let's take a look at the explanation of Expand:

func (re *Regexp) Expand(dst []byte, template []byte, src []byte, match []int) []byte

func (re *Regexp) ExpandString(dst []byte, template string, src string, match []int) []byte

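So what is Expand for? It appends the template to dst, replacing each $name or ${name} reference in the template with the corresponding submatch of src described by match. Take the following example: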

package main

import (
    "regexp"
    "fmt"
)

func main() {
    src := []byte(`
        call hello alice
        hello bob
        call hello eve
        `)
    pat := regexp.MustCompile(`(?m)(call)\s+(?P<cmd>\w+)\s+(?P<arg>.+)\s*$`)
    res := []byte{}
    for _, s := range pat.FindAllSubmatchIndex(src, -1) {
        res = pat.Expand(res, []byte("$cmd('$arg')\n"), src, s)
    }
    fmt.Println(string(res))
}

Operation results:

hello('alice')
hello('eve')
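ExpandString works the same way on string input. The sketch below, built around a hypothetical helper expandFirst, also shows why ${name} is the safer template form: in $val_of_ the underscores are read as part of the group name, the name val_of_ does not exist in the pattern, and so it expands to nothing.

```go
package main

import (
    "fmt"
    "regexp"
)

// kv captures a key=value pair using named groups.
var kv = regexp.MustCompile(`(?P<key>\w+)=(?P<val>\w+)`)

// expandFirst (hypothetical helper) expands template against the first match of
// kv in src, or returns "" when there is no match.
func expandFirst(template, src string) string {
    loc := kv.FindStringSubmatchIndex(src)
    if loc == nil {
        return ""
    }
    return string(kv.ExpandString(nil, template, src, loc))
}

func main() {
    fmt.Println(expandFirst("${val}_of_$key", "color=red")) // red_of_color
    // "$val_of_" is parsed as the unknown group name "val_of_", which is empty.
    fmt.Println(expandFirst("$val_of_$key", "color=red")) // color
}
```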

That completes our introduction to the regexp package of the Go language. Having gone through its main functions with examples, you should now be able to carry out basic regular expression operations with it.

Posted by lkq on Mon, 22 Nov 2021 08:34:15 -0800