Use of Go Scanner and Source Code Analysis

Keywords: Go

Brief introduction

The Go standard library's bufio.Scanner is, as its name suggests, a scanner: it continuously reads data from an io.Reader into an internal buffer and lets you inject a split function to customize how that data is broken into tokens. The library also ships four predefined split functions, listed below; a short sketch using one of them follows the list.

  • ScanLines: splits on newline characters ('\n'), returning each line as a token
  • ScanWords: returns each space-separated word as a token
  • ScanRunes: returns each single UTF-8-encoded rune as a token
  • ScanBytes: returns each single byte as a token
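
For instance, here is a minimal sketch using the predefined ScanWords split function (the input string is just an illustration):

package main

import (
    "bufio"
    "fmt"
    "strings"
)

func main() {
    // Scan a fixed string word by word with the predefined ScanWords split function.
    scanner := bufio.NewScanner(strings.NewReader("go scanner split example"))
    scanner.Split(bufio.ScanWords)
    for scanner.Scan() {
        fmt.Println(scanner.Text()) // prints one word per line
    }
    if err := scanner.Err(); err != nil {
        fmt.Println("error:", err)
    }
}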

Usage

Before looking at how to use the scanner, we first need to understand one function type.

type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)

This function receives a byte slice together with an atEOF flag that reports whether the underlying reader has any more data. It returns three values: advance, the number of bytes to advance the input by (usually the length of the token plus its delimiter); token, the token itself; and err, any error.
The split function looks for the delimiter in the data it is given. If the delimiter is not found, it can return (0, nil, nil); Scan sees that return value, reads more data, and calls the split function again with the longer, still-unfinished input. Once the delimiter is found, the split function returns the token and how far to advance. Here's a simple example:

package main

import (
    "bufio"
    "fmt"
    "strings"
)

func main() {
    input := "abcend234234234"
    fmt.Println(strings.Index(input, "end"))
    scanner := bufio.NewScanner(strings.NewReader(input))
    scanner.Split(ScanEnd)
    // Start with a 2-byte buffer so the scanner reads at most 2 bytes at a time;
    // when the buffer is too small for a token, the scanner doubles it,
    // up to bufio.MaxScanTokenSize.
    buf := make([]byte, 2)
    scanner.Buffer(buf, bufio.MaxScanTokenSize)
    for scanner.Scan() {
        fmt.Println("output:", scanner.Text())
    }
    if scanner.Err() != nil {
        fmt.Printf("error: %s\n", scanner.Err())
    }
}

func ScanEnd(data []byte, atEOF bool) (advance int, token []byte, err error) {
    // At EOF with no data left: nothing more to return.
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }
    // Look for the custom end marker "end".
    index := strings.Index(string(data), "end")
    if index >= 0 {
        // Found: advance past the token plus the 3-byte marker,
        // return the bytes before the marker as the token, and no error.
        return index + 3, data[0:index], nil
    }
    // At EOF without a marker: return whatever is left as the final token.
    if atEOF {
        return len(data), data, nil
    }
    // Not found yet: return (0, nil, nil) so Scan reads more data.
    return 0, nil, nil
}

Walking through the example with the input "abcend234234234" and an initial 2-byte buffer:

  • First read: buf = "ab"; no "end" found, so ScanEnd returns (0, nil, nil)
  • Second read: buf = "abce" (buffer doubled to 4 bytes); still no "end", return (0, nil, nil)
  • Third read: buf = "abcend23" (buffer doubled to 8 bytes); "end" is found, ScanEnd returns (6, "abc", nil) and the scanner outputs "abc"
  • Fourth read: the already-consumed bytes are shifted out and the buffer is refilled to its 8-byte size, so buf = "23423423"; no "end", return (0, nil, nil)
  • Fifth read: the buffer doubles again, the last remaining byte is read, and EOF is reached; with atEOF set, ScanEnd returns all the remaining data, so "234234234" is output

The program prints:

3
output: abc
output: 234234234

(The leading 3 comes from the strings.Index call at the top of main.) The scanner thus emits tokens according to the configured buffer size and the custom token terminator.

Source code view

type Scanner struct {
    r            io.Reader // the reader provided by the client
    split        SplitFunc // the split function, injected from outside
    maxTokenSize int       // maximum size of a token
    token        []byte    // the last token returned by split
    buf          []byte    // the buffer used as an argument to split
    start        int       // first unprocessed byte in buf
    end          int       // end of the data in buf
    err          error     // sticky error
    empties      int       // count of successive empty tokens
    scanCalled   bool      // Scan has been called; the buffer is in use
    done         bool      // the scan has finished
}
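
The buf and maxTokenSize fields above are what the Buffer method configures; if Buffer is never called, the scanner falls back to a default buffer and bufio.MaxScanTokenSize. As a rough sketch of how maxTokenSize bounds a token (the input string and the deliberately tiny 8-byte limit are assumptions for illustration):

package main

import (
    "bufio"
    "fmt"
    "strings"
)

func main() {
    // A single token longer than the configured maximum makes Scan fail with
    // bufio.ErrTooLong instead of growing the buffer forever.
    scanner := bufio.NewScanner(strings.NewReader("this-token-is-longer-than-the-limit"))
    scanner.Buffer(make([]byte, 2), 8) // 2-byte initial buffer, 8-byte maximum token size
    scanner.Split(bufio.ScanWords)
    for scanner.Scan() {
        fmt.Println(scanner.Text()) // never reached for this input
    }
    fmt.Println(scanner.Err()) // bufio.Scanner: token too long
}

With those fields in mind, the Scan method itself drives the whole loop: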

func (s *Scanner) Scan() bool {
    if s.done {
        return false
    }
    s.scanCalled = true
    // Loop until we obtain a token.
    for {
        if s.end > s.start || s.err != nil {
            // Call the split function: it returns how far to advance the input, the token (if any), and any error.
            advance, token, err := s.split(s.buf[s.start:s.end], s.err != nil)
            if err != nil {
                if err == ErrFinalToken {
                    s.token = token
                    s.done = true
                    return true
                }
                s.setErr(err)
                return false
            }
            if !s.advance(advance) {
                return false
            }
            s.token = token
            if token != nil {
                if s.err == nil || advance > 0 {
                    s.empties = 0
                } else {
                    // Returning tokens not advancing input at EOF.
                    s.empties++
                    if s.empties > 100 {
                        panic("bufio.Scan: 100 empty tokens without progressing")
                    }
                }
                return true
            }
        }
        // We cannot produce a token from what we are holding; if we've already hit EOF or an error, stop.
        if s.err != nil {
            // Shut it down.
            s.start = 0
            s.end = 0
            return false
        }
        // Must read more data: first shift the unprocessed data to the start of the buffer if there is a lot of empty space or space is needed.
        if s.start > 0 && (s.end == len(s.buf) || s.start > len(s.buf)/2) {
            copy(s.buf, s.buf[s.start:s.end])
            s.end -= s.start
            s.start = 0
        }
        // If the buffer is full, allocate a new buffer twice the size (bounded by maxTokenSize) and copy the data over.
        if s.end == len(s.buf) {
            const maxInt = int(^uint(0) >> 1)
            if len(s.buf) >= s.maxTokenSize || len(s.buf) > maxInt/2 {
                s.setErr(ErrTooLong)
                return false
            }
            newSize := len(s.buf) * 2
            if newSize == 0 {
                newSize = startBufSize
            }
            if newSize > s.maxTokenSize {
                newSize = s.maxTokenSize
            }
            newBuf := make([]byte, newSize)
            copy(newBuf, s.buf[s.start:s.end])
            s.buf = newBuf
            s.end -= s.start
            s.start = 0
        }
        // Finally, read more data into the buffer; give up if too many consecutive reads return no data and no error.
        for loop := 0; ; {
            n, err := s.r.Read(s.buf[s.end:len(s.buf)])
            s.end += n
            if err != nil {
                s.setErr(err)
                break
            }
            if n > 0 {
                s.empties = 0
                break
            }
            loop++
            if loop > maxConsecutiveEmptyReads {
                s.setErr(io.ErrNoProgress)
                break
            }
        }
    }
}
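
The err == ErrFinalToken branch above gives a split function a way to deliver one last token and stop the scan early without reporting an error. A minimal sketch, where the stop word "STOP" and the helper scanUntilStop are assumptions for illustration:

package main

import (
    "bufio"
    "fmt"
    "strings"
)

// scanUntilStop scans words but treats the word "STOP" as the final token.
func scanUntilStop(data []byte, atEOF bool) (advance int, token []byte, err error) {
    advance, token, err = bufio.ScanWords(data, atEOF)
    if err == nil && string(token) == "STOP" {
        // bufio.ErrFinalToken tells Scan to return this token and then stop.
        return advance, token, bufio.ErrFinalToken
    }
    return advance, token, err
}

func main() {
    scanner := bufio.NewScanner(strings.NewReader("one two STOP three four"))
    scanner.Split(scanUntilStop)
    for scanner.Scan() {
        fmt.Println(scanner.Text()) // one, two, STOP; "three four" is never scanned
    }
    fmt.Println("err:", scanner.Err()) // err: <nil>, because ErrFinalToken is not reported as an error
}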

Summary

The source code and the example above show how the scanner works; in real use, of course, it will not just scan a fixed string. Buffered IO provides a temporary staging area for data. On the write side, data accumulates in the buffer and is only handed to the underlying writer once it reaches a certain size, which greatly reduces the number of write operations, and therefore the number of system calls, that are ultimately triggered; when IO is frequent this saves a lot of overhead. For reads, buffered IO means each operation can pull in more data at once, which not only reduces the number of system calls but also uses the underlying hardware more efficiently by reading disk data in blocks.
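
As a rough illustration of that point, a bufio.Writer collects many small writes in memory and passes them to the underlying writer in larger chunks; the 4 KB buffer size and the loop below are just assumptions for the sketch:

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    // Wrap stdout in a 4 KB buffer: the thousand small writes below are
    // accumulated in memory and reach the underlying writer in far fewer,
    // larger Write calls (and therefore far fewer system calls).
    w := bufio.NewWriterSize(os.Stdout, 4096)
    for i := 0; i < 1000; i++ {
        fmt.Fprintf(w, "line %d\n", i)
    }
    // Flush pushes whatever is still buffered to the underlying writer.
    if err := w.Flush(); err != nil {
        fmt.Fprintln(os.Stderr, "flush error:", err)
    }
}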

Posted by dmarquard on Tue, 23 Apr 2019 21:12:34 -0700