Lexical Analysis and Parsing in Go, Part Two (Translation)

Keywords: Go, GitHub

Author: Adam Presley | Address: https://adampresley.github.io...

Translator's Preface

This article covers the implementation of the lexer. If you run into difficulties while reading, it is recommended to refer to the source code. The code snippets in this article serve to introduce the ideas; parsing will be covered in the next article.

I recently took a brief look at the Go source code. In the src/go directory there are several packages; token, scanner, and parser hold the core of Go's own lexing and parsing implementation. Opening the token directory, you will find many similarities between that source code and the content described in the previous article.

Due to a heavy workload recently, I have not been able to update as quickly as I would like. Besides this series, I have listed links to other related articles below. If you are comfortable reading English, you can read them on your own.

A look at Go lexer/scanner packages
Rob Pike's Functional Way
Handwritten Parser & Lexers In Go

The translation is as follows:

In the first article of this series (English original), I introduced some basic concepts of lexical analysis and parsing, along with the basic structure of an INI file. We then created some related structures and constants to prepare for implementing the INI text parser.

This article goes into the details of the lexical analysis itself.

Lexing is the process of transforming input text into a stream of tokens. Tokens are units smaller than the text itself; combined in the right order, they produce meaningful content such as programs and configuration files.

For the INI files in this series, the tokens are the left bracket, right bracket, section name, key, value, and equals sign. Combine them in the right order and you have an INI file. The lexer's job is to read the contents of the INI file, analyze it to create tokens, and send those tokens to the parser over a channel.
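For instance, given the minimal input below (a hypothetical illustration, not from the original article; token names follow this article's usage), the lexer would emit this sequence of tokens:

[settings]
language=Go

TOKEN_LEFT_BRACKET  "["
TOKEN_SECTION       "settings"
TOKEN_RIGHT_BRACKET "]"
TOKEN_KEY           "language"
TOKEN_EQUAL_SIGN    "="
TOKEN_VALUE         "Go"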

Lexical analyzer

To convert text into tokens, we need to track some information: the input text, the current position in that text, and the start and end of the token currently being analyzed.

Once a token is complete, we send it to the parser, which can be done through a channel.

We also need a way to track the lexer's state. Rob Pike has talked about using functions to represent the lexer's current and next expected state. Simply put, a function processes one token and returns the next state function, which will produce the next expected token. I will refer to these as state functions.

Let's take an example.

A section in an INI file consists of three parts: a left bracket, the section name, and a right bracket. The first state function emits a left-bracket token and returns the state function for the section name; that function handles the section-name logic and returns the state function for the right bracket. The overall order is left bracket -> section name -> right bracket.

Let's look at the lexer structure:

Lexer.go

type Lexer struct {
  Name   string
  Input  string                // the input text
  Tokens chan lexertoken.Token // channel used to send tokens to the parser
  State  LexFn                 // the state function mentioned above

  Start int // start position of the current token; its end is Start + len(token)
  Pos   int // current position in the input; when a token is confirmed, this is its end position
  Width int // width of the last rune read
}

LexFn.go

type LexFn func(*Lexer) LexFn // a lexer state function type; it returns the state function for the next expected token

In the previous article we defined the Token structure. LexFn is the state function type used to produce tokens.
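As a refresher, here is a sketch of what the lexertoken package referenced throughout this article might contain. The names are taken from this article's usage; the exact definitions live in the accompanying source code:

package lexertoken

type TokenType int

type Token struct {
    Type  TokenType
    Value string
}

const EOF rune = 0

const (
    LEFT_BRACKET  string = "["
    RIGHT_BRACKET string = "]"
    EQUAL_SIGN    string = "="
    NEWLINE       string = "\n"
)

const (
    TOKEN_ERROR TokenType = iota
    TOKEN_EOF
    TOKEN_LEFT_BRACKET
    TOKEN_RIGHT_BRACKET
    TOKEN_SECTION
    TOKEN_KEY
    TOKEN_VALUE
    TOKEN_EQUAL_SIGN
)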

Now let's give our lexer some capabilities. The Lexer does the text processing needed to produce the next token, so we add some useful methods to it, such as reading runes, skipping whitespace, and so on. These are basically simple text-processing helpers.

/*
Puts a token onto the token channel. The value of this token is
read from the input based on the current lexer position.
*/
func (this *Lexer) Emit(tokenType lexertoken.TokenType) {
    this.Tokens <- lexertoken.Token{Type: tokenType, Value: this.Input[this.Start:this.Pos]}
    this.Start = this.Pos
}

/*
Increment the position. When the end of the input is reached, emit an EOF token.
*/
func (this *Lexer) Inc() {
    this.Pos++
    if this.Pos >= utf8.RuneCountInString(this.Input) {
        this.Emit(lexertoken.TOKEN_EOF)
    }
}

/*
Return a slice of the input from the current lexer position
to the end of the input string.
*/
func (this *Lexer) InputToEnd() string {
    return this.Input[this.Pos:]
}

/*
Skips whitespace until we get something meaningful
*/
func (this *Lexer) SkipWhitespace() {
    for {
        ch := this.Next()

        // Check for EOF before the space test; unicode.IsSpace(EOF) is
        // false, so testing it second would make this branch unreachable.
        if ch == lexertoken.EOF {
            this.Emit(lexertoken.TOKEN_EOF)
            break
        }

        if !unicode.IsSpace(ch) {
            this.Dec()
            break
        }
    }
}
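The methods above also call a few helpers that this article never shows: Next, Dec, IsEOF, and Errorf. Below is a minimal sketch of what they might look like, inferred from how they are used here; the real definitions are in the accompanying source. (For the ASCII-only INI input in this series, byte and rune positions coincide.)

// Next reads the next rune from the input and advances the position.
// (Assumes "fmt" and "unicode/utf8" are imported in Lexer.go.)
func (this *Lexer) Next() rune {
    if this.Pos >= len(this.Input) {
        this.Width = 0
        return lexertoken.EOF
    }
    result, width := utf8.DecodeRuneInString(this.Input[this.Pos:])
    this.Width = width
    this.Pos += width
    return result
}

// Dec steps back over the rune most recently read by Next.
func (this *Lexer) Dec() {
    this.Pos -= this.Width
}

// IsEOF reports whether the lexer has consumed all of its input.
func (this *Lexer) IsEOF() bool {
    return this.Pos >= len(this.Input)
}

// Errorf sends a TOKEN_ERROR to the parser and ends the state loop
// by returning nil as the next state function.
func (this *Lexer) Errorf(format string, args ...interface{}) LexFn {
    this.Tokens <- lexertoken.Token{
        Type:  lexertoken.TOKEN_ERROR,
        Value: fmt.Sprintf(format, args...),
    }
    return nil
}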

The key point to understand is how tokens are read and emitted. The main steps are as follows:

First, characters are read until a complete token can be confirmed. For example, the state function for the section name can only confirm it once the right bracket is read.
Next, the token text and token type are sent to the parser through the channel.
Finally, the next expected state function is determined and returned.

Let's first define a starter function. It is also the entry point the parser will use (next article). It initializes a Lexer and gives it its first state function.

What might the first expected token be? A special symbol, or a keyword?

In our example the first state function gets a generic name, LexBegin, because an INI file can begin with a section, but it can also begin with a key/value pair and no section at all. LexBegin handles both cases.

/*
Start a new lexer with a given input string. This returns the
instance of the lexer, whose Tokens channel carries the tokens.
Reading this stream is the way to parse a given input and
perform processing.
*/
func BeginLexing(name, input string) *lexer.Lexer {
    l := &lexer.Lexer{
        Name: name,
        Input: input,
        State: lexer.LexBegin,
        Tokens: make(chan lexertoken.Token, 3),
    }

    return l
}
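Note that the Tokens channel is buffered (capacity 3), so the lexer can run a few steps ahead of its consumer. Following Rob Pike's design, a NextToken method can drive the state machine on demand, alternately draining the channel and running the current state function. A sketch of that idea (the actual method lives in the lexer source):

/*
NextToken runs state functions until a token becomes available
on the channel, then returns it. Callers should stop asking for
tokens once TOKEN_EOF or TOKEN_ERROR is seen, since the state
function becomes nil after that.
*/
func (this *Lexer) NextToken() lexertoken.Token {
    for {
        select {
        case token := <-this.Tokens:
            return token
        default:
            this.State = this.State(this)
        }
    }
}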

Start

The first state function is LexBegin.

/*
This lexer function starts everything off. It determines if we are
beginning with a key/value assignment or a section.
*/
func LexBegin(lexer *Lexer) LexFn {
    lexer.SkipWhitespace()
    if strings.HasPrefix(lexer.InputToEnd(), lexertoken.LEFT_BRACKET) {
        return LexLeftBracket
    } else {
        return LexKey
    }
}

As you can see, we first skip all whitespace; in an INI file, whitespace is meaningless. Next we check whether the remaining input starts with a left bracket: if it does, LexLeftBracket is returned; otherwise it must be a key, so the LexKey state function is returned.

Section

Let's start with the processing logic for sections.

The section name in an INI file is wrapped in left and right brackets, and key/value pairs can be organized under a section. In LexBegin, if a left bracket is found, the LexLeftBracket function is returned.

The code for LexLeftBracket is as follows:

/*
This lexer function emits a TOKEN_LEFT_BRACKET then returns
the lexer for a section header.
*/
func LexLeftBracket(lexer *Lexer) LexFn {
    lexer.Pos += len(lexertoken.LEFT_BRACKET)
    lexer.Emit(lexertoken.TOKEN_LEFT_BRACKET)
    return LexSection
}

The code is simple! The lexer's position is advanced by the length of the bracket (which is 1), and then a TOKEN_LEFT_BRACKET is sent to the channel.

For this token the text content doesn't carry any meaning. When Emit finishes, the start position is set to the lexer's current position, ready for the next token. Finally, the state function for processing the section name, LexSection, is returned.

/*
This lexer function emits a TOKEN_SECTION with the name of an
INI file section header.
*/
func LexSection(lexer *Lexer) LexFn {
    for {
        if lexer.IsEOF() {
            return lexer.Errorf(errors.LEXER_ERROR_MISSING_RIGHT_BRACKET)
        }

        if strings.HasPrefix(lexer.InputToEnd(), lexertoken.RIGHT_BRACKET) {
            lexer.Emit(lexertoken.TOKEN_SECTION)
            return LexRightBracket
        }

        lexer.Inc()
    }
}

The logic here is a little more involved, but the basic idea is the same.

The loop advances through the characters until it encounters RIGHT_BRACKET (the right bracket), which marks the end of the section name. If EOF is reached first, the INI file is malformed, so we produce an error and send it to the parser through the channel. Otherwise, the loop continues until the right bracket is found, and then TOKEN_SECTION and the corresponding text are emitted.

The state function returned by LexSection is LexRightBracket. Its logic is similar to LexLeftBracket; the difference is that it returns LexBegin, because what follows may be an empty section, another section, or key/value pairs.

/*
This lexer function emits a TOKEN_RIGHT_BRACKET then returns
the LexBegin state function.
*/
func LexRightBracket(lexer *Lexer) LexFn {
    lexer.Pos += len(lexertoken.RIGHT_BRACKET)
    lexer.Emit(lexertoken.TOKEN_RIGHT_BRACKET)
    return LexBegin
}

Key/Value

Moving on to key/value processing. The form is very simple: key=value.

First, the key. Similar to LexSection, we loop until an equals sign is encountered, which confirms a complete key. Then Emit sends the key, and the state function LexEqualSign is returned.

/*
This lexer function emits a TOKEN_KEY with the name of a
key that will be assigned a value.
*/
func LexKey(lexer *Lexer) LexFn {
    for {
        if strings.HasPrefix(lexer.InputToEnd(), lexertoken.EQUAL_SIGN) {
            lexer.Emit(lexertoken.TOKEN_KEY)
            return LexEqualSign
        }

        lexer.Inc()
        if lexer.IsEOF() {
            return lexer.Errorf(errors.LEXER_ERROR_UNEXPECTED_EOF)
        }
    }
}

Processing the equals sign is very simple, much like the left and right brackets. We send a TOKEN_EQUAL_SIGN token directly to the parser and return LexValue.

/*
This lexer function emits a TOKEN_EQUAL_SIGN then returns
the lexer for the value.
*/
func LexEqualSign(lexer *Lexer) LexFn {
    lexer.Pos += len(lexertoken.EQUAL_SIGN)
    lexer.Emit(lexertoken.TOKEN_EQUAL_SIGN)

    return LexValue
}

The final state function is LexValue, which processes the value part of a key/value pair. A value is confirmed complete when a newline character is encountered. It then returns LexBegin to continue the next round of analysis.

/*
This lexer function emits a TOKEN_VALUE with the value to be assigned
to a key.
*/
func LexValue(lexer *Lexer) LexFn {
    for {
        if strings.HasPrefix(lexer.InputToEnd(), lexertoken.NEWLINE) {
            lexer.Emit(lexertoken.TOKEN_VALUE)
            return LexBegin
        }

        lexer.Inc()

        if lexer.IsEOF() {
            return lexer.Errorf(errors.LEXER_ERROR_UNEXPECTED_EOF)
        }
    }
}

Next

In Part 3, the final article of this series, we will show how to create a basic parser that turns the tokens from the lexer into the structured data we expect.
