Ruby 2.x Source Code Learning: Lexical Analysis

Keywords: Ruby C

Preface

Ruby does not use LEX for lexical analysis. Instead it uses a handwritten lexical analyzer, combined with YACC (BISON) for syntax analysis. The relevant source code is in the parse.y file (the YACC grammar description).

Parsing identifiers

parser_yylex in parse.y is the entry point of the lexical analyzer. At the end of that function, parse_ident is called to parse an identifier.

static int parser_yylex(struct parser_params *parser) {
    ...
    return parse_ident(parser, c, cmd_state);
}

Let's first look at the local variable declaration at the beginning of the parse_ident function:

static int parse_ident(struct parser_params *parser, int c, int cmd_state) {
    int result = 0;
    const enum lex_state_e last_state = lex_state;
    ID ident;
}
  • result holds the token type that the function returns to YACC. It can be tIDENTIFIER, tCONSTANT or tLABEL.

  • last_state saves the lexer state (lex_state) as it was on entry to the function.

  • ident stores the identifier's internal representation (its ID) in the Ruby interpreter.

The function starts with a do-while loop that collects the characters making up the identifier:

do {
    if (!ISASCII(c)) mb = ENC_CODERANGE_UNKNOWN;
    if (tokadd_mbchar(c) == -1)
        return 0;
    c = nextc();
} while (parser_is_identchar());

if ((c == '!' || c == '?') && !peek('=')) {
    tokadd(c);
} else {
    pushback(c);
}
tokfix();
  • tokadd_mbchar, tokadd: append the character c to the internal token buffer

  • parser_is_identchar: tests whether the current character is a valid identifier character

  • pushback: pushes the character back onto the input so that it is read again later

  • tokfix: appends a 0 (the C string terminator) to the end of the token buffer
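
The helpers above can be imitated in a few lines of standalone C. The following is only a sketch: struct mini_lexer and every mini_* function are hypothetical names invented for this illustration, only ASCII identifiers are handled (there is no counterpart of tokadd_mbchar), and the caller is assumed to have checked that the first character starts an identifier.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical miniature scanner state; Ruby's real parser_params is far larger. */
struct mini_lexer {
    const char *src;   /* NUL-terminated input */
    size_t pos;        /* index of the next unread character */
    char tokenbuf[64]; /* token buffer */
    size_t tokidx;     /* number of characters stored so far */
};

/* Read one character; 0 signals end of input. */
static int mini_nextc(struct mini_lexer *lx) {
    int c = (unsigned char)lx->src[lx->pos];
    if (c != 0) lx->pos++;
    return c;
}

/* Un-read the last character (nothing to undo at end of input). */
static void mini_pushback(struct mini_lexer *lx, int c) {
    if (c != 0) lx->pos--;
}

/* Append a character to the token buffer; -1 if the buffer is full. */
static int mini_tokadd(struct mini_lexer *lx, int c) {
    if (lx->tokidx + 1 >= sizeof lx->tokenbuf) return -1;
    lx->tokenbuf[lx->tokidx++] = (char)c;
    return 0;
}

static int mini_is_identchar(int c) {
    return isalnum(c) || c == '_';
}

/* Collect one identifier starting at the current position.
 * Returns its length, or -1 if it does not fit in the buffer. */
static int mini_scan_ident(struct mini_lexer *lx) {
    int c = mini_nextc(lx);
    lx->tokidx = 0;
    do {
        if (mini_tokadd(lx, c) == -1) return -1;
        c = mini_nextc(lx);
    } while (mini_is_identchar(c));

    /* A trailing '!' or '?' belongs to the name unless '=' follows ("a != b"). */
    if ((c == '!' || c == '?') && lx->src[lx->pos] != '=') {
        if (mini_tokadd(lx, c) == -1) return -1;
    } else {
        mini_pushback(lx, c);
    }
    lx->tokenbuf[lx->tokidx] = '\0'; /* the job tokfix does */
    return (int)lx->tokidx;
}
```

With the input "empty? x" the sketch collects "empty?" and leaves the read position on the space. With "a!=b" it collects only "a" and pushes the '!' back, so the operator can be scanned next.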

Let's skip the part of parse_ident that handles keywords and other checks, and look at the end of the function:

ident = tokenize_ident(parser, last_state);
if (!IS_lex_state_for(last_state, EXPR_DOT|EXPR_FNAME) &&
    (result == tIDENTIFIER) && /* not EXPR_FNAME, not attrasgn */
    lvar_defined(ident)) {
    SET_LEX_STATE(EXPR_END|EXPR_LABEL);
}
return result;

tokenize_ident adds the identifier to the interpreter's internal symbol table:

static ID tokenize_ident(struct parser_params *parser, const enum lex_state_e last_state) {
    ID ident = TOK_INTERN();
    set_yylval_name(ident);
    return ident;
}

TOK_INTERN is a macro definition:

/* parse.y */

#ifdef RIPPER
#define intern_cstr(n,l,en) rb_intern3(n,l,en)
#else
#define intern_cstr(n,l,en) rb_intern3(n,l,en)
#endif

#define TOK_INTERN() intern_cstr(tok(), toklen(), current_enc)

The definitions of tok and toklen can be found in parse.c, which is generated from parse.y:

#define tokenbuf (parser->tokenbuf)
#define tokidx (parser->tokidx)

#define tok() tokenbuf
#define toklen() tokidx

This style of using macros to wrap access to structure fields and functions is ubiquitous in the Ruby source code.
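
To make the macro style concrete, here is a small self-contained sketch. struct mini_params, mini_append and mini_token are hypothetical names made up for this example; only the four macro names mirror the ones discussed above.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct parser_params; only two fields are modeled. */
struct mini_params {
    char tokenbuf[32];
    size_t tokidx;
};

/* Field-access macros: they only work where a variable named `parser`
 * is in scope, which is the convention this style relies on. */
#define tokenbuf (parser->tokenbuf)
#define tokidx   (parser->tokidx)

/* Function-like wrappers layered on top of the field macros. */
#define tok()    tokenbuf
#define toklen() tokidx

/* Append one character through the macros; returns the new token length. */
static size_t mini_append(struct mini_params *parser, char c) {
    if (toklen() + 1 < sizeof tokenbuf) {
        tok()[toklen()] = c;   /* (parser->tokenbuf)[(parser->tokidx)] = c */
        tokidx++;
        tok()[toklen()] = '\0';
    }
    return toklen();
}

/* Read access goes through the same macros. */
static const char *mini_token(const struct mini_params *parser) {
    return tok();
}
```

One consequence of this design is that, once the macros are defined, writing p.tokenbuf anywhere later in the file expands to p.(parser->tokenbuf) and no longer compiles, so all access has to go through code that has a parser variable in scope.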

Let's move on to the rb_intern3 function.

/* symbol.c */
ID rb_intern3(const char *name, long len, rb_encoding *enc) {
    VALUE sym;
    struct RString fake_str;
    VALUE str = rb_setup_fake_str(&fake_str, name, len, enc);
    OBJ_FREEZE(str);

    sym = lookup_str_sym(str);
    if (sym) return rb_sym2id(sym);
    str = rb_enc_str_new(name, len, enc); /* make true string */
    return intern_str(str, 1);
}
  • rb_setup_fake_str sets up a temporary ("fake") Ruby String object in the local RString structure fake_str, wrapping name

  • lookup_str_sym uses that string to look up the symbol in the symbol table

  • If a symbol is found, it is converted to an ID and returned directly

  • Otherwise, rb_enc_str_new is called to create a "real" string, and intern_str inserts it into the symbol table
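
The lookup-first, allocate-on-miss pattern described in these steps can be sketched without any Ruby internals. Everything below is an assumption made for illustration: mini_id, struct mini_symtab, mini_lookup and mini_intern are invented names, and the table is a linearly searched array rather than the hash-based table Ruby uses.

```c
#include <stdlib.h>
#include <string.h>

typedef unsigned long mini_id;   /* hypothetical counterpart of Ruby's ID */

#define MINI_MAX_SYMS 128

/* Toy symbol table: a flat array searched linearly. */
struct mini_symtab {
    char *names[MINI_MAX_SYMS];  /* heap copies owned by the table */
    size_t lens[MINI_MAX_SYMS];
    size_t count;
};

/* Look up a name by borrowed pointer and length; nothing is allocated.
 * Returns the ID (index + 1), or 0 if the name is not interned yet. */
static mini_id mini_lookup(const struct mini_symtab *t,
                           const char *name, size_t len) {
    for (size_t i = 0; i < t->count; i++) {
        if (t->lens[i] == len && memcmp(t->names[i], name, len) == 0)
            return (mini_id)(i + 1);
    }
    return 0;
}

/* Intern a name: reuse the existing ID if present, otherwise copy and insert.
 * Returns 0 on allocation failure or when the table is full. */
static mini_id mini_intern(struct mini_symtab *t,
                           const char *name, size_t len) {
    mini_id id = mini_lookup(t, name, len);  /* cheap path: borrowed key */
    if (id) return id;
    if (t->count >= MINI_MAX_SYMS) return 0;

    char *copy = malloc(len + 1);            /* real copy only on a miss */
    if (!copy) return 0;
    memcpy(copy, name, len);
    copy[len] = '\0';
    t->names[t->count] = copy;
    t->lens[t->count] = len;
    t->count++;
    return (mini_id)t->count;
}
```

Interning "foo" twice returns the same ID and allocates only once. That is the point of probing with a borrowed key first: identifiers that recur many times in a source file cost one lookup each instead of one allocation each.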

Let's go back to the tokenize_ident function:

static ID tokenize_ident(struct parser_params *parser, const enum lex_state_e last_state) {
    ID ident = TOK_INTERN();
    set_yylval_name(ident);
    return ident;
}

After calling the TOK_INTERN macro to save the identifier to the symbol table, set_yylval_name(ident) sets yylval:

#ifndef RIPPER
...
# define set_yylval_name(x)  (yylval.id = (x))
...
#else
...
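
As a rough sketch of what this assignment achieves: the lexer returns the token type as its result, while the token's value travels to the grammar through yylval. The union, the token numbers and all mini_* names below are hypothetical and chosen only for this illustration.

```c
typedef unsigned long mini_id;   /* hypothetical counterpart of Ruby's ID */

/* Hypothetical semantic-value union, in the spirit of a yacc %union. */
union mini_yystype {
    mini_id id;   /* interned identifier */
    long    num;  /* integer literal */
};

/* Token type numbers are arbitrary values picked for this sketch. */
enum mini_token { MINI_tIDENTIFIER = 1000, MINI_tINTEGER };

/* The value slot shared between lexer and parser. */
static union mini_yystype mini_yylval;

#define mini_set_yylval_name(x) (mini_yylval.id = (x))

/* Report an identifier: value goes into yylval, type is the return value. */
static int mini_lex_ident(mini_id ident) {
    mini_set_yylval_name(ident);
    return MINI_tIDENTIFIER;
}
```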

Parsing keywords

Keyword handling lives mainly in the lex.c source file. The comment at the top of lex.c shows that the file is generated automatically by gperf:

/* C code produced by gperf version 3.0.4 */
/* Command-line: gperf -C -P -p -j1 -i 1 -g -o -t -N rb_reserved_word -k'1,3,$' defs/keywords  */

Keyword string pool stringpool_t

The stringpool_t structure encapsulates the keyword string pool.
Because the code is generated automatically by gperf, the member names are hard-coded with numeric suffixes such as str8.

struct stringpool_t
{
    char stringpool_str8[sizeof("break")];
    char stringpool_str9[sizeof("else")];
    char stringpool_str10[sizeof("nil")];
    char stringpool_str11[sizeof("ensure")];
    char stringpool_str12[sizeof("end")];
    char stringpool_str13[sizeof("then")];
    char stringpool_str14[sizeof("not")];
    char stringpool_str15[sizeof("false")];
    char stringpool_str16[sizeof("self")];
    char stringpool_str17[sizeof("elsif")];
    char stringpool_str18[sizeof("rescue")];
    char stringpool_str19[sizeof("true")];
    char stringpool_str20[sizeof("until")];
    char stringpool_str21[sizeof("unless")];
    char stringpool_str22[sizeof("return")];
    char stringpool_str23[sizeof("def")];
    char stringpool_str24[sizeof("and")];
    char stringpool_str25[sizeof("do")];
    char stringpool_str26[sizeof("yield")];
    char stringpool_str27[sizeof("for")];
    char stringpool_str28[sizeof("undef")];
    char stringpool_str29[sizeof("or")];
    char stringpool_str30[sizeof("in")];
    char stringpool_str31[sizeof("when")];
    char stringpool_str32[sizeof("retry")];
    char stringpool_str33[sizeof("if")];
    char stringpool_str34[sizeof("case")];
    char stringpool_str35[sizeof("redo")];
    char stringpool_str36[sizeof("next")];
    char stringpool_str37[sizeof("super")];
    char stringpool_str38[sizeof("module")];
    char stringpool_str39[sizeof("begin")];
    char stringpool_str40[sizeof("__LINE__")];
    char stringpool_str41[sizeof("__FILE__")];
    char stringpool_str42[sizeof("__ENCODING__")];
    char stringpool_str43[sizeof("END")];
    char stringpool_str44[sizeof("alias")];
    char stringpool_str45[sizeof("BEGIN")];
    char stringpool_str46[sizeof("defined?")];
    char stringpool_str47[sizeof("class")];
    char stringpool_str50[sizeof("while")];
  };

The stringpool_contents variable is an instance of the string pool:

static const struct stringpool_t stringpool_contents =
  {
    "break",
    "else",
    "nil",
    "ensure",
    "end",
    "then",
    "not",
    "false",
    "self",
    "elsif",
    "rescue",
    "true",
    "until",
    "unless",
    "return",
    "def",
    "and",
    "do",
    "yield",
    "for",
    "undef",
    "or",
    "in",
    "when",
    "retry",
    "if",
    "case",
    "redo",
    "next",
    "super",
    "module",
    "begin",
    "__LINE__",
    "__FILE__",
    "__ENCODING__",
    "END",
    "alias",
    "BEGIN",
    "defined?",
    "class",
    "while"
  };
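
A reduced sketch may help show how such a pool is used. It assumes, as the lookup function in lex.c suggests, that table entries store integer offsets into the pool rather than pointers. The three-keyword pool and all mini_* names are invented for this example.

```c
#include <stddef.h>

/* Hypothetical three-keyword pool: one char array per string. */
struct mini_pool_t {
    char str0[sizeof("if")];
    char str1[sizeof("end")];
    char str2[sizeof("while")];
};

static const struct mini_pool_t mini_pool_contents = { "if", "end", "while" };

/* View the whole struct as one flat block of bytes. */
#define mini_pool ((const char *)&mini_pool_contents)

/* Each entry is an int offset into the pool instead of a char pointer. */
static const int mini_offsets[] = {
    (int)offsetof(struct mini_pool_t, str0),
    (int)offsetof(struct mini_pool_t, str1),
    (int)offsetof(struct mini_pool_t, str2),
};

/* Resolve an entry index to the keyword text. */
static const char *mini_keyword_at(size_t i) {
    return mini_pool + mini_offsets[i];
}
```

Storing offsets keeps the table free of pointers, so it contains nothing that the dynamic linker has to relocate when the code is loaded as part of a shared library.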

Determining whether a string is a keyword

The rb_reserved_word function determines whether the string str of length len is a keyword:

  • If the length of the string is outside the range of keyword lengths, return 0 immediately.

  • Compute the hash key from str and len. The key indexes wordlist, whose name field is an offset into the stringpool_contents shown above.

  • If the key is out of range, return 0 immediately.

  • Compare str with the keyword stored at that offset and return the wordlist entry if they match.

const struct kwtable *rb_reserved_word(str, len)
    register const char *str;
    register unsigned int len;
{
    if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH) {
        register int key = hash (str, len);
        if (key <= MAX_HASH_VALUE && key >= 0) {
            register int o = wordlist[key].name;
            if (o >= 0) {
                register const char *s = o + stringpool;

                if (*str == *s && !strcmp (str + 1, s + 1))
                    return &wordlist[key];
            }
        }
    }
    return 0;
}
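
The same four steps can be tried out on a much smaller keyword set. This is a sketch only: the hash below is a toy formula hand-picked so that eight sample keywords get distinct keys, whereas gperf derives its own hash function from the complete keyword list. All mini_* and MINI_* names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define MINI_MIN_WORD_LENGTH 2
#define MINI_MAX_WORD_LENGTH 5
#define MINI_MAX_HASH_VALUE  23

/* Toy hash: length plus first and last character, shifted so that the
 * smallest keyword hash ("end") becomes 0. Not the hash gperf generates. */
static int mini_hash(const char *str, size_t len) {
    return (int)len + (unsigned char)str[0]
         + (unsigned char)str[len - 1] - 204;
}

/* Slot i holds the keyword whose hash is i; unused slots stay NULL. */
static const char *const mini_wordlist[MINI_MAX_HASH_VALUE + 1] = {
    [0] = "end", [1] = "def", [2] = "else", [5] = "if",
    [9] = "do", [17] = "nil", [21] = "while", [23] = "or",
};

/* Return the table entry if str (NUL-terminated, length len) is one of
 * the sample keywords, else NULL. */
static const char *mini_reserved_word(const char *str, size_t len) {
    if (len <= MINI_MAX_WORD_LENGTH && len >= MINI_MIN_WORD_LENGTH) {
        int key = mini_hash(str, len);
        if (key <= MINI_MAX_HASH_VALUE && key >= 0) {
            const char *s = mini_wordlist[key];
            /* The hash only nominates a candidate; strcmp confirms it. */
            if (s && strcmp(str, s) == 0)
                return s;
        }
    }
    return NULL;
}
```

The final comparison is what rejects near misses: "fi" hashes to the same slot as "if" but fails the strcmp, so it is correctly reported as not being a keyword.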

Posted by raymie7 on Sat, 06 Apr 2019 20:51:30 -0700