Introduction to spython, a Python interpreter

Keywords: C++ Python Programming JSON git

Brief introduction

In 2016, out of personal interest and practical need, I became interested in Python's interpreter and decided to re-implement one myself. From my own game server development I have a lot of experience embedding Lua and Python in C++, and I believe I understand the strengths and weaknesses of both well. Python's biggest advantage over Lua is that it is a complete programming language: it has classes and modules, a rich set of libraries, and easy-to-use string operations, so many things can be implemented elegantly in Python. Lua's biggest advantage is that it is compact and efficient, and a program can hold multiple lua_State instances, so Lua can be used from multiple threads. The Python interpreter cannot provide multiple independent interpreter instances because of its global interpreter lock (GIL). In embedded scenarios the Python features actually used are fairly simple and general, such as classes, modules, and functions, while the more complex libraries are rarely needed. So I wanted to implement an interpreter that has no global interpreter lock and can have multiple interpreter instances.
At the end of 2016 I implemented the first version of the interpreter. It executes the abstract syntax tree (AST) directly. Although I made the necessary optimizations, the performance is still painful to look at. I used to complain that Python does not run as fast as Lua, but complaining is one thing and implementing it yourself is quite another. I analyzed the problem carefully and found that the first version is slow because the approach itself is wrong. Python's virtual machine compiles the syntax tree into bytecode, and a virtual machine (VM) then interprets that bytecode in a loop. VMs come in two styles: stack-based and register-based. Python's VM is stack-based, while Lua's is register-based. Register-based designs are the current trend, and this is an important reason why Lua runs faster. My first VM ran the AST directly, which was the wrong path, so it could never be fast.
I still branched off this first version and shared it, because when I implemented the register-based VM I found I could not make its design as elegant and direct as the AST-walking one. Executing the AST directly is really intuitive. Although its efficiency is low, it still has practical value. For example, tools such as Protocol Buffers and Thrift generate code from grammar definition files and do not need fast parsing, so this version of the VM is a useful reference in such areas.
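
To make the stack-versus-register distinction concrete, here is a minimal sketch that evaluates a + b * c in both styles. It is an illustration only: the opcodes, struct names, and functions are invented for this example and are not the instruction set of spython, CPython, or Lua.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stack-machine instructions (illustration only).
enum class StackOp { Push, Add, Mul };
struct StackInstr { StackOp op; int64_t operand; };

// Stack style: operands are implicit. Each arithmetic instruction pops its
// two inputs from the stack and pushes its result back.
int64_t runStack(const std::vector<StackInstr>& code) {
    std::vector<int64_t> stack;
    for (const StackInstr& in : code) {
        if (in.op == StackOp::Push) {
            stack.push_back(in.operand);
            continue;
        }
        assert(stack.size() >= 2);
        int64_t rhs = stack.back(); stack.pop_back();
        int64_t lhs = stack.back(); stack.pop_back();
        stack.push_back(in.op == StackOp::Add ? lhs + rhs : lhs * rhs);
    }
    assert(!stack.empty());
    return stack.back();
}

// Hypothetical register-machine instructions (illustration only).
enum class RegOp { Add, Mul };
struct RegInstr { RegOp op; int dst, lhs, rhs; };

// Register style: every instruction names its source and destination slots,
// so no separate push/pop instructions are needed.
int64_t runRegister(const std::vector<RegInstr>& code,
                    std::vector<int64_t> regs, int resultReg) {
    for (const RegInstr& in : code) {
        int64_t a = regs[in.lhs];
        int64_t b = regs[in.rhs];
        regs[in.dst] = (in.op == RegOp::Add) ? a + b : a * b;
    }
    return regs[resultReg];
}

// a + b * c as a stack program: five instructions.
std::vector<StackInstr> stackProgram(int64_t a, int64_t b, int64_t c) {
    return {{StackOp::Push, a}, {StackOp::Push, b}, {StackOp::Push, c},
            {StackOp::Mul, 0}, {StackOp::Add, 0}};
}

// The same expression as a register program: two instructions,
// assuming a, b, c already live in r0, r1, r2 and r3 is a temporary.
std::vector<RegInstr> registerProgram() {
    return {{RegOp::Mul, 3, 1, 2}, {RegOp::Add, 3, 0, 3}};
}
```

With a = 2, b = 3, c = 4, both runStack(stackProgram(2, 3, 4)) and runRegister(registerProgram(), {2, 3, 4, 0}, 3) return 14. The register version dispatches two instructions instead of five, which hints at why register-based VMs tend to spend less time in the interpreter loop.
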
Internal implementation level:

Python BNF

When it comes to implementing a script interpreter, I suspect many people scratch their heads and wonder where to start. At first I did too. I dug my compiler principles textbook out of the pile under the bed, where it had long been banished, and read through it, but I still did not find much of a clue. Later I browsed python.org and downloaded Python's source code to study it. The source tree contains a BNF description file for Python's grammar. Since I had already been through compiler principles once, I could read the BNF without trouble, and after reading it from beginning to end everything became clear. The BNF is a complete description of how to parse Python's grammar. Here is a short excerpt to illustrate:

compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
while_stmt: 'while' test ':' suite ['else' ':' suite]
for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]
try_stmt: ('try' ':' suite
         ((except_clause ':' suite)+
          ['else' ':' suite]
          ['finally' ':' suite] |
         'finally' ':' suite))
with_stmt: 'with' with_item (',' with_item)*  ':' suite
with_item: test ['as' expr]
# NB compile.c makes sure that the default except clause is last
except_clause: 'except' [test [('as' | ',') test]]
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

Put simply, Python's grammar BNF is described recursively from the top down. The top-level rule in this excerpt is compound_stmt (compound statement), which has several alternatives: if, while, for, try, with, function definition, class definition, and decorated definition. The rules that follow, starting with if_stmt, define the grammar of each alternative. When implementing Python parsing in C++, the parser can therefore try the rules from top to bottom according to this BNF, and report an error if the input does not satisfy the grammar. To get a code structure that mirrors the BNF, I wrote a Python script that parses the BNF and automatically generates the C++ parsing functions. An example of the generated C++ code is as follows:

class Parser{
public:
  ExprASTPtr parse(Scanner& scanner);

  //! single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
  ExprASTPtr parse_single_input();
  //! file_input: (NEWLINE | stmt)* ENDMARKER
  ExprASTPtr parse_file_input();
  //! eval_input: testlist NEWLINE* ENDMARKER
  ExprASTPtr parse_eval_input();
  //! decorator: '@' dotted_name [ '(' [arglist] ')' ] NEWLINE
  ExprASTPtr parse_decorator();
  //! decorators: decorator+
  ExprASTPtr parse_decorators();
  //! decorated: decorators (classdef | funcdef)
  ExprASTPtr parse_decorated();
  //! funcdef: 'def' NAME parameters ':' suite
  ExprASTPtr parse_funcdef();
  //! parameters: '(' [varargslist] ')'
  ExprASTPtr parse_parameters();
  //! varargslist: ((fpdef ['=' test] ',')*
  //!               fpdef ['=' test] (',' fpdef ['=' test])* [','])
  ExprASTPtr parse_varargslist();
  //! fpdef: NAME | '(' fplist ')'
  ExprASTPtr parse_fpdef();
  //! fplist: fpdef (',' fpdef)* [',']
  ExprASTPtr parse_fplist();
  //! stmt: simple_stmt | compound_stmt
  ExprASTPtr parse_stmt();
  //! simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
  ExprASTPtr parse_simple_stmt();
  //! small_stmt: (expr_stmt | print_stmt  | del_stmt | pass_stmt | flow_stmt |
  //!              import_stmt | global_stmt | exec_stmt | assert_stmt)
  ExprASTPtr parse_small_stmt();
  //! expr_stmt: testlist (augassign (yield_expr|testlist) |
  ExprASTPtr parse_expr_stmt();
.................................
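
As an illustration of how one BNF rule maps onto one parse function, here is a minimal recursive-descent sketch. It is not spython's generated code: MiniParser, Node, and the helper functions are hypothetical names, the tokens are plain strings, and the rule is a simplified if_stmt in which test and suite are each reduced to a single NAME.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, much-simplified AST node: a label plus child nodes.
struct Node {
    std::string label;
    std::vector<std::unique_ptr<Node>> children;
};
using NodePtr = std::unique_ptr<Node>;

// Simplified rule (assumption for this sketch):
//   if_stmt: 'if' NAME ':' NAME ('elif' NAME ':' NAME)* ['else' ':' NAME]
class MiniParser {
public:
    explicit MiniParser(std::vector<std::string> tokens)
        : toks_(std::move(tokens)), pos_(0) {}

    // One function per grammar rule; its body follows the rule left to right.
    NodePtr parse_if_stmt() {
        expect("if");
        NodePtr node = makeNode("if_stmt");
        parseBranch(*node);                          // NAME ':' NAME
        while (accept("elif")) parseBranch(*node);   // ( ... )* becomes a loop
        if (accept("else")) {                        // [ ... ] becomes an if
            expect(":");
            node->children.push_back(makeNode("else -> " + next()));
        }
        return node;
    }

private:
    static NodePtr makeNode(const std::string& label) {
        NodePtr n(new Node());
        n->label = label;
        return n;
    }
    void parseBranch(Node& parent) {
        std::string cond = next();
        expect(":");
        std::string body = next();
        parent.children.push_back(makeNode(cond + " -> " + body));
    }
    // Consume and return the next token, or fail at end of input.
    std::string next() {
        if (pos_ >= toks_.size()) throw std::runtime_error("unexpected end of input");
        return toks_[pos_++];
    }
    // Consume the token only if it matches.
    bool accept(const std::string& s) {
        if (pos_ < toks_.size() && toks_[pos_] == s) { ++pos_; return true; }
        return false;
    }
    // Require the token; report a syntax error otherwise.
    void expect(const std::string& s) {
        if (!accept(s)) throw std::runtime_error("expected '" + s + "'");
    }

    std::vector<std::string> toks_;
    std::size_t pos_;
};
```

Feeding it the tokens if a : x elif b : y else : z produces an if_stmt node with three children, while a sequence that breaks the rule (for example if a x) raises std::runtime_error. Repetition in the BNF turns into a loop and an optional part turns into a conditional, which is why the generated code can follow the grammar so closely.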

Implementation of Scanner

The scanner is responsible for tokenizing Python code, splitting the source into Token objects. The Token object is defined as follows:

struct Token{
  Token():nTokenType(0), nVal(0), fVal(0.0), nLine(0){
  }

  std::string dump() const;
  int             nTokenType;
  int64_t         nVal;
  double          fVal;
  std::string     strVal;
  int             nLine;
};

enum ETokenType {
    TOK_EOF = 0,
    //TOK_DEF = -2,
    TOK_VAR = -4,
    TOK_INT = -5,
    TOK_FLOAT = -6,
    TOK_STR = -7,
    TOK_CHAR = -8,
};

nTokenType takes its values from the ETokenType enumeration. The scanner only scans the Python code and does not parse the grammar. Every value in the Python source is scanned as an integer, a floating-point number, or a string. This differs from native Python, whose numeric objects can represent arbitrarily large numbers; spython simplifies number handling to keep the implementation simple, following Lua's example. Every Token records its line number, so that syntax error reports can give useful information.
The scanner implementation itself is not posted here; if you are interested, look at the source code, which is relatively simple.
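
For readers who want a feel for the idea without opening the source, here is a minimal scanner sketch. It is not spython's real scanner: the scan function is a hypothetical helper, the Token and ETokenType definitions are repeated in trimmed form so the block compiles on its own, and indentation tokens, comments, escape sequences, and multi-character operators are all ignored.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>
#include <vector>

// Token kinds as listed in the post.
enum ETokenType { TOK_EOF = 0, TOK_VAR = -4, TOK_INT = -5,
                  TOK_FLOAT = -6, TOK_STR = -7, TOK_CHAR = -8 };

// Trimmed copy of the post's Token (dump() omitted).
struct Token {
    int         nTokenType = TOK_EOF;
    int64_t     nVal = 0;
    double      fVal = 0.0;
    std::string strVal;
    int         nLine = 0;
};

// Hypothetical helper: split source text into tokens, recording line numbers.
std::vector<Token> scan(const std::string& src) {
    std::vector<Token> out;
    int line = 1;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (c == '\n') { ++line; ++i; continue; }
        if (std::isspace(c)) { ++i; continue; }

        Token t;
        t.nLine = line;
        std::size_t start = i;
        if (std::isalpha(c) || c == '_') {
            // Identifier or keyword.
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) ++i;
            t.nTokenType = TOK_VAR;
            t.strVal = src.substr(start, i - start);
        } else if (std::isdigit(c)) {
            // Number: a '.' anywhere in the digits makes it a float.
            bool isFloat = false;
            while (i < src.size() &&
                   (std::isdigit(static_cast<unsigned char>(src[i])) || src[i] == '.')) {
                if (src[i] == '.') isFloat = true;
                ++i;
            }
            t.strVal = src.substr(start, i - start);
            if (isFloat) {
                t.nTokenType = TOK_FLOAT;
                t.fVal = std::strtod(t.strVal.c_str(), nullptr);
            } else {
                t.nTokenType = TOK_INT;
                t.nVal = std::strtoll(t.strVal.c_str(), nullptr, 10);
            }
        } else if (c == '\'' || c == '"') {
            // Quoted string: read up to the matching quote (no escapes).
            ++i;
            while (i < src.size() && src[i] != static_cast<char>(c)) ++i;
            t.nTokenType = TOK_STR;
            t.strVal = src.substr(start + 1, i - start - 1);
            if (i < src.size()) ++i;  // skip the closing quote
        } else {
            // Any other character becomes a single-character token.
            t.nTokenType = TOK_CHAR;
            t.strVal = std::string(1, src[i]);
            ++i;
        }
        out.push_back(t);
    }
    Token eof;          // nTokenType defaults to TOK_EOF
    eof.nLine = line;
    out.push_back(eof);
    return out;
}
```

Scanning the two-line input x = 10 followed by y = 2.5 + 'ab' yields nine tokens including the final TOK_EOF, with the token for y carrying nLine == 2. A parser built on top of this only has to look at nTokenType and the matching value field.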

Implementation of Parser

The Parser's header file is generated automatically by a script that parses the BNF. The Parser takes the token list produced by the scanner and builds an AST from it according to the BNF rules. The AST node object is defined as ExprAST:

class ExprAST {
public:
    ExprAST(){
    }
    virtual ~ExprAST() {}
    virtual PyObjPtr& eval(PyContext& context) = 0;

    unsigned int getFieldIndex(PyContext& context, PyObjPtr& obj);

    virtual PyObjPtr& getFieldVal(PyContext& context);
    virtual PyObjPtr& assignVal(PyContext& context, PyObjPtr& v){
        PyObjPtr& lval = this->eval(context);
        lval = v;
        return lval;
    }
    virtual void delVal(PyContext& context){
        PyObjPtr& lval = this->eval(context);
        lval = NULL;
    }

    virtual int getType() {
        return 0;
    }

public:
    std::string name;
    ExprLine    lineInfo;
    //std::vector<std::vector<int> >  module2objcet2fieldIndex;
    std::vector<int>                  module2objcet2fieldIndex;
};
class PyObj {
public:
    RefCounterData* getRefData(){
        return &refdata;
    }
    void release();
    typedef PySmartPtr<PyObj> PyObjPtr;
    PyObj():m_pObjIdInfo(NULL), handler(NULL){}
    virtual ~PyObj() {}

    int getType() const;
    virtual int getFieldNum() const { return m_objStack.size(); }
    static std::string dump(PyContext& context, PyObjPtr& self, int preBlank = 0);

    virtual PyObjPtr& getVar(PyContext& c, PyObjPtr& self, ExprAST* e);
    virtual const ObjIdInfo& getObjIdInfo() = 0;

    void clear(){
        m_objStack.clear();
    }
    inline PyObjHandler* getHandler() { return handler; }
    inline const PyObjHandler* getHandler() const { return handler; }
public:
    std::vector<PyObjPtr>    m_objStack;
    ObjIdInfo*               m_pObjIdInfo;
    PyObjHandler*            handler;
    RefCounterData           refdata;
};
typedef PyObj::PyObjPtr PyObjPtr;

ExprAST abstracts the operations of an AST node. The most important is eval. For example, evaluating 100 yields 100, and evaluating 'abc' yields the string 'abc'; each evaluation produces the corresponding value object. Every value object inherits from PyObj. Each PyObj holds a PyObjHandler, an interface that implements the operations on Python objects such as +, -, and /. Different Python value objects respond to these operations differently, which is done with C++ polymorphism.

class PyObjHandler{
public:
  virtual ~PyObjHandler(){}

  virtual int getType() const = 0;

  virtual std::string handleStr(PyContext& context, const PyObjPtr& self) const;
  virtual std::string handleRepr(PyContext& context, const PyObjPtr& self) const;
  virtual int handleCmp(PyContext& context, const PyObjPtr& self, const PyObjPtr& val) const;
  virtual bool handleBool(PyContext& context, const PyObjPtr& self) const;
  virtual bool handleEqual(PyContext& context, const PyObjPtr& self, const PyObjPtr& val) const;
  virtual bool handleLessEqual(PyContext& context, const PyObjPtr& self, const PyObjPtr& val) const;
  virtual bool handleGreatEqual(PyContext& context, const PyObjPtr& self, const PyObjPtr& val) const;
  virtual bool handleContains(PyContext& context, const PyObjPtr& self, const PyObjPtr& val) const;

  virtual bool handleLess(PyContext& context, const PyObjPtr& self, const PyObjPtr& val) const;
  virtual bool handleGreat(PyContext& context, const PyObjPtr& self, const PyObjPtr& val) const;

  virtual PyObjPtr& handleAdd(PyContext& context, PyObjPtr& self, PyObjPtr& val);
  virtual PyObjPtr& handleSub(PyContext& context, PyObjPtr& self, PyObjPtr& val);
  virtual PyObjPtr& handleMul(PyContext& context, PyObjPtr& self, PyObjPtr& val);
  virtual PyObjPtr& handleDiv(PyContext& context, PyObjPtr& self, PyObjPtr& val);
  virtual PyObjPtr& handleMod(PyContext& context, PyObjPtr& self, PyObjPtr& val);


  virtual PyObjPtr& handleIAdd(PyContext& context, PyObjPtr& self, PyObjPtr& val);
  virtual PyObjPtr& handleISub(PyContext& context, PyObjPtr& self, PyObjPtr& val);
  virtual PyObjPtr& handleIMul(PyContext& context, PyObjPtr& self, PyObjPtr& val);
  virtual PyObjPtr& handleIDiv(PyContext& context, PyObjPtr& self, PyObjPtr& val);
  virtual PyObjPtr& handleIMod(PyContext& context, PyObjPtr& self, PyObjPtr& val);

  virtual PyObjPtr& handleCall(PyContext& context, PyObjPtr& self, std::vector<ArgTypeInfo>& allArgsVal,
                               std::vector<PyObjPtr>& argAssignVal);
  virtual size_t    handleHash(PyContext& context, const PyObjPtr& self) const;
  virtual bool handleIsInstance(PyContext& context, PyObjPtr& self, PyObjPtr& val);
  virtual long handleLen(PyContext& context, PyObjPtr& self);
  virtual PyObjPtr& handleSlice(PyContext& context, PyObjPtr& self, PyObjPtr& startVal, int* stop, int step);
  virtual PyObjPtr& handleSliceAssign(PyContext& context, PyObjPtr& self, PyObjPtr& k, PyObjPtr& v);
  virtual void handleSliceDel(PyContext& context, PyObjPtr& self, PyObjPtr& k){}

  virtual void handleRelese(PyObj* data);
};
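
The interplay between eval and the handler interface can be sketched in a few lines. This is a reduced illustration under stated assumptions, not spython's code: MiniObj, MiniHandler, ConstAST, and AddAST are hypothetical stand-ins for PyObj, PyObjHandler, and ExprAST subclasses, there is no PyContext, and values are returned as std::shared_ptr by value instead of PyObjPtr references.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

struct MiniObj;
using MiniObjPtr = std::shared_ptr<MiniObj>;

// Reduced counterpart of PyObjHandler: one virtual function per operation.
struct MiniHandler {
    virtual ~MiniHandler() {}
    virtual MiniObjPtr handleAdd(const MiniObj& self, const MiniObj& val) const = 0;
    virtual std::string handleStr(const MiniObj& self) const = 0;
};

// Reduced counterpart of PyObj: a payload plus a pointer to its handler.
struct MiniObj {
    const MiniHandler* handler;
    int64_t            nVal;
    std::string        strVal;
};

struct IntHandler : MiniHandler {
    MiniObjPtr handleAdd(const MiniObj& self, const MiniObj& val) const override;
    std::string handleStr(const MiniObj& self) const override { return std::to_string(self.nVal); }
};
struct StrHandler : MiniHandler {
    MiniObjPtr handleAdd(const MiniObj& self, const MiniObj& val) const override;
    std::string handleStr(const MiniObj& self) const override { return self.strVal; }
};

// One shared handler instance per value type.
static const IntHandler g_intHandler{};
static const StrHandler g_strHandler{};

MiniObjPtr makeInt(int64_t v) {
    return std::make_shared<MiniObj>(MiniObj{&g_intHandler, v, ""});
}
MiniObjPtr makeStr(const std::string& s) {
    return std::make_shared<MiniObj>(MiniObj{&g_strHandler, 0, s});
}

// The same '+' means arithmetic for ints and concatenation for strings.
MiniObjPtr IntHandler::handleAdd(const MiniObj& self, const MiniObj& val) const {
    if (val.handler != this) throw std::runtime_error("unsupported operand types for +");
    return makeInt(self.nVal + val.nVal);
}
MiniObjPtr StrHandler::handleAdd(const MiniObj& self, const MiniObj& val) const {
    if (val.handler != this) throw std::runtime_error("unsupported operand types for +");
    return makeStr(self.strVal + val.strVal);
}

// Reduced counterpart of ExprAST: every node knows how to evaluate itself.
struct MiniAST {
    virtual ~MiniAST() {}
    virtual MiniObjPtr eval() const = 0;
};
using MiniASTPtr = std::unique_ptr<MiniAST>;

// A literal evaluates to its own value object.
struct ConstAST : MiniAST {
    explicit ConstAST(MiniObjPtr v) : value(std::move(v)) {}
    MiniObjPtr eval() const override { return value; }
    MiniObjPtr value;
};

// 'left + right': evaluate both children, then dispatch through the handler.
struct AddAST : MiniAST {
    AddAST(MiniASTPtr l, MiniASTPtr r) : left(std::move(l)), right(std::move(r)) {}
    MiniObjPtr eval() const override {
        MiniObjPtr a = left->eval();
        MiniObjPtr b = right->eval();
        return a->handler->handleAdd(*a, *b);
    }
    MiniASTPtr left, right;
};
```

Evaluating an AddAST over the constants 100 and 23 gives a value whose handleStr is "123", and the same node type over 'ab' and 'c' gives "abc". AddAST never inspects the operand types itself; the virtual call on the handler selects the behaviour, which is the design choice the handler interface above reflects.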

Implementation of Python Library

The python libraries implemented are listed as follows:

  1. list dict tuple copy string
  2. datetime
  3. json
  4. math
  5. os
  6. random
  7. open stringio
  8. struct
  9. sys
  10. weak

Summary

spython is a small Python. I originally intended to implement only the simplest version of a Python interpreter, but the work went smoothly and I ended up implementing the commonly used Python libraries in one go. The most successful part of spython is the parsing and execution of the AST: the code structure follows the BNF rules, which makes it clear and straightforward. There are two main shortcomings. First, the syntax error messages are still too simple and not friendly enough. Second, the performance does not match native Python. As mentioned earlier, a register-based VM is needed to reach or even exceed the performance of native Python. That work is already under way, but I will not release the code for now; I will wait until it has taken shape.
Code address: https://git.oschina.net/ownit/spython
Build: run make directly on Linux; Dev-C++ is needed on Windows.

Posted by Joeddox on Sun, 07 Apr 2019 12:42:30 -0700