PostgreSQL -- Semantic Analysis

Keywords: PostgreSQL

2021SC@SDUSC

PostgreSQL code I am responsible for: query compilation and execution
Content of this blog: semantic analysis
In my last blog, I analyzed the lexical analysis and syntax analysis of query analysis. We can know that after lexical analysis and syntax analysis, we can get an analysis tree stored in SelectStmt structure. From the flow analysis diagram, we can know that we still need syntax analysis to get the query tree.

Semantic analysis: the semantic analysis phase will check whether there are components in the command that do not meet the semantic requirements and whether the command can be executed correctly.
In the semantic analysis phase, parse in analyze.c file is mainly responsible for semantic analysis_ Analyze function. parse_ The analyze function will generate a corresponding query tree according to the analysis tree, and then the query rewriting module will further modify the query tree and rewrite the query tree into a query tree linked list. In parse_ The analyze function is divided into seven cases according to the command type:

Important structures involved in semantic analysis:

Query: used to store the query tree and is the final output result of query analysis
ParseState: ParseState: used to record intermediate information for semantic analysis

struct ParseState
{
	struct ParseState *parentParseState;	//If it is currently a subquery, this field points to its outer query
	const char *p_sourcetext;	//The original sql command is only used to report the location of parsing errors
	List	   *p_rtable;		//Table involved in query -- range table
	List	   *p_joinexprs;	//Join expression
	List	   *p_joinlist;	//Connection item	
	List	   *p_namespace;	//Table name collection, used to check table name conflicts
	bool		p_lateral_active;	//Is there an associated reference
	List	   *p_ctenamespace;//A collection of public expression names
	List	   *p_future_ctes;	//Not at p_ Common expressions in ctenamespace
	CommonTableExpr *p_parent_cte;	/* this query's containing CTE */
	Relation	p_target_relation;	//Target table
	RangeTblEntry *p_target_rangetblentry;	/* target rel's RTE */
	bool		p_is_insert;	//Is it an insert statement
	List	   *p_windowdefs;	//Original definition form of window clause
	ParseExprKind p_expr_kind;	//Type of parameter
	int			p_next_resno;	
	List	   *p_multiassign_exprs;	
	List	   *p_locking_clause;	//locking clause
	bool		p_locked_from_parent;	
	bool		p_resolve_unknowns; //Unresolved types
	QueryEnvironment *p_queryEnv;	/* curr env, incl refs to enclosing env */

	bool		p_hasAggs;//Is there an aggregate function
	bool		p_hasWindowFuncs;//Is there a window function
	bool		p_hasTargetSRFs;//Is there an SRF
	bool		p_hasSubLinks;//Are there child links
	bool		p_hasModifyingCTE;//Modify CTE
	Node	   *p_last_srf;		
	/*
	 * Optional hook functions for parser callbacks.  These are null unless
	 * set up by the caller of make_parsestate.
	 */
	PreParseColumnRefHook p_pre_columnref_hook;
	PostParseColumnRefHook p_post_columnref_hook;
	ParseParamRefHook p_paramref_hook;
	CoerceParamHook p_coerce_param_hook;
	void	   *p_ref_hook_state;	/* common passthrough link for above */
};
typedef struct Query
{
	NodeTag		type;//Node type, T_Query

	CmdType		commandType;	//Command type
	QuerySource querySource;	//Is it an original query or a query from a rule
	uint64		queryId;		/* query identifier (can be set by plugins) */

	bool		canSetTag;		
	//Used in query rewriting. If the query is converted from the original query, this field is false; If query is rewritten by query or newly added during query planning, this field is true
	Node	   *utilityStmt;//Define cursors or non optimizable query statements
	int			resultRelation; //Result relationship
	bool		hasAggs;//Is there an aggregate function in the target clause or having clause
	bool		hasWindowFuncs; //Is there a window function in the target attribute
	bool		hasTargetSRFs;	/* has set-returning functions in tlist */
	bool		hasSubLinks;	//Whether there are sub queries
	bool		hasDistinctOn;	//distinct clause
	bool		hasRecursive;	//Whether recursion is allowed in public expressions
	bool		hasModifyingCTE;	/* has INSERT/UPDATE/DELETE in WITH */
	bool		hasForUpdate;	//Are there any updates
	bool		hasRowSecurity; /* rewriter has applied some RLS policy */

	List	   *cteList;		//The with clause is used for public expressions
	List	   *rtable;			//Range table
	FromExpr   *jointree;		//The connection tree describes the connections from and where
	List	   *targetList;		//Target attribute
	OverridingKind override;	/* OVERRIDING clause */
	OnConflictExpr *onConflict; /* ON CONFLICT DO [NOTHING | UPDATE] */
	List	   *returningList;	//returning clause
	List	   *groupClause;	//group clause

	List	   *groupingSets;	//Group sets clause

	Node	   *havingQual;		//having clause

	List	   *windowClause;	//Window function clause
	List	   *distinctClause; //distinct clause
	List	   *sortClause;		//order clause
	Node	   *limitOffset;	//offset clause
	Node	   *limitCount;		//limit clause
	List	   *rowMarks;		//Row mark linked list, used in for update and for share clauses
	Node	   *setOperations;	//Set operation (union, intersect, except)
	List	   *constraintDeps; 
	List	   *withCheckOptions;	
	/*
	 * The following two fields identify the portion of the source text string
	 * containing this query.  They are typically only populated in top-level
	 * Queries, not in sub-queries.  When not set, they might both be zero, or
	 * both be -1 meaning "unknown".
	 */
	int			stmt_location;	//Where the record begins
	int			stmt_len;		//Length of record string
} Query;

parse_analyze function

Function parse_analyze first generates a ParseState structure to record the state of semantic analysis, and then completes the process of semantic analysis by calling the function transformStmt. The function transformStmt will call corresponding functions for row processing according to different query types.

The transformStmt function includes:
transformInsertStmt function
transformDeleteStmt function
transformUpdateStmt function
transformSelectStmt function
transformDeclareCursorStmt function
transformExplainStmt function
others

Query* parse_analyze(
    Node* parseTree, const char* sourceText, Oid* paramTypes, int numParams, bool isFirstNode, bool isCreateView)
{
     // 
     ParseState* pstate = make_parsestate(NULL);
     ...
     // transformation
     query = transformTopLevelStmt(pstate, parseTree, isFirstNode, isCreateView);
     ...
    pfree_ext(pstate->p_ref_hook_state);
    free_parsestate(pstate);
     ...
     // Return query tree
     return query;
}

transformSelectStmt function

It can be seen from the above that there are many transformStmt functions. Different types correspond to different sql operation statements. Here I analyze the transformSelectStmt function, the most important select clause in sql query.

static Query *
transformSelectStmt(ParseState *pstate, SelectStmt *stmt)
//transformSelectStmt this function performs semantic analysis on the select query
//SelectStmt *stmt is the analysis tree returned by lexical analysis
//ParseState *pstate: used to record the state of semantic analysis
{
	Query	   *qry = makeNode(Query);//makeNode linked list initialization
	//The query tree is finally organized in the data structure of query
	Node	   *qual;//postgresql converts all types of pointers to node type pointers to facilitate parameter passing
	ListCell   *l;

	qry->commandType = CMD_SELECT;//The type corresponding to the query tree command. The command type is select

	//Handle with Clause
	if (stmt->withClause)//Judge whether there is a with Clause
	{
		qry->hasRecursive = stmt->withClause->recursive;//Gets whether the with clause in the expression supports recursion
		qry->cteList = transformWithClause(pstate, stmt->withClause);
		qry->hasModifyingCTE = pstate->p_hasModifyingCTE;
	}
	if (stmt->intoClause)//Judge whether there is an into. If there is an into, an error will be reported, because postgresql treats the into clause as a special case
		ereport(ERROR,
				(errcode(ERRCODE_SYNTAX_ERROR),
				 errmsg("SELECT ... INTO is not allowed here"),//Prompt: into is not allowed
				 parser_errposition(pstate,
									exprLocation((Node *) stmt->intoClause))));

	pstate->p_locking_clause = stmt->lockingClause;

	pstate->p_windowdefs = stmt->windowClause;//Store the window clause of the analysis tree in the form of linked list
	transformFromClause(pstate, stmt->fromClause);//Handle from clause
	qry->targetList = transformTargetList(pstate, stmt->targetList,
										  EXPR_KIND_SELECT_TARGET);
	//targetList attribute for storing query results
	
	markTargetListOrigins(pstate, qry->targetList);//Indicate the source of query results

	//Handling the where clause in a select expression
	qual = transformWhereClause(pstate, stmt->whereClause,
								EXPR_KIND_WHERE, "WHERE");

	//Handling the having clause in a select expression
	qry->havingQual = transformWhereClause(pstate, stmt->havingClause,
										   EXPR_KIND_HAVING, "HAVING");

	//Sort first. Because sorting results are required for both group processing and distinct clause processing, sort first and then filter groups
	//sort
	qry->sortClause = transformSortClause(pstate,
										  stmt->sortClause,
										  &qry->targetList,
										  EXPR_KIND_ORDER_BY,
										  false /* allow SQL92 rules */ );
    //grouping
	qry->groupClause = transformGroupClause(pstate,
											stmt->groupClause,
											&qry->groupingSets,
											&qry->targetList,
											qry->sortClause,
											EXPR_KIND_GROUP_BY,
											false /* allow SQL92 rules */ );
    //Judge whether there is distinct, that is, whether repeated results are allowed
	if (stmt->distinctClause == NIL)//If there is no distinct clause in the analysis tree
	{
		qry->distinctClause = NIL;
		qry->hasDistinctOn = false;
	}

	else if (linitial(stmt->distinctClause) == NULL)//The linitial function obtains the data of the first cell in the linked list
	{
		/* We had SELECT DISTINCT */
		//Determine whether the data obtained in the distinct clause is empty

		qry->distinctClause = transformDistinctClause(pstate,
													  &qry->targetList,
													  qry->sortClause,
													  false);
		qry->hasDistinctOn = false;
	}
	else
	{
		/* We had SELECT DISTINCT ON */
		qry->distinctClause = transformDistinctOnClause(pstate,
														stmt->distinctClause,
														&qry->targetList,
														qry->sortClause);
		qry->hasDistinctOn = true;
	}
	/* transform LIMIT */
	//Limit clause: limit the number of results to avoid full table scanning and improve query efficiency.
	//offset clause, select which line to start scanning from
	qry->limitOffset = transformLimitClause(pstate, stmt->limitOffset,
											EXPR_KIND_OFFSET, "OFFSET");
	qry->limitCount = transformLimitClause(pstate, stmt->limitCount,
										   EXPR_KIND_LIMIT, "LIMIT");

Limit can be used to force a SELECT statement to return a specified number of records
The high execution efficiency of Limit is stated under a specific condition: that is, the number of databases is large, but only part of the data needs to be queried.
The principle of high efficiency is to avoid full table scanning and improve query efficiency.

	//Process window functions to calculate values for a set of rows
qry->windowClause = transformWindowDefinitions(pstate,								   pstate->p_windowdefs,									   &qry->targetList);

Window function: window function, also known as analysis function, calculates values for a group of rows and returns a result for each row. This is different from aggregate functions; The aggregate function returns a result for a set of rows. The windowing function contains an OVER clause that defines the row window covering the row to be calculated. For each row, the system uses the selected row window as input to calculate the analysis function results and possibly aggregate them.

	//Resolve unresolved output columns to type text
	if (pstate->p_resolve_unknowns)//Determine whether there are unresolved output columns
		resolveTargetListUnknowns(pstate, qry->targetList);

	qry->rtable = pstate->p_rtable;//Process select the range of the target table to query
	qry->jointree = makeFromExpr(pstate->p_joinlist, qual);
	//The connection tree handles the connection between from and where, and displays whether there are table connections in the from clause

	qry->hasSubLinks = pstate->p_hasSubLinks;//Check whether there are subqueries in from
	qry->hasWindowFuncs = pstate->p_hasWindowFuncs;//Check whether there is a window function in the objective function
	qry->hasTargetSRFs = pstate->p_hasTargetSRFs;//Check whether there are self-defined SRF functions
	qry->hasAggs = pstate->p_hasAggs;//Check whether there is an aggregate function in the target clause or the having clause

	foreach(l, stmt->lockingClause)//Check whether there are locking clauses: for update and for share clauses. If there are, lock them
	//Maintain consistency
	{
		transformLockingClause(pstate, qry,
							   (LockingClause *) lfirst(l), false);
	}

	assign_query_collations(pstate, qry);//Mark all expressions with collation information in a given query

	if (pstate->p_hasAggs || qry->groupClause || qry->groupingSets || qry->havingQual)
	    //Judge whether the aggregation function is used incorrectly or the grouping is wrong
		parseCheckAggregates(pstate, qry);

	return qry;
}

Each sublist of GROUPING SETS can specify one or more columns or expressions, which will be interpreted in the same way as they appear directly in the GROUP BY clause.

SRF: set return fountain. PostgreSQL supports SRF functions, that is, functions that return multiple rows of data. We can also define them ourselves. We only need returns setof type.

Structure definition and interpretation involved in transformSelectStmt function

struct ListCell//A node that stores the analysis tree
{
	union//Common body type. All members of the common body occupy the same memory. Modifying one member will affect all other members.
	{
		void	   *ptr_value;
		int			int_value;
		Oid			oid_value;//postgresql uses an unsigned integer to identify all objects and files
	}			data;
	//The above three variables share the same memory
	ListCell   *next;
};

#define linitial(l)				list_lfirst(head(l))
#define lfirst(lc)				((lc)->data.ptr_value)
//The linitial function obtains the data of the first cell in the linked list

summary

Through the lexical syntax analysis of the last blog and the syntax analysis of this one, I basically understand the execution process of query analysis and the corresponding functions and implementation.
Thank you for your criticism!

Posted by maxmjessop on Fri, 15 Oct 2021 01:25:03 -0700