[software engineering practice] Hive research - Blog9

Keywords: Big Data Hadoop hive

2021SC@SDUSC

Research content introduction

I am responsible for the part of Hive that converts the query block (QB) into a logical query plan (OP Tree).
The code analyzed here comes from apache-hive-3.1.2-src/ql/src/java/org/apache/hadoop/hive/ql/plan. In the previous posts (Hive research - Blog1-8) we finished reading all the code under the mapper folder. Starting this week we move on to the next folder, ptf. This week's task is to study BoundaryDef.java in that folder.

Code analysis of BoundaryDef.java file

We first attach the complete source of the file.

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.hive.ql.plan.ptf;

import org.apache.hadoop.hive.ql.parse.WindowingSpec.BoundarySpec;
import org.apache.hadoop.hive.ql.parse.WindowingSpec.Direction;

public class BoundaryDef {
  Direction direction;
  private int amt;
  private final int relativeOffset;

  public BoundaryDef(Direction direction, int amt) {
    this.direction = direction;
    this.amt = amt;

    // Calculate relative offset
    switch(this.direction) {
    case PRECEDING:
      relativeOffset = -amt;
      break;
    case FOLLOWING:
      relativeOffset = amt;
      break;
    default:
      relativeOffset = 0;
    }
  }

  public Direction getDirection() {
    return direction;
  }

  /**
   * Returns if the bound is PRECEDING.
   * @return if the bound is PRECEDING
   */
  public boolean isPreceding() {
    return this.direction == Direction.PRECEDING;
  }

  /**
   * Returns if the bound is FOLLOWING.
   * @return if the bound is FOLLOWING
   */
  public boolean isFollowing() {
    return this.direction == Direction.FOLLOWING;
  }

  /**
   * Returns if the bound is CURRENT ROW.
   * @return if the bound is CURRENT ROW
   */
  public boolean isCurrentRow() {
    return this.direction == Direction.CURRENT;
  }

  /**
   * Returns offset from XX PRECEDING/FOLLOWING.
   *
   * @return offset from XX PRECEDING/FOLLOWING
   */
  public int getAmt() {
    return amt;
  }

  /**
   * Returns signed offset from XX PRECEDING/FOLLOWING. Negative for preceding.
   *
   * @return signed offset from XX PRECEDING/FOLLOWING
   */
  public int getRelativeOffset() {
    return relativeOffset;
  }


  public boolean isUnbounded() {
    return this.getAmt() == BoundarySpec.UNBOUNDED_AMOUNT;
  }

  public int compareTo(BoundaryDef other) {
    int c = getDirection().compareTo(other.getDirection());
    if (c != 0) {
      return c;
    }

    return this.direction == Direction.PRECEDING ? other.amt - this.amt : this.amt - other.amt;
  }

  @Override
  public String toString() {
    if (direction == null) return "";
    if (direction == Direction.CURRENT) {
      return Direction.CURRENT.toString();
    }

    return direction + "(" + (getAmt() == Integer.MAX_VALUE ? "MAX" : getAmt()) + ")";
  }
}

Parsing started.

Brief analysis of global variables

Let's first look at the fields of the class (called global variables in the rest of this post).

  Direction direction;
  private int amt;
  private final int relativeOffset;

What kind of class is Direction? Searching the web for keywords ranging from "Direction in Java" to "Direction in Hive" did not turn up anything useful, so we changed our approach and looked at the imports to see whether one of them contains this class. Sure enough, the file begins with the following statement:
import org.apache.hadoop.hive.ql.parse.WindowingSpec.Direction;
This is a type defined by Hive itself, so we looked it up in the API documentation on the Apache website. The documentation shows that it extends Enum<WindowingSpec.Direction>. The key piece of information here is the keyword Enum. In Java, an enum is a special kind of class with a fixed set of named instances. Code refers to these instances by name, so there is no risk of constructing a new object with wrong arguments. Here is one example of its use:

public enum WeekDay { 
     Mon("Monday"), Tue("Tuesday"), Wed("Wednesday"), Thu("Thursday"), Fri( "Friday"), Sat("Saturday"), Sun("Sunday"); 
     private final String day; 
     private WeekDay(String day) { 
            this.day = day; 
     } 
    public static void printDay(int i){ 
       switch(i){ 
           case 1: System.out.println(WeekDay.Mon); break; 
           case 2: System.out.println(WeekDay.Tue);break; 
           case 3: System.out.println(WeekDay.Wed);break; 
           case 4: System.out.println(WeekDay.Thu);break; 
           case 5: System.out.println(WeekDay.Fri);break; 
           case 6: System.out.println(WeekDay.Sat);break; 
           case 7: System.out.println(WeekDay.Sun);break; 
           default:System.out.println("wrong number!"); 
         } 
     } 
    public String getDay() { 
        return day; 
     } 
}

Under the hood, the compiler turns the WeekDay enum into an ordinary class. Disassembling the compiled class (for example with javap) gives the following:

public final class WeekDay extends java.lang.Enum{ 
    public static final WeekDay Mon; 
    public static final WeekDay Tue; 
    public static final WeekDay Wed; 
    public static final WeekDay Thu; 
    public static final WeekDay Fri; 
    public static final WeekDay Sat; 
    public static final WeekDay Sun; 
    static {}; 
    public static void printDay(int); 
    public java.lang.String getDay(); 
    public static WeekDay[] values(); 
    public static WeekDay valueOf(java.lang.String); 
}

We can see that a compiled enum is simply a final class that extends java.lang.Enum, which is very convenient. It can also override toString and other methods and define its own getter and setter methods, which makes it easy to use.

Let's move on to the official API documentation. WindowingSpec.Direction contains three constants: PRECEDING, CURRENT and FOLLOWING. The enum has two static methods. The first, valueOf(String name), returns the enum constant whose name matches the string passed in. The second, values(), returns all the constants of the enum as an array.

values

Returns a WindowingSpec.Direction array containing the constants in the order in which they are declared. It is called as WindowingSpec.Direction.values().

valueOf

Returns the enum constant whose name equals the name passed in.
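The two static methods can be tried on a small stand-in enum. This is only a sketch: the nested Direction below is a hypothetical copy written for this post, and the order of its constants is an assumption, not something taken from the Hive source.

```java
import java.util.Arrays;

public class DirectionEnumDemo {
    // Hypothetical stand-in for WindowingSpec.Direction (constant order is assumed).
    enum Direction { PRECEDING, CURRENT, FOLLOWING }

    public static void main(String[] args) {
        // values() returns every constant, in declaration order.
        System.out.println(Arrays.toString(Direction.values()));

        // valueOf() looks a constant up by its exact name.
        Direction d = Direction.valueOf("FOLLOWING");
        System.out.println(d + " has ordinal " + d.ordinal());

        // A name that does not match exactly raises IllegalArgumentException.
        try {
            Direction.valueOf("following");
        } catch (IllegalArgumentException e) {
            System.out.println("no constant named 'following'");
        }
    }
}
```

The lookup is case-sensitive, which is why the lower-case name fails.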

Since BoundaryDef still uses types and methods we have not seen, we first look at the other import, org.apache.hadoop.hive.ql.parse.WindowingSpec.BoundarySpec, and then start on the source code of the file itself.

Similarly, we can find the description of this class in the official API documentation.

The documentation shows that this is an abstract class that implements Comparable. Since it is abstract, some of its methods are not implemented, but the implemented ones are enough for our purposes. This class has four methods.

Methods getAmt() and setAmt(int amt)
As the names suggest, these are a getter and setter pair. What is amt? Looking at the fields of the class, we find an int constant named UNBOUNDED_AMOUNT, so amt is short for amount. These two methods get and set that amount.

Methods getDirection() and setDirection(WindowingSpec.Direction dir)
We already explained WindowingSpec.Direction above, so we will not repeat it here. These two methods get and set a value of type WindowingSpec.Direction.

Back in BoundaryDef.java, we now know the meaning of the field direction, and that amt is the same kind of amount that BoundarySpec carries. We do not yet know what the final field relativeOffset is for; that becomes clear below.

So far, we have explained all the imported packages and the global variables of BoundaryDef.java. Now we have done enough preliminary work and can officially start parsing.

Construction method BoundaryDef

  public BoundaryDef(Direction direction, int amt) {
    this.direction = direction;
    this.amt = amt;

    // Calculate relative offset
    switch(this.direction) {
    case PRECEDING:
      relativeOffset = -amt;
      break;
    case FOLLOWING:
      relativeOffset = amt;
      break;
    default:
      relativeOffset = 0;
    }
  }

The constructor first stores the two arguments in the fields, which is routine. Then comes a switch statement, which is equivalent to a chain of if statements. When direction is PRECEDING, relativeOffset is set to -amt; when it is FOLLOWING, relativeOffset is set to amt; for any other value (which includes CURRENT), relativeOffset is set to 0. Because relativeOffset is final, it is assigned exactly once, on whichever branch runs. That is the whole constructor.
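To get a feeling for what a signed offset is good for, here is a minimal sketch. It repeats the switch from the constructor as a static helper and adds the result to a row index. The Direction enum, the helper name relativeOffset and the row index 10 are all made up for illustration; how Hive itself consumes getRelativeOffset() is not shown in this file.

```java
public class RelativeOffsetDemo {
    // Hypothetical stand-in for WindowingSpec.Direction.
    enum Direction { PRECEDING, CURRENT, FOLLOWING }

    // Same switch as the BoundaryDef constructor, written as a function.
    static int relativeOffset(Direction direction, int amt) {
        switch (direction) {
        case PRECEDING:
            return -amt;
        case FOLLOWING:
            return amt;
        default:
            return 0;
        }
    }

    public static void main(String[] args) {
        int currentRow = 10; // hypothetical index of the current row

        // A frame such as ROWS BETWEEN 2 PRECEDING AND 3 FOLLOWING
        int start = currentRow + relativeOffset(Direction.PRECEDING, 2);
        int end = currentRow + relativeOffset(Direction.FOLLOWING, 3);

        System.out.println("frame covers rows " + start + " to " + end); // rows 8 to 13
    }
}
```

With the sign already folded into the offset, both ends of the frame are computed by the same addition, with no further branching on the direction.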

Method getDirection

  public Direction getDirection() {
    return direction;
  }

This is a simple getter method. It returns the global variable direction.

Method isPreceding

  /**
   * Returns if the bound is PRECEDING.
   * @return if the bound is PRECEDING
   */
  public boolean isPreceding() {
    return this.direction == Direction.PRECEDING;
  }

This method checks whether direction is the PRECEDING constant. Note how enum values are compared: every enum constant exists exactly once, so == is the usual way to compare them. equals would give the same result, but == also works when the reference is null and lets the compiler reject comparisons between unrelated types.
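To see how == and equals actually behave on enum constants, here is a small sketch. The Direction enum and the isPreceding helper are stand-ins written for this post, not the Hive classes.

```java
public class EnumEqualityDemo {
    // Hypothetical stand-in for WindowingSpec.Direction.
    enum Direction { PRECEDING, CURRENT, FOLLOWING }

    // Same comparison style as BoundaryDef.isPreceding().
    static boolean isPreceding(Direction d) {
        return d == Direction.PRECEDING; // works even when d is null
    }

    public static void main(String[] args) {
        Direction d = Direction.PRECEDING;
        System.out.println(d == Direction.PRECEDING);      // true
        System.out.println(d.equals(Direction.PRECEDING)); // true, same result

        Direction missing = null;
        System.out.println(isPreceding(missing));          // false, no exception

        try {
            missing.equals(Direction.PRECEDING);
        } catch (NullPointerException e) {
            System.out.println("equals on a null reference throws");
        }
    }
}
```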

Method isFollowing

  /**
   * Returns if the bound is FOLLOWING.
   * @return if the bound is FOLLOWING
   */
  public boolean isFollowing() {
    return this.direction == Direction.FOLLOWING;
  }

Similar to the previous method, this method returns a boolean that tells whether the global variable direction is equal to Direction.FOLLOWING.

Method isCurrentRow

  /**
   * Returns if the bound is CURRENT ROW.
   * @return if the bound is CURRENT ROW
   */
  public boolean isCurrentRow() {
    return this.direction == Direction.CURRENT;
  }

Similar to the previous two methods, this one returns a boolean that tells whether the global variable direction is equal to Direction.CURRENT.

Method getAmt

  public int getAmt() {
    return amt;
  }

This is a getter method used to get the value of the global variable amt.

Method getRelativeOffset

  /**
   * Returns signed offset from XX PRECEDING/FOLLOWING. Negative for preceding.
   *
   * @return signed offset from XX PRECEDING/FOLLOWING
   */
  public int getRelativeOffset() {
    return relativeOffset;
  }

This is also a getter method. It returns the value of relativeOffset, which is negative for PRECEDING bounds.

Method isUnbounded

  public boolean isUnbounded() {
    return this.getAmt() == BoundarySpec.UNBOUNDED_AMOUNT;
  }

This checks whether our global variable amt is equal to the constant UNBOUNDED_AMOUNT in BoundarySpec and returns the result as a boolean.

Method compareTo

  public int compareTo(BoundaryDef other) {
    int c = getDirection().compareTo(other.getDirection());
    if (c != 0) {
      return c;
    }

    return this.direction == Direction.PRECEDING ? other.amt - this.amt : this.amt - other.amt;
  }

Let's take a look at the compareTo method first. compareTo is the comparison method defined by the Comparable interface. The numeric wrapper classes such as Byte, Long and Integer implement it, and so does every enum. Both operands must have the same type; values of two different types cannot be compared with this method.

Syntax

public int compareTo( NumberSubClass referenceName )

Parameter
referenceName – can be a Byte, Double, Integer, Float, Long or Short value.

Return value

Returns 0 if the number is equal to the argument.

Returns a negative value (-1 for Integer) if the number is less than the argument.

Returns a positive value (1 for Integer) if the number is greater than the argument.

Let's take an example to intuitively understand the usage of this function:

public class Test{ 
   public static void main(String args[]){
      Integer x = 5;
      System.out.println(x.compareTo(3));
      System.out.println(x.compareTo(5));
      System.out.println(x.compareTo(8));            
     }
}

output

1
0
-1

Now look at the assignment to c. getDirection() returns the direction of this object, which is compared with the direction of the argument. Because Direction is an enum, this compareTo compares the positions of the two constants in the enum declaration. If c is not 0, the two directions differ and c is returned. If c is 0, the directions are the same and the two amt values decide the result. For PRECEDING the method returns other.amt - this.amt, so the bound with the larger amount (the one further back) sorts first; otherwise it returns this.amt - other.amt, so the bound with the larger amount sorts later.
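The ordering that this comparison produces is easier to see on a few values. The sketch below uses a simplified stand-in class, Bound, that carries only direction and amt and repeats the compareTo logic. The class name, the helper sorted and the order of the Direction constants are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BoundaryOrderDemo {
    // Hypothetical stand-in for WindowingSpec.Direction (constant order is assumed).
    enum Direction { PRECEDING, CURRENT, FOLLOWING }

    // Hypothetical, simplified stand-in for BoundaryDef.
    static final class Bound {
        final Direction direction;
        final int amt;

        Bound(Direction direction, int amt) {
            this.direction = direction;
            this.amt = amt;
        }

        // Same logic as BoundaryDef.compareTo.
        int compareTo(Bound other) {
            int c = direction.compareTo(other.direction);
            if (c != 0) {
                return c;
            }
            return direction == Direction.PRECEDING ? other.amt - amt : amt - other.amt;
        }

        @Override
        public String toString() {
            return direction == Direction.CURRENT ? "CURRENT" : direction + "(" + amt + ")";
        }
    }

    // Returns the given bounds sorted with Bound.compareTo.
    static List<Bound> sorted(Bound... bounds) {
        List<Bound> list = new ArrayList<>(Arrays.asList(bounds));
        list.sort(Bound::compareTo);
        return list;
    }

    public static void main(String[] args) {
        System.out.println(sorted(
            new Bound(Direction.FOLLOWING, 1),
            new Bound(Direction.PRECEDING, 1),
            new Bound(Direction.CURRENT, 0),
            new Bound(Direction.PRECEDING, 5)));
        // [PRECEDING(5), PRECEDING(1), CURRENT, FOLLOWING(1)]
    }
}
```

Under the assumed constant order, the sorted list runs from the row furthest behind the current row to the row furthest ahead of it.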

Method toString

  @Override
  public String toString() {
    if (direction == null) return "";
    if (direction == Direction.CURRENT) {
      return Direction.CURRENT.toString();
    }

    return direction + "(" + (getAmt() == Integer.MAX_VALUE ? "MAX" : getAmt()) + ")";
  }

This overrides the toString method. If the global variable direction is null, an empty string is returned. If direction is CURRENT, the result is Direction.CURRENT.toString(). Otherwise the result is the direction followed by the amount in parentheses, where the amount is printed as MAX if amt equals Integer.MAX_VALUE and as the number itself otherwise.
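The three branches can be checked with a short sketch. The helper describe repeats the toString logic as a static method. The Direction enum is a stand-in, and the constant UNBOUNDED_AMOUNT is assumed here to equal Integer.MAX_VALUE, which is what the MAX check in toString suggests.

```java
public class BoundaryToStringDemo {
    // Hypothetical stand-in for WindowingSpec.Direction.
    enum Direction { PRECEDING, CURRENT, FOLLOWING }

    // Assumption: mirrors BoundarySpec.UNBOUNDED_AMOUNT.
    static final int UNBOUNDED_AMOUNT = Integer.MAX_VALUE;

    // Same branches as BoundaryDef.toString, as a static helper.
    static String describe(Direction direction, int amt) {
        if (direction == null) {
            return "";
        }
        if (direction == Direction.CURRENT) {
            return Direction.CURRENT.toString();
        }
        return direction + "(" + (amt == Integer.MAX_VALUE ? "MAX" : amt) + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe(Direction.PRECEDING, 2));                // PRECEDING(2)
        System.out.println(describe(Direction.CURRENT, 0));                  // CURRENT
        System.out.println(describe(Direction.FOLLOWING, UNBOUNDED_AMOUNT)); // FOLLOWING(MAX)
    }
}
```

Under that assumption, an unbounded bound is exactly the case that prints as MAX.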

So far, the whole BoundaryDef.java file has been parsed.

Code analysis of PTFExpressionDef.java file

We now parse another java file, again attaching the source code first.

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.hive.ql.plan.ptf;

import org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator;
import org.apache.hadoop.hive.ql.plan.Explain;
import org.apache.hadoop.hive.ql.plan.Explain.Level;
import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

public class PTFExpressionDef {
  String expressionTreeString;
  ExprNodeDesc exprNode;
  transient ExprNodeEvaluator exprEvaluator;
  transient ObjectInspector OI;

  public PTFExpressionDef() {}

  public PTFExpressionDef(PTFExpressionDef e) {
    expressionTreeString = e.getExpressionTreeString();
    exprNode = e.getExprNode();
    exprEvaluator = e.getExprEvaluator();
    OI = e.getOI();
  }

  public String getExpressionTreeString() {
    return expressionTreeString;
  }

  public void setExpressionTreeString(String expressionTreeString) {
    this.expressionTreeString = expressionTreeString;
  }

  public ExprNodeDesc getExprNode() {
    return exprNode;
  }

  public void setExprNode(ExprNodeDesc exprNode) {
    this.exprNode = exprNode;
  }

  @Explain(displayName = "expr", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
  public String getExprNodeExplain() {
    return exprNode == null ? null : exprNode.getExprString();
  }

  public ExprNodeEvaluator getExprEvaluator() {
    return exprEvaluator;
  }

  public void setExprEvaluator(ExprNodeEvaluator exprEvaluator) {
    this.exprEvaluator = exprEvaluator;
  }

  public ObjectInspector getOI() {
    return OI;
  }

  public void setOI(ObjectInspector oI) {
    OI = oI;
  }
}

Start parsing.

Global variable resolution

  String expressionTreeString;
  ExprNodeDesc exprNode;
  transient ExprNodeEvaluator exprEvaluator;
  transient ObjectInspector OI;

Let's first see what kind of class ExprNodeDesc is. Among the imports we find org.apache.hadoop.hive.ql.plan.ExprNodeDesc, and the class is described in the official API documentation.

When we need its methods or properties, we can look them up there. Next comes the Java keyword transient, which we meet here for the first time. What does it do? Put simply, transient prevents the member fields it modifies from being serialized.

What is serialization in Java? Serializing an object means converting it into a sequence of bytes that contains the object's data and type information. A serialized object can be written to a database or a file, or sent over the network. When we use a cache (which may spill to the local disk when memory runs short) or a remote procedure call (RPC, which goes over the network), we usually make our entity classes implement the Serializable interface. The point of serializing is, of course, to be able to deserialize later: deserialization is the process of restoring the byte sequence to the original Java object.

When does a field not need to be serialized? One case is a field whose value can be derived from other fields. For example, a rectangle class has three attributes: length, width and area. The area can be computed from the other two, so it does not need to be serialized.

Why skip serialization at all? Mainly to save storage space. The cost is that some fields have to be recomputed or re-initialized after deserialization, but in general the advantages outweigh the disadvantages. A simple example shows the effect:

package tmp;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class Rectangle implements Serializable{

    /**
     *
     */
    private static final long serialVersionUID = 1710022455003682613L;
    private Integer width;
    private Integer height;
    private transient Integer area;



    public Rectangle (Integer width, Integer height){
        this.width = width;
        this.height = height;
        this.area = width * height;
    }

    public void setArea(){
        this.area = this.width * this.height;
    }

    @Override
    public String toString(){
        StringBuffer sb = new StringBuffer(40);
        sb.append("width : ");
        sb.append(this.width);
        sb.append("\nheight : ");
        sb.append(this.height);
        sb.append("\narea : ");
        sb.append(this.area);
        return sb.toString();
    }
}

public class TransientExample{
    public static void main(String args[]) throws Exception {
        Rectangle rectangle = new Rectangle(3,4);
        System.out.println("1.Original object\n"+rectangle);
        ObjectOutputStream o = new ObjectOutputStream(new FileOutputStream("rectangle"));
        // Write object to stream
        o.writeObject(rectangle);
        o.close();

        // Read object from stream
        ObjectInputStream in = new ObjectInputStream(new FileInputStream("rectangle"));
        Rectangle rectangle1 = (Rectangle)in.readObject();
        System.out.println("2.Deserialized object\n"+rectangle1);
        rectangle1.setArea();
        System.out.println("3.Restore to original object\n"+rectangle1);
        in.close();
    }
}

Output:

1.Original object
width : 3
height : 4
area : 12
2.Deserialized object
width : 3
height : 4
area : null
3.Restore to original object
width : 3
height : 4
area : 12

In general, transient is used to save space, and since Hive deals with massive data, saving space matters. Next come two new classes, ExprNodeEvaluator and ObjectInspector. Both are described in the API documentation on the Apache website. We will refer to it in detail when we need their methods or fields.

Class constructor method PTFExpressionDef

  public PTFExpressionDef() {}
  public PTFExpressionDef(PTFExpressionDef e) {
    expressionTreeString = e.getExpressionTreeString();
    exprNode = e.getExprNode();
    exprEvaluator = e.getExprEvaluator();
    OI = e.getOI();
  }

There are two constructors. One is an empty constructor; the other takes a PTFExpressionDef and copies each of its four fields into the new object. The fields are copied by reference, so the new object shares exprNode, exprEvaluator and OI with the object passed in.
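What a copy constructor of this kind does and does not copy can be shown on a small stand-in. The class Holder and its fields are invented for this sketch; only the pattern (assigning the other object's references one by one) follows PTFExpressionDef.

```java
import java.util.ArrayList;
import java.util.List;

public class ShallowCopyDemo {
    // Hypothetical holder whose copy constructor copies references only.
    static class Holder {
        String label;
        List<String> children;

        Holder() {}

        Holder(Holder other) {
            label = other.label;
            children = other.children; // same list object, not a clone
        }
    }

    public static void main(String[] args) {
        Holder original = new Holder();
        original.label = "expr";
        original.children = new ArrayList<>();

        Holder copy = new Holder(original);
        copy.children.add("added through the copy");

        System.out.println(original.children == copy.children); // true
        System.out.println(original.children.size());           // 1
    }
}
```

A change made through the copy is visible through the original, because both hold the same list. For immutable values such as String this sharing is harmless.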

getter and setter methods of parameter expressionTreeString

  public String getExpressionTreeString() {
    return expressionTreeString;
  }
  public void setExpressionTreeString(String expressionTreeString) {
    this.expressionTreeString = expressionTreeString;
  }

This is a common getter and setter method for getting and setting parameters.

getter and setter methods of parameter exprNode

  public ExprNodeDesc getExprNode() {
    return exprNode;
  }
  public void setExprNode(ExprNodeDesc exprNode) {
    this.exprNode = exprNode;
  }

This is a common getter and setter method for getting and setting parameters.

Method getExprNodeExplain

  @Explain(displayName = "expr", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
  public String getExprNodeExplain() {
    return exprNode == null ? null : exprNode.getExprString();
  }

@Explain here is an annotation. It has a displayName element of type String and an explainLevels element that is an array of Explain.Level. In this case displayName is "expr" and explainLevels lists Level.USER, Level.DEFAULT and Level.EXTENDED, so the value returned by this method is meant to be shown under the name expr at all three levels.

The method itself checks whether exprNode is null. If it is, it returns null; otherwise it calls getExprString(), which in the ExprNodeDesc class returns a String.
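An annotation by itself does nothing; some other code has to read it. The sketch below shows the general Java mechanism with a made-up annotation, Shown, that has a name and a list of levels, and a helper, explain, that finds annotated getters through reflection. None of these names come from Hive, and this is not a description of how Hive's own EXPLAIN code is written.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

public class ExplainAnnotationDemo {
    // Hypothetical levels, loosely modeled on Explain.Level.
    public enum Level { USER, DEFAULT, EXTENDED }

    // Hypothetical annotation; RUNTIME retention makes it visible to reflection.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface Shown {
        String displayName();
        Level[] levels();
    }

    public static class Expr {
        @Shown(displayName = "expr", levels = { Level.USER, Level.DEFAULT })
        public String getText() {
            return "a + b";
        }

        public String getInternal() {
            return "not shown";
        }
    }

    // Builds "displayName: value" for each annotated getter that lists the given level.
    static String explain(Object target, Level level) {
        StringBuilder out = new StringBuilder();
        for (Method m : target.getClass().getMethods()) {
            Shown shown = m.getAnnotation(Shown.class);
            if (shown == null) {
                continue; // method is not annotated
            }
            for (Level l : shown.levels()) {
                if (l == level) {
                    try {
                        out.append(shown.displayName()).append(": ").append(m.invoke(target));
                    } catch (ReflectiveOperationException e) {
                        throw new IllegalStateException(e);
                    }
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(explain(new Expr(), Level.USER));     // expr: a + b
        System.out.println(explain(new Expr(), Level.EXTENDED)); // prints an empty line
    }
}
```

The getter is only reported at the levels its annotation lists, which is the role the explainLevels array plays in the declaration above.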

getter and setter methods of parameter exprEvaluator

  public ExprNodeEvaluator getExprEvaluator() {
    return exprEvaluator;
  }
  public void setExprEvaluator(ExprNodeEvaluator exprEvaluator) {
    this.exprEvaluator = exprEvaluator;
  }

This is a common getter and setter method for getting and setting parameters.

getter and setter methods of parameter OI

  public ObjectInspector getOI() {
    return OI;
  }
  public void setOI(ObjectInspector oI) {
    OI = oI;
  }

This is a common getter and setter method for getting and setting parameters.

So far, the whole PTFExpressionDef.java file has been parsed.

Summary

Through this week's study, I have learned more about the underlying logic and have a deeper understanding of Hive. I hope to continue to learn new knowledge in the next week's study.

Posted by lilRachie on Sun, 28 Nov 2021 04:10:18 -0800