Development and Use of a Hive UDF

Keywords: Big Data, Hive, Apache Hadoop, log4j

Recently a data-mining requirement came up: counting the number of things within n kilometres of a given longitude and latitude. Since this boils down to calculating the distance between two points on the Earth, a UDF had to be written to do that calculation.

I. UDF Writing

There are many ways on the Internet to calculate the distance between two points from their latitude and longitude. After trying several, I found a blog post whose method has a relatively small error and whose derivation is very detailed, so that method is adopted here.
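For reference, the method used in distanceCal below is the equirectangular approximation: treat the neighbourhood of the two points as a plane, scale the east-west difference by the cosine of the mean latitude, and apply the Pythagorean theorem with an Earth radius of roughly 6,367,000 m:

$$d \approx R\,\sqrt{(\Delta\lambda \cos\bar{\varphi})^2 + (\Delta\varphi)^2}, \qquad R = 6\,367\,000\ \text{m}$$

where $\Delta\lambda$ and $\Delta\varphi$ are the longitude and latitude differences in radians and $\bar{\varphi}$ is the mean latitude. This is accurate for short distances like the couple of kilometres needed here, and degrades for points far apart or near the poles.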

import com.ai.hive.udf.topdomain.StringUtil;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import org.apache.log4j.Logger;

/**
 * Function: calculate the distance between two points from their longitude and latitude.
 * create temporary function LocDistanceCalUDF as 'com.ai.hive.udf.util.LocDistanceCalUDF';
 * @author olicity
 */
public class LocDistanceCalUDF extends UDF {
    private static Logger log = Logger.getLogger(LocDistanceCalUDF.class);

    private Text nullText = new Text("");

    /**
     * Calculate the distance (in metres) between two points on the Earth from their longitude and latitude.
     */
    private static double distanceCal(double lng1, double lat1, double lng2, double lat2) {
        double dx = lng1 - lng2;            // longitude difference
        double dy = lat1 - lat2;            // latitude difference
        double b = (lat1 + lat2) / 2.0;     // mean latitude
        double Lx = Math.toRadians(dx) * 6367000.0 * Math.cos(Math.toRadians(b)); // east-west distance
        double Ly = 6367000.0 * Math.toRadians(dy);                               // north-south distance
        // Total distance (m): treat the east-west and north-south distances as the
        // two legs of a right triangle and take the diagonal.
        return Math.sqrt(Lx * Lx + Ly * Ly);
    }

    /**
     * Overloaded evaluate method: the four coordinates passed as separate values.
     */
    public Text evaluate(Text longitudeText1, Text latitudeText1, Text longitudeText2, Text latitudeText2) {
        try {
            if (longitudeText1 == null || latitudeText1 == null || longitudeText2 == null || latitudeText2 == null) {
                return nullText;
            }
            if (StringUtil.isEmpty(longitudeText1.toString()) || StringUtil.isEmpty(latitudeText1.toString())
                    || StringUtil.isEmpty(longitudeText2.toString()) || StringUtil.isEmpty(latitudeText2.toString())) {
                return nullText;
            }
            double lng1 = Double.valueOf(longitudeText1.toString());
            double lat1 = Double.valueOf(latitudeText1.toString());
            double lng2 = Double.valueOf(longitudeText2.toString());
            double lat2 = Double.valueOf(latitudeText2.toString());

            double dis = distanceCal(lng1, lat1, lng2, lat2);
            return new Text(String.valueOf(dis));
        } catch (Exception e) {
            return nullText;
        }
    }

    /**
     * Overloaded evaluate method: each point passed as a single "lng,lat" string.
     */
    public Text evaluate(Text locationA, Text locationB) {
        try {
            if (locationA == null || locationB == null) {
                return nullText;
            }
            if (StringUtil.isEmpty(locationA.toString()) || StringUtil.isEmpty(locationB.toString())) {
                return nullText;
            }
            String locationA2String = locationA.toString();
            String locationB2String = locationB.toString();
            double lng1 = Double.valueOf(locationA2String.split(",")[0]);
            double lat1 = Double.valueOf(locationA2String.split(",")[1]);
            double lng2 = Double.valueOf(locationB2String.split(",")[0]);
            double lat2 = Double.valueOf(locationB2String.split(",")[1]);

            double dis = distanceCal(lng1, lat1, lng2, lat2);
            return new Text(String.valueOf(dis));
        } catch (Exception e) {
            return nullText;
        }
    }

}

A UDF class inherits from org.apache.hadoop.hive.ql.exec.UDF and implements one or more evaluate methods. When the custom UDF is used in Hive, Hive finds and calls the matching evaluate method in the class to do the actual work.
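Because evaluate is an ordinary public method, the class can be sanity-checked locally before deploying the jar. A minimal sketch (the test class name and coordinates are my own illustrative choices, not from the original post):

import org.apache.hadoop.io.Text;

public class LocDistanceCalUDFTest {
    public static void main(String[] args) {
        LocDistanceCalUDF udf = new LocDistanceCalUDF();

        // Four-argument form: lng1, lat1, lng2, lat2 as separate strings.
        Text d1 = udf.evaluate(new Text("116.40"), new Text("39.90"),
                               new Text("116.41"), new Text("39.91"));
        System.out.println("distance (m): " + d1);

        // Two-argument form: each point as a single "lng,lat" string.
        Text d2 = udf.evaluate(new Text("116.40,39.90"), new Text("116.41,39.91"));
        System.out.println("distance (m): " + d2);
    }
}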

II. UDF Import

1. Upload the jar package

Right-click the class name, choose Copy Reference, and copy the fully qualified name com.ai.hive.udf.util.LocDistanceCalUDF. Package the compiled classes into a jar and upload it to the server, to a path such as /user/olicity/hive/UDF.
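Assuming a Maven project (my assumption; adjust to your build tool, jar name, and server address), packaging and uploading might look like this:

# build the jar (assumes Maven produces target/udf.jar)
mvn clean package
# copy it to the server path used below (placeholder host)
scp target/udf.jar olicity@<server>:/user/olicity/hive/UDF/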

2. Add the jar package to the classpath

Enter the Hive CLI and add the jar by executing:

add jar /user/olicity/hive/UDF/udf.jar;

View the imported jar package

list jars;

3. Create the function

Create a temporary function called LocDistanceCalUDF and associate it with the class in the jar:

create temporary function LocDistanceCalUDF as 'com.ai.hive.udf.util.LocDistanceCalUDF';

View the created functions

show functions;
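At this point the function can be sanity-checked straight from the Hive CLI (illustrative coordinates; the result is the distance in metres, returned as a string):

select LocDistanceCalUDF('116.40,39.90', '116.41,39.91');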

4. Notes

The above method only lasts for the current session. If you need a permanent function tied to a permanent path, create it like this:

create function locUDF.LocDistanceCalUDF
  as 'com.ai.hive.udf.util.LocDistanceCalUDF'
  using jar 'hdfs://<hdfs path>/udf.jar';
use locUDF;

You need to put the jar on HDFS first and then create the function against that path; the function is registered in the locUDF database and can also be called from other databases as locUDF.LocDistanceCalUDF.
Another approach you may come across is configuring hive.aux.jars.path in the hive-site.xml file. The configuration looks like this:

   <property>
       <name>hive.aux.jars.path</name>
       <value>file:///home/hdfs/fangjs/DefTextInputFormat.jar,file:///jarpath/test.jar</value>
   </property>

Note that jars configured this way are only picked up when Hive starts, so the service has to be restarted after changing the property.

III. Use of UDF

With the preparation done, we can start querying the tables. emmmmm, I will just do a simple two-table query; the table structures and data are not shown and no sample tables are built. Call the table of given longitude/latitude points table A, the table to be searched table B, and the temporary intermediate table table C. In both tables the longitude/latitude field is named loc, and take the distance threshold to be 2 kilometres. (Note that the UDF returns the distance as a string, which Hive implicitly casts to a number for the comparison below.)

create table C as
select B.*
from B join A
where LocDistanceCalUDF(A.loc, B.loc) <= 2000;

OK.
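One caveat: a join with no on condition is effectively a cross join, so every row of B is compared with every row of A. If that turns out to be too slow, one option (my suggestion, not part of the original post) is to pre-filter with a coarse bounding box so the UDF only runs on nearby candidate pairs. A sketch, assuming loc is a "lng,lat" string; one degree of latitude is about 111 km, so 2 km is roughly 0.02 degrees, and the longitude window is widened a little, which holds at mid-latitudes but not near the poles:

create table C as
select B.*
from B join A
where abs(cast(split(A.loc, ',')[0] as double) - cast(split(B.loc, ',')[0] as double)) <= 0.03
  and abs(cast(split(A.loc, ',')[1] as double) - cast(split(B.loc, ',')[1] as double)) <= 0.02
  and LocDistanceCalUDF(A.loc, B.loc) <= 2000;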

IV. Summary

More practice is needed with SQL statements, especially multi-table joins. There is still a long way to go with Hadoop.
