Hands-on user portrait (user profiling) project based on Spark

Keywords: IT

 
Reference video tutorial: **Big data DMP user portrait system**
Adapted from Qianfeng instructor Wang Su

1. Introduction to user portrait project

1.1 what is user portrait

A user portrait labels a user in order to describe what kind of person that user is.

Specifically, after attaching labels to users, we divide them into different types according to differences in their goals, behaviors and opinions, extract the typical features of each type, and give each type a name, a photo, some demographic attributes, usage scenarios and other descriptions, forming a persona.

In short, the process abstracts customer information into a user portrait, and from it builds an abstract understanding of the customer.

1.2 main dimensions of user portrait

Demographic attributes: who the user is (basic information such as gender, age, occupation)

Consumer demand: consumption habits and preferences

Purchasing power: income, purchasing power, purchase frequency and channels

Hobbies and interests: brand preferences, personal interests

Social attributes: the scenarios in which the user is active (social media, etc.)

1.3 data type of user portrait

The data includes dynamic data and static data. Static data, such as gender and age, does not change in the short term; dynamic data relates to short-term behavior, for example, today I want to buy a skirt and tomorrow I look at trousers, so these features change frequently.

1.4 purpose of user portrait

Price discrimination against existing customers ("killing the familiar"), recommendation (very common; user portraits are an important data source for recommendation systems), marketing, and customer service.

The goal is a win-win for users and the enterprise: users quickly find the goods they want, and the enterprise finds the people willing to pay for its products.

(1) Micro level

In product design, users' needs are described through user portraits.

In data application, it can be used for recommendation, search and risk control

Combine qualitative analysis and quantitative analysis to conduct data operation and user analysis

Precision marketing

(2) Macro level

Determine the strategic and tactical direction of development

Market segmentation and user clustering, market-oriented

(3) Portrait modeling and prediction

Subdivide population attributes: clarify who, what and why

Purchase Behavior Segmentation: provide key information such as market opportunities and market scale

Product demand segmentation: provide more differentiated and competitive product specifications and business value

Interest and attitude segmentation: provide group category portraits: channel strategy, pricing strategy, product strategy and brand strategy

1.5 steps of user portrait

(1) Determine the target of the portrait

In different product life cycles, or for different usage patterns, the goals and needs of a portrait differ, so before building the portrait we must clarify what the goals and needs are.

(2) Determine the dimensions of the desired user portrait

Determine the dimensions required for the user portrait according to the objectives. For example, to recommend goods you need the factors that influence a user's choice of goods as portrait dimensions: the user dimension (age and gender affect the user's choice), the asset dimension (income and similar factors affect the acceptable price), the behavior dimension (what the user has browsed recently suggests what they want to buy), and so on.

(3) Determine the level of the portrait

The more user portrait levels, the smaller the portrait granularity and the clearer the understanding of users. For example, the user dimension can be divided into new users and old users, and then into users' gender, age, etc. This needs to be divided according to the target requirements.

(4) Use machine learning algorithms on the raw data to label users

Because the raw data we receive is unorganized, we need algorithms to label users based on certain features.

(5) Transform the tags into business output through machine learning algorithms

Each person will have many tags, which must be further turned into an understanding of the user. Different weights are assigned to different tags to produce the business output, for example, which products users with a given set of tags are likely to prefer.

(6) The business generates new data, which feeds back into the portraits, forming a closed loop

1.6 common user portrait labels


2. System architecture

2.1 overall architecture (offline projects)

2.2 data processing flow (what to do)

ETL (Extract, Transform, Load) describes the process of extracting data from the source, transforming it, and loading it into the destination;

ODS (Operational Data Store): the data in this layer is kept unchanged, directly mirroring the structure and data of the upstream systems, and is not exposed externally; it is a temporary staging layer that holds interface data in preparation for the next processing step.

Main steps to be realized: ETL, report statistics (data analysis), generation of business district library, data tagging (core)

2.3 description of main data set (log data file to be analyzed)

  • For the integrated log data, one copy per day in json format (offline processing)
  • This data set integrates internal and external data, as well as bidding information (related to advertising)
  • There are many columns of data, close to 100
| Field | Description |
| --- | --- |
| ip | Real IP of the device |
| sessionid | Session ID |
| advertisersid | Advertiser ID |
| adorderid | Ad ID |
| adcreativeid | Ad creative ID (>= 200000: DSP, < 200000: OSS) |
| adplatformproviderid | Ad platform provider ID (>= 100000: RTB, < 100000: API) |
| sdkversionnumber | SDK version number |
| adplatformkey | Platform provider key |
| putinmodeltype | Advertiser delivery mode (1: impression-based, 2: click-based) |
| requestmode | Data request mode (1: request, 2: display, 3: click) |
| adprice | Ad price |
| adppprice | Platform provider price |
| requestdate | Request time, format: yyyy-MM-dd HH:mm:ss |
| appid | Application ID |
| appname | Application name |
| uuid | Unique device identifier, e.g. IMEI or Android ID |
| device | Device model, e.g. HTC, iPhone |
| client | Device type (1: Android, 2: iOS, 3: WP) |
| osversion | Device OS version, e.g. 4.0 |
| density | Device screen density (Android: 0.75, 1, 1.5; iOS: 1, 2) |
| pw | Device screen width |
| ph | Device screen height |
| provincename | Name of the province where the device is located |
| cityname | Name of the city where the device is located |
| ispid | Carrier (ISP) ID |
| ispname | Carrier (ISP) name |
| networkmannerid | Network connection type ID |
| networkmannername | Network connection type name |
| iseffective | Whether valid, i.e. billable (0: invalid, 1: valid) |
| isbilling | Whether billed (0: not billed, 1: billed) |
| adspacetype | Ad slot type (1: banner, 2: interstitial, 3: full screen) |
| adspacetypename | Ad slot type name (banner, interstitial, full screen) |
| devicetype | Device type (1: mobile phone, 2: tablet) |
| processnode | Process node (1: request-count KPI, 2: valid request, 3: ad request) |
| apptype | Application type ID |
| district | Name of the county/district where the device is located |
| paymode | Platform provider payment mode (1: impression-based (CPM), 2: click-based (CPC)) |
| isbid | Whether an RTB bid was made |
| bidprice | RTB bid price |
| winprice | Winning RTB bid price |
| iswin | Whether the bid was won |
| cur | Currency, values: usd, rmb, etc. |
| rate | Exchange rate |
| cnywinprice | Winning RTB bid price converted to RMB |
| imei | Device IMEI |
| mac | Device MAC address |
| idfa | Advertising identifier (IDFA) of the mobile app |
| openudid | Apple device ID (OpenUDID) |
| androidid | Android device ID |
| rtbprovince | RTB province |
| rtbcity | RTB city |
| rtbdistrict | RTB district |
| rtbstreet | RTB street |
| storeurl | App market download URL |
| realip | Real IP |
| isqualityapp | Preferred (premium) app flag |
| bidfloor | Floor price |
| aw | Ad slot width |
| ah | Ad slot height |
| imeimd5 | imei_md5 |
| macmd5 | mac_md5 |
| idfamd5 | idfa_md5 |
| openudidmd5 | openudid_md5 |
| androididmd5 | androidid_md5 |
| imeisha1 | imei_sha1 |
| macsha1 | mac_sha1 |
| idfasha1 | idfa_sha1 |
| openudidsha1 | openudid_sha1 |
| androididsha1 | androidid_sha1 |
| uuidunknow | Unknown UUID ciphertext |
| userid | Platform user ID |
| iptype | IP library type (1: media IP library, 2: Advertising Association standard IP geolocation library; default 1) |
| initbidprice | Initial bid price |
| adpayment | Ad consumption after conversion (6 decimal places) |
| agentrate | Agent profit margin |
| lomarkrate | Agency profit margin |
| adxrate | Media profit margin |
| title | Title |
| keywords | Keywords |
| tagid | Ad slot identifier (for video traffic, the value is the video ID) |
| callbackdate | Callback time, format: YYYY/mm/dd hh:mm:ss |
| channelid | Channel ID |
| mediatype | Media type |
| email | User email |
| tel | User telephone number |
| sex | User gender |
| age | User age |
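
To get a quick feel for this dataset, the log file can be loaded as a Spark DataFrame and a few of the columns above inspected. A minimal sketch, assuming the JSON log sits at data/data.json (the path used by the ETL code later in this article):

import org.apache.spark.sql.SparkSession

object DataPeek {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("DataPeek").getOrCreate()
    val df = spark.read.json("data/data.json")
    df.printSchema()   // close to 100 columns
    df.select("ip", "requestmode", "processnode", "adspacetype", "provincename", "cityname").show(5, truncate = false)
    spark.stop()
  }
}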

3. Create the project [ key points of the second day ]

3.1 steps

  • Create a Maven project to process data

Maven manages all the jars used in the project.
Modify the pom.xml file and add:

  • Define dependent version
  • Import dependency
  • Define the configuration file
  • Create scala directories (in src and test respectively)
  • Create the package cn.itbigdata.dmp in the scala directory

  • Write a main program architecture (DMPApp)

  • Add configuration file dev/application.conf

  • Add the directory utils and the parameter resolution class ConfigHolder

3.2 modify pom.xml file [ key ]

  1. Set dependency version information
   <properties>
       <scala.version>2.11.8</scala.version>
       <scala.version.simple>2.11</scala.version.simple>
       <hadoop.version>2.6.1</hadoop.version>
       <spark.version>2.3.3</spark.version>
       <hive.version>1.1.0</hive.version>
       <fastjson.version>1.2.44</fastjson.version>
       <geoip.version>1.3.0</geoip.version>
       <geoip2.version>2.12.0</geoip2.version>
       <config.version>1.2.1</config.version>
   </properties>
   
  2. Import compute engine dependencies
        <!-- scala -->
           <dependency>
               <groupId>org.scala-lang</groupId>
               <artifactId>scala-library</artifactId>
               <version>${scala.version}</version>
           </dependency>
           <dependency>
               <groupId>org.scala-lang</groupId>
               <artifactId>scala-xml</artifactId>
               <version>2.11.0-M4</version>
           </dependency>
           <!-- hadoop -->
           <dependency>
               <groupId>org.apache.hadoop</groupId>
               <artifactId>hadoop-client</artifactId>
               <version>${hadoop.version}</version>
           </dependency>
           <!-- spark core -->
           <dependency>
               <groupId>org.apache.spark</groupId>
               <artifactId>spark-core_${scala.version.simple}</artifactId>
               <version>${spark.version}</version>
           </dependency>
           <!-- spark sql -->
           <dependency>
               <groupId>org.apache.spark</groupId>
               <artifactId>spark-sql_${scala.version.simple}</artifactId>
               <version>${spark.version}</version>
           </dependency>
           <!-- spark graphx -->
           <dependency>
               <groupId>org.apache.spark</groupId>
               <artifactId>spark-graphx_${scala.version.simple}</artifactId>
               <version>${spark.version}</version>
           </dependency>
   
  3. Import storage engine dependencies (can be omitted)
   <dependency>
       <groupId>org.apache.hive</groupId>
       <artifactId>hive-jdbc</artifactId>
       <exclusions>
           <exclusion>
               <groupId>org.apache.hive</groupId>
               <artifactId>hive-service-rpc</artifactId>
           </exclusion>
           <exclusion>
               <groupId>org.apache.hive</groupId>
               <artifactId>hive-service</artifactId>
           </exclusion>
       </exclusions>
       <version>${hive.version}</version>
   </dependency>
   
  4. Import tool dependencies
   <!-- used for IP address translation (longitude, latitude) -->
   <dependency>
       <groupId>com.maxmind.geoip</groupId>
       <artifactId>geoip-api</artifactId>
       <version>${geoip.version}</version>
   </dependency>
   <dependency>
       <groupId>com.maxmind.geoip2</groupId>
       <artifactId>geoip2</artifactId>
       <version>${geoip2.version}</version>
   </dependency>

   <!-- Convert latitude and longitude to a geohash code -->
   <dependency>
       <groupId>ch.hsr</groupId>
       <artifactId>geohash</artifactId>
       <version>${geoip.version}</version>
   </dependency>

   <!-- JSON parsing for Scala -->
   <dependency>
    <groupId>org.json4s</groupId>
    <artifactId>json4s-jackson_${scala.version.simple}</artifactId>
       <version>3.6.5</version>
   </dependency>

   <!-- Manage configuration files -->
   <dependency>
       <groupId>com.typesafe</groupId>
       <artifactId>config</artifactId>
       <version>${config.version}</version>
   </dependency>
   
  5. Import build configuration (can be omitted)
      <build>
           <plugins>
               <plugin>
                   <groupId>org.apache.maven.plugins</groupId>
                   <artifactId>maven-compiler-plugin</artifactId>
                   <configuration>
                       <source>1.8</source>
                       <target>1.8</target>
                   </configuration>
               </plugin>

               <plugin>
                   <artifactId>maven-assembly-plugin</artifactId>
                   <executions>
                       <execution>
                           <phase>package</phase>
                           <goals>
                               <goal>single</goal>
                           </goals>
                       </execution>
                   </executions>
                   <configuration>
                       <archive>
                           <manifest>
                               <addClasspath>true</addClasspath>
                               <mainClass>cn.itcast.dmp.processing.App</mainClass>
                           </manifest>
                       </archive>
                       <descriptorRefs>
                           <descriptorRef>jar-with-dependencies</descriptorRef>
                       </descriptorRefs>
                   </configuration>
               </plugin>
           </plugins>
       </build>
   

3.3 creating scala code packages



  • Create a scala code package under src/main /

  • Create cn.itbigdata.dmp package in scala package

  • Create the following packages in cn.itbigdata.dmp:

  • beans (class definitions)

  • etl (ETL processing)
  • report (report processing)
  • tradingarea (business district library)
  • tags (tag processing)
  • customtrait (trait/interface definitions)
  • utils (utility classes)
  • Create a data directory at the project root to store the data to be processed

3.4 DmpApp main program (initialization part)

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// The main program of the project, where relevant tasks are completed
object DmpApp {
  def main(args: Array[String]): Unit = {
    // initialization
    val conf: SparkConf = new SparkConf()
      .setMaster("local")
      .setAppName("DmpApp")
      .set("spark.worker.timeout", "600s")
      .set("spark.cores.max", "10")
      .set("spark.rpc.askTimeout", "600s")
      .set("spark.network.timeout", "600s")
      .set("spark.task.maxFailures", "5")
      .set("spark.speculation", "true")
      .set("spark.driver.allowMultipleContexts", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.buffer.pageSize", "8m")
      .set("park.debug.maxToStringFields", "200")


    val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()

    // close resource
    spark.close()
  }
}

3.5 explanation of spark related parameters

| Parameter name | Default value | Value used |
| --- | --- | --- |
| spark.worker.timeout | 60 | 500 |
| spark.network.timeout | 120s | 600s |
| spark.rpc.askTimeout | spark.network.timeout | 600s |
| spark.cores.max | (not set) | 10 |
| spark.task.maxFailures | 4 | 5 |
| spark.speculation | false | true |
| spark.driver.allowMultipleContexts | false | true |
| spark.serializer | org.apache.spark.serializer.JavaSerializer | org.apache.spark.serializer.KryoSerializer |
| spark.buffer.pageSize | 1M-64M, computed by the system | 8M |
  • spark.worker.timeout: if a worker fails to report heartbeats to the master for too long (for example due to a network failure), after spark.worker.timeout seconds the master marks the worker as DEAD and removes the worker and its executor information;
  • spark.network.timeout: the default timeout for all network interactions. If, because of network issues or GC, heartbeat feedback from an executor or task is not received, increase spark.network.timeout to 300s (5 min) or higher as appropriate;
  • spark.rpc.askTimeout: timeout of RPC calls;
  • spark.cores.max: the maximum number of CPU cores each application can request;
  • spark.task.maxFailures: a failed task does not immediately bring down the whole application; the application only fails after a task has failed spark.task.maxFailures times;
  • spark.speculation: speculative execution means that for a slow-running task in a stage, the same task is launched again on executors of other nodes; whichever instance finishes first provides the final result and the other instances are killed, which speeds up the job;
  • spark.driver.allowMultipleContexts: by default only one SparkContext instance is allowed; if set to true, multiple instances are allowed;
  • spark.serializer: in Spark, objects sent over the network or cached in memory or on disk need to be serialized (objects sent to tasks on executors; RDDs to be cached, provided serialized caching is used; broadcast variables; data cached during shuffle, etc.). The default Java serialization is slow and produces large serialized output; the Kryo serialization library is officially recommended, and according to the documentation it is roughly 10 times faster than Java serialization;
  • spark.buffer.pageSize: the unit of Spark memory allocation. There is no default value; the size is between 1M and 64M and Spark computes it from the JVM heap size. Too small a value makes memory allocation inefficient; too large a value wastes memory.

3.6 development environment parameter configuration file

application.conf

// Development environment configuration file
# App information
spark.appname="dmpApp"

# spark information
spark.master="local[*]"
spark.worker.timeout="120"
spark.cores.max="10"
spark.rpc.askTimeout="600s"
spark.network.timeout="600s"
spark.task.maxFailures="5"
spark.speculation="true"
spark.driver.allowMultipleContexts="true"
spark.serializer="org.apache.spark.serializer.KryoSerializer"
spark.buffer.pageSize="8m"

# kudu information
kudu.master="node1:7051,node2:7051,node3:7051"

# Input data information
addata.path="data/dataset_main.json"
ipdata.geo.path="data/dataset_geoLiteCity.dat"
qqwrydat.path="data/dataset_qqwry.dat"
installDir.path="data"

# Corresponding ETL output information
ods.prefix="ods"
ad.data.tablename="adinfo"

# Output reports, corresponding to 7 analyses: regional statistics, ad region, APP, device, network, carrier and channel
report.region.stat.tablename="RegionStatAnalysis"
report.region.tablename="AdRegionAnalysis"
report.app.tablename="AppAnalysis"
report.device.tablename="DeviceAnalysis"
report.network.tablename="NetworkAnalysis"
report.isp.tablename="IspAnalysis"
report.channel.tablename="ChannelAnalysis"

# Gaode API
gaoDe.app.key="a94274923065a14222172c9b933f4a4e"
gaoDe.url="https://restapi.amap.com/v3/geocode/regeo?"

# Geohash (length of key)
geohash.key.length=10

# Business District Library
trading.area.tablename="tradingArea"

# tags
non.empty.field="imei,mac,idfa,openudid,androidid,imeimd5,macmd5,idfamd5,openudidmd5,androididmd5,imeisha1,macsha1,idfasha1,openudidsha1,androididsha1"
appname.dic.path="data/dic_app"
device.dic.path="data/dic_device"
tags.table.name.prefix="tags"

# Label attenuation factor
tag.coeff="0.92"

# es related parameters
es.cluster.name="cluster_es"
es.index.auto.create="true"
es.Nodes="192.168.40.164"
es.port="9200"
es.index.reads.missing.as.empty="true"
es.nodes.discovery="false"
es.nodes.wan.only="true"
es.http.timeout="2000000"

3.7 configuration file parsing class

// Parsing parameter file help class
import com.typesafe.config.ConfigFactory

object ConfigHolder {
  private val config = ConfigFactory.load()
  //  App Info
  lazy val sparkAppName: String = config.getString("spark.appname")

  // Spark parameters
  lazy val sparkMaster: String = config.getString("spark.master")

  lazy val sparkParameters: List[(String, String)] = List(
    ("spark.worker.timeout", config.getString("spark.worker.timeout")),
    ("spark.cores.max", config.getString("spark.cores.max")),
    ("spark.rpc.askTimeout", config.getString("spark.rpc.askTimeout")),
    ("spark.network.timeout", config.getString("spark.network.timeout")),
    ("spark.task.maxFailures", config.getString("spark.task.maxFailures")),
    ("spark.speculation", config.getString("spark.speculation")),
    ("spark.driver.allowMultipleContexts", config.getString("spark.driver.allowMultipleContexts")),
    ("spark.serializer", config.getString("spark.serializer")),
    ("spark.buffer.pageSize", config.getString("spark.buffer.pageSize"))
  )

  // kudu parameters
  lazy val kuduMaster: String = config.getString("kudu.master")

  // input dataset
  lazy val adDataPath: String = config.getString("addata.path")
  lazy val ipsDataPath: String = config.getString("ipdata.geo.path")
  def ipToRegionFilePath: String = config.getString("qqwrydat.path")
  def installDir: String = config.getString("installDir.path")

  // output dataset
  private lazy val delimiter = "_"
  private lazy val odsPrefix: String = config.getString("ods.prefix")
  private lazy val adInfoTableName: String = config.getString("ad.data.tablename")
//  lazy val ADMainTableName = s"$odsPrefix$delimiter$adInfoTableName$delimiter${DateUtils.getTodayDate()}"

  // report
  lazy val Report1RegionStatTableName: String = config.getString("report.region.stat.tablename")
  lazy val ReportRegionTableName: String = config.getString("report.region.tablename")
  lazy val ReportAppTableName: String = config.getString("report.app.tablename")
  lazy val ReportDeviceTableName: String = config.getString("report.device.tablename")
  lazy val ReportNetworkTableName: String = config.getString("report.network.tablename")
  lazy val ReportIspTableName: String = config.getString("report.isp.tablename")
  lazy val ReportChannelTableName: String = config.getString("report.channel.tablename")

  // GaoDe API
  private lazy val gaoDeKey: String = config.getString("gaoDe.app.key")
  private lazy val gaoDeTempUrl: String = config.getString("gaoDe.url")
  lazy val gaoDeUrl: String = s"$gaoDeTempUrl&key=$gaoDeKey"

  // GeoHash
  lazy val keyLength: Int = config.getInt("geohash.key.length")

  // Business District Library
  lazy val tradingAreaTableName: String =config.getString("trading.area.tablename")

  // tags
  lazy val idFields: String = config.getString("non.empty.field")
  lazy val filterSQL: String = idFields
    .split(",")
    .map(field => s"$field is not null ")
    .mkString(" or ")
  lazy val appNameDic: String = config.getString("appname.dic.path")
  lazy val deviceDic: String = config.getString("device.dic.path")
  lazy val tagsTableNamePrefix: String = config.getString("tags.table.name.prefix") + delimiter
  lazy val tagCoeff: Double = config.getDouble("tag.coeff")

  // Load elasticsearch related parameters
  lazy val ESSparkParameters = List(
    ("cluster.name", config.getString("es.cluster.name")),
    ("es.index.auto.create", config.getString("es.index.auto.create")),
    ("es.nodes", config.getString("es.Nodes")),
    ("es.port", config.getString("es.port")),
    ("es.index.reads.missing.as.empty", config.getString("es.index.reads.missing.as.empty")),
    ("es.nodes.discovery", config.getString("es.nodes.discovery")),
    ("es.nodes.wan.only", config.getString("es.nodes.wan.only")),
    ("es.http.timeout", config.getString("es.http.timeout"))
  )

  def main(args: Array[String]): Unit = {
    println(ConfigHolder.sparkParameters)
    println(ConfigHolder.installDir)
  }
}

3.8 DmpApp main program (using configuration file)

import cn.itbigdata.dmp.utils.ConfigHolder
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object DmpApp {
  def main(args: Array[String]): Unit = {
    // 1. Initialization (SparkConf, SparkSession)
    val conf = new SparkConf()
      .setAppName(ConfigHolder.sparkAppName)
      .setMaster(ConfigHolder.sparkMaster)
      .setAll(ConfigHolder.sparkParameters)

    val spark: SparkSession = SparkSession.builder()
      .config(conf)
      .getOrCreate()
    spark.sparkContext.setLogLevel("warn")
    println("OK!")

    // 1. ETL

    // 2. Reports

    // 3. Generate the business district library

    // 4. Tagging

    // close resource
    spark.close()
  }
}

4. ETL development

Requirements:

  • Convert the ip address in each line of the data file into longitude, latitude, province and city information;

IP => longitude, latitude, province, city
  • Save the converted data file (one file per day)

Processing steps:

  • Read data
  • data processing
  • Find the ip address in each row of data
  • According to the ip address, calculate the corresponding province, city, longitude and latitude, and add them to the tail of each line of data
  • Save data
  • Other requirements: the data is loaded once a day, and the daily data is stored in a separate file

Difficult problem: data processing (how to convert IP address into province, city, longitude and latitude)

4.1 build ETL architecture

Create a new trait (Processor) to provide a unified interface class for data processing

import org.apache.spark.sql.SparkSession

// Data processing interface
// SparkSession is used for data loading and processing
// KuduContext is used to save data
trait Processor {
  def process(spark: SparkSession)
}

Create a new ETLProcessor to be responsible for ETL processing

import cn.itbigdata.dmp.customtrait.Processor
import cn.itbigdata.dmp.utils.ConfigHolder
import org.apache.spark.sql.{DataFrame, SparkSession}

object ETLProcessor extends Processor{
  override def process(spark: SparkSession): Unit = {
    // Define parameters
    val sourceDataFile = ConfigHolder.adDataPath
    val sinkDataPath = ""

    // 1 read data
    val sourceDF: DataFrame = spark.read.json(sourceDataFile)

    // 2 processing data
    // 2.1 find ip
    // 2.2 convert ip to province, city, longitude and latitude
    val rdd = sourceDF.rdd
      .map(row => {
        val ip: String = row.getAs[String]("ip")
        ip
      })

    // 2.3 put the province, city, longitude and latitude at the end of the original data

    // 3 save data
  }
}

4.2 IP address conversion to latitude and longitude

  • Use GeoIP to convert an IP address into latitude and longitude
  • GeoIP is a software toolkit that ships with an IP database
  • It locates the longitude/latitude, country/region, province, city and street of an IP based on the visitor's IP address
  • GeoIP comes in two editions, a free one and a paid one
  • The paid edition is more accurate and updated more frequently
  • Because GeoIP reads a local binary IP database, lookups are very efficient (a standalone check is sketched below)
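
As a standalone sanity check, the lookup can be exercised outside Spark. A sketch, assuming the free GeoLiteCity database has been downloaded to data/geoLiteCity.dat (the same path the ETL code below uses):

import com.maxmind.geoip.LookupService

object GeoIpDemo {
  def main(args: Array[String]): Unit = {
    // Load the binary GeoLiteCity database into memory for fast lookups
    val service = new LookupService("data/geoLiteCity.dat", LookupService.GEOIP_MEMORY_CACHE)
    val loc = service.getLocation("114.114.114.114")   // any public IP address
    println(s"country=${loc.countryName}, city=${loc.city}, lon=${loc.longitude}, lat=${loc.latitude}")
    service.close()
  }
}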

4.3 IP address conversion

  • The Chunzhen (QQWry) IP database is used to turn an IP into province and city
  • The Chunzhen database collects IP address data from ISPs including China Telecom, China Mobile, China Unicom, Great Wall Broadband and Juyou Broadband
  • The Chunzhen database is a binary file shipped with open-source Java code, which can be slightly modified and called directly
case class Location(ip: String, region: String, city: String, longitude: Float, latitude: Float)

  private def ipToLocation(ip: String): Location ={
    // 1 get service
    val service = new LookupService("data/geoLiteCity.dat")

    // 2 get Location
    val longAndLatLocation = service.getLocation(ip)

    // 3 obtain longitude and latitude
    val longitude = longAndLatLocation.longitude
    val latitude = longAndLatLocation.latitude

    // 4 use the Chunzhen (QQWry) IP database to obtain province and city information
    val ipService = new IPAddressUtils
    val regionLocation: IPLocation = ipService.getregion(ip)
    val region = regionLocation.getRegion
    val city = regionLocation.getCity

    Location(ip, region, city, longitude, latitude)
  }

Need to implement help class:

  • Calculate the date of the day
  import java.util.{Calendar, Date}
  import org.apache.commons.lang.time.FastDateFormat

  object DateUtils {
    def getToday: String = {
      val now = new Date
      FastDateFormat.getInstance("yyyyMMdd").format(now)
  }

    def getYesterday: String = {
      val calendar: Calendar = Calendar.getInstance
      calendar.add(Calendar.DATE, -1)
      FastDateFormat.getInstance("yyyyMMdd").format(calendar.getTime())
    }

    def main(args: Array[String]): Unit = {
      println(getToday)
      println(getYesterday)
    }
  }
  

4.4 complete realization of ETL

import java.util.Calendar
import cn.itbigdata.dmp.customtrait.Processor
import cn.itbigdata.dmp.util.iplocation.{IPAddressUtils, IPLocation}
import com.maxmind.geoip.LookupService
import org.apache.commons.lang3.time.FastDateFormat
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object ETLProcessor extends Processor{
  // Define parameters
  private val sourceDataFile: String = "data/data.json"
  private val sinkDataPath: String = s"outputdata/maindata-$getYesterday"
  private val geoFilePath: String = "data/geoLiteCity.dat"

  override def process(spark: SparkSession): Unit = {
    // 1 read data
    val sourceDF: DataFrame = spark.read.json(sourceDataFile)

    // 2 processing data
    // 2.1 find ip
    // 2.2 convert ip to province, city, longitude and latitude
    import spark.implicits._
    val ipDF: DataFrame = sourceDF.rdd
      .map { row =>
        val ip = row.getAs[String]("ip")
        // Convert ip to province, city, longitude and latitude
        ip2Location(ip)
      }.toDF

    // 2.3 join ipDF and sourceDF to add province, city, longitude and latitude to each line
    val sinkDF: DataFrame = sourceDF.join(ipDF, Seq("ip"), "inner")

    // 3 save data
    sinkDF.write.mode(SaveMode.Overwrite).json(sinkDataPath)
  }

  case class Location(ip: String, region: String, city: String, longitude: Float, latitude: Float)

  private def ip2Location(ip: String): Location ={
    // 1 get service
    val service = new LookupService(geoFilePath)

    // 2 get Location
    val longAndLatLocation = service.getLocation(ip)

    // 3 obtain longitude and latitude
    val longitude = longAndLatLocation.longitude
    val latitude = longAndLatLocation.latitude

    // 4 use the Chunzhen (QQWry) IP database to obtain province and city information
    val ipService = new IPAddressUtils
    val regionLocation: IPLocation = ipService.getregion(ip)
    val region = regionLocation.getRegion
    val city = regionLocation.getCity

    Location(ip, region, city, longitude, latitude)
  }

  private def getYesterday: String = {
    val calendar: Calendar = Calendar.getInstance
    calendar.add(Calendar.DATE, -1)
    FastDateFormat.getInstance("yyyyMMdd").format(calendar.getTime())
  }
}
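
With ETLProcessor in place, the ETL placeholder in the DmpApp main program from section 3.8 can be filled in. A sketch (the report processors referenced in comments are built in the next chapter):

    // inside DmpApp.main, after the SparkSession has been created
    // 1. ETL
    ETLProcessor.process(spark)

    // 2. Reports (see the next chapter)
    // RegionStatProcessor.process(spark)
    // RegionAnalysisProcessor.process(spark)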

5. Report development (data analysis – SparkSQL)

Reports to be processed

  • Count the quantity distribution of each region (RegionStatProcessor)
  • Regional distribution statistics of advertising (RegionAnalysisProcessor)
  • APP distribution statistics (AppAnalysisProcessor)
  • Mobile device type distribution statistics (DeviceAnalysisProcessor)
  • Network type distribution statistics (NetworkAnalysisProcessor)
  • Distribution statistics of network operators (IspAnalysisProcessor)
  • Channel distribution statistics (ChannelAnalysisProcessor)

5.1 geographical distribution of data

  • Steps of report processing
  • Understand business requirements: find out the distribution of data volume according to provincial and municipal grouping
  • Source data: daily log data, i.e. ETL result data;
  • Target data: saved in a local file, and each report corresponds to a directory;
  • Write SQL and test
  • code implementation
  • Define RegionStatProcessor, which inherits from Processor and implements process method. The specific implementation steps are as follows:
import cn.itbigdata.dmp.customtrait.Processor
import cn.itbigdata.dmp.utils.{ConfigHolder, DateUtils}
import org.apache.spark.sql.{SaveMode, SparkSession}

object RegionStatProcessor extends Processor{
  override def process(spark: SparkSession): Unit = {
    // Define parameters
    val sourceDataPath = s"outputdata/maindata-${DateUtils.getYesterday}"
    val sinkDataPath = "output/regionstat"

    // read file
    val sourceDF = spark.read.json(sourceDataPath)
    sourceDF.createOrReplaceTempView("adinfo")

    // Processing data
    val RegionSQL1 =
      """
        |select to_date(now()) as statdate, region, city, count(*) as infocount
        |  from adinfo
        |group by region, city
        |""".stripMargin

    val sinkDF = spark.sql(RegionSQL1)
    sinkDF.show()

    // Write file
    sinkDF.coalesce(1).write.mode(SaveMode.Append).json(sinkDataPath)
  }
}

5.2 regional distribution of advertising

Complete the report in the following modes as required

Note: three rates are required: bidding success rate, advertising click through rate and media click through rate

Index calculation logic

| Metric | Description | Filter conditions |
| --- | --- | --- |
| Original requests | Number of all original requests sent | requestmode = 1, processnode >= 1 |
| Valid requests | Number of valid requests | requestmode = 1, processnode >= 2 |
| Ad requests | Number of requests that qualify as ad requests | requestmode = 1, processnode = 3 |
| Bids | Number of times the platform participated in bidding | adplatformproviderid >= 100000, iseffective = 1, isbilling = 1, isbid = 1, adorderid != 0 |
| Winning bids | Number of successful bids | adplatformproviderid >= 100000, iseffective = 1, isbilling = 1, iswin = 1 |
| (Advertiser) impressions | For advertiser statistics: number of ads finally displayed on the device | requestmode = 2, iseffective = 1 |
| (Advertiser) clicks | For advertiser statistics: number of displayed ads actually clicked | requestmode = 3, iseffective = 1 |
| (Media) impressions | For media statistics: number of ads displayed on the device | requestmode = 2, iseffective = 1, isbilling = 1 |
| (Media) clicks | For media statistics: number of displayed ads actually clicked | requestmode = 3, iseffective = 1, isbilling = 1 |
| DSP ad consumption | winprice / 1000 | adplatformproviderid >= 100000, iseffective = 1, isbilling = 1, iswin = 1, adorderid > 200000, adcreativeid > 200000 |
| DSP ad cost | adpayment / 1000 | adplatformproviderid >= 100000, iseffective = 1, isbilling = 1, iswin = 1, adorderid > 200000, adcreativeid > 200000 |

DSP advertising consumption = DSP RTB money

DSP advertising cost = money paid by advertisers to DSP

DSP's profit = DSP advertising cost - DSP advertising consumption

Note: corresponding fields:

OriginalRequest,ValidRequest,adRequest
bidsNum,bidsSus,bidRate
adDisplayNum,adClickNum,adClickRate
MediumDisplayNum,MediumClickNum,MediumClickRate
adconsume,adcost

code implementation

import cn.itbigdata.dmp.customtrait.Processor
import cn.itbigdata.dmp.utils.DateUtils
import org.apache.spark.sql.{SaveMode, SparkSession}

object RegionAnalysisProcessor extends Processor{
  override def process(spark: SparkSession): Unit = {
    // Define parameters
    val sourceDataPath = s"outputdata/maindata-${DateUtils.getYesterday}"
    val sinkDataPath = "outputdata/regionanalysis"

    // read file
    val sourceDF = spark.read.json(sourceDataPath)
    sourceDF.createOrReplaceTempView("adinfo")

    // Processing data
    val RegionSQL1 =
      """
        |select to_date(now()) statdate, region, city,
        |       sum(case when requestmode=1 and processnode>=1 then 1 else 0 end) as OriginalRequest,
        |       sum(case when requestmode=1 and processnode>=2 then 1 else 0 end) as ValidRequest,
        |       sum(case when requestmode=1 and  processnode=3 then 1 else 0 end) as adRequest,
        |       sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1 and isbid=1 and adorderid!=0
        |                then 1 else 0 end) as bidsNum,
        |       sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1 and iswin=1
        |                then 1 else 0 end) as bidsSus,
        |       sum(case when requestmode=2 and iseffective=1 then 1 else 0 end) as adDisplayNum,
        |       sum(case when requestmode=3 and iseffective=1 then 1 else 0 end) as adClickNum,
        |       sum(case when requestmode=2 and iseffective=1 and isbilling=1 then 1 else 0 end) as MediumDisplayNum,
        |       sum(case when requestmode=3 and iseffective=1 and isbilling=1 then 1 else 0 end) as MediumClickNum,
        |       sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1
        |                 and iswin=1 and adorderid>200000 and adcreativeid>200000
        |                then winprice/1000 else 0 end) as adconsume,
        |       sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1
        |                 and iswin=1 and adorderid>200000 and adcreativeid>200000
        |                then adpayment/1000 else 0 end) as adcost
        |  from adinfo
        |group by region, city
        |    """.stripMargin
    spark.sql(RegionSQL1).createOrReplaceTempView("tabtemp")

    val RegionSQL2 =
      """
        |select statdate, region, city,
        |       OriginalRequest, ValidRequest, adRequest,
        |       bidsNum, bidsSus, bidsSus/bidsNum as bidRate,
        |       adDisplayNum, adClickNum, adClickNum/adDisplayNum as adClickRate,
        |       MediumDisplayNum, MediumClickNum, MediumClickNum/MediumDisplayNum as mediumClickRate,
        |       adconsume, adcost
        |  from tabtemp
    """.stripMargin
    val sinkDF = spark.sql(RegionSQL2)

    // Write file
    sinkDF.coalesce(1).write.mode(SaveMode.Append).json(sinkDataPath)
  }
}
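
The remaining reports in the list at the top of this chapter follow exactly the same pattern; only the grouping column and the output directory change. As an example, a sketch of the APP distribution report (AppAnalysisProcessor), assuming it reuses the metric definitions of the regional report and groups by appname:

import cn.itbigdata.dmp.customtrait.Processor
import cn.itbigdata.dmp.utils.DateUtils
import org.apache.spark.sql.{SaveMode, SparkSession}

object AppAnalysisProcessor extends Processor {
  override def process(spark: SparkSession): Unit = {
    // Define parameters
    val sourceDataPath = s"outputdata/maindata-${DateUtils.getYesterday}"
    val sinkDataPath = "outputdata/appanalysis"

    // Read the ETL result and register a temp view
    val sourceDF = spark.read.json(sourceDataPath)
    sourceDF.createOrReplaceTempView("adinfo")

    // Same request/display/click metrics as the regional report, grouped by app name
    val appSQL =
      """
        |select to_date(now()) as statdate, appname,
        |       sum(case when requestmode=1 and processnode>=1 then 1 else 0 end) as OriginalRequest,
        |       sum(case when requestmode=1 and processnode>=2 then 1 else 0 end) as ValidRequest,
        |       sum(case when requestmode=1 and processnode=3  then 1 else 0 end) as adRequest,
        |       sum(case when requestmode=2 and iseffective=1 then 1 else 0 end) as adDisplayNum,
        |       sum(case when requestmode=3 and iseffective=1 then 1 else 0 end) as adClickNum
        |  from adinfo
        |group by appname
      """.stripMargin
    val sinkDF = spark.sql(appSQL)

    // One output directory per report
    sinkDF.coalesce(1).write.mode(SaveMode.Append).json(sinkDataPath)
  }
}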

6. Data tagging

6.1 what is data tagging

  • Why label data

  • Analyze data requirements

  • Users querying the data need to be able to filter target audiences by conditions, for example:
    • Region, or even business district
    • Gender
    • Age
    • Interest
    • Device
  • data format

Target data: (user id, all tags). The label is as follows:

(CH@123485 -> 1.0, KW@word -> 1.0, CT@Beijing -> 1.0, GD@female -> 1.0, AGE@40 -> 1.0, TA@Beihai -> 1.0, TA@Beach -> 1.0)
  • Tag data is organized as a Map[String, Double]
  • Each key is prefix + label; the value 1.0 is the weight
  • Labels to be produced:
    • Advertising type
    • Channel
    • App name
    • Gender
    • Geographical location
    • Device
    • Keyword
    • Age
    • Business district (ignored for now)
  • Steps for tagging the log data:
    • Calculate the tags (ad type, channel, app name, gender, ...)
    • Extract the user IDs
    • Unify user identification
    • Persist the tag data

6.2 framing

// Note: this Kudu-based variant assumes a Processor trait whose process method also receives a KuduContext
import cn.itbigdata.dmp.customtrait.Processor
import cn.itbigdata.dmp.utils.ConfigHolder
import org.apache.kudu.spark.kudu.{KuduContext, KuduDataFrameReader}
import org.apache.spark.sql.SparkSession

object TagProcessor extends Processor{
  override def process(spark: SparkSession, kudu: KuduContext): Unit = {
    // Define parameters
    val sourceTableName = ConfigHolder.ADMainTableName
    val sinkTableName = ""
    val keys = ""

    // 1 read data
    val sourceDF = spark.read
      .option("kudu.master", ConfigHolder.kuduMaster)
      .option("kudu.table", sourceTableName)
      .kudu

    // 2 processing data
    sourceDF.rdd
      .map(row => {
        // Advertising type, channel and App name
        val adTags = AdTypeTag.make(row)

        // Gender, geographical location, equipment

        // Keywords, age, business district

      })

    // 3 save data

  }
}

Define interface classes:

import org.apache.spark.sql.Row

trait TagMaker {
  def make(row: Row, dic: collection.Map[String, String]=null): Map[String, Double]
}

6.3 labeling

6.3.1 ad type (AdTypeTag)

Field: adspacetype. Meaning: 1: banner; 2: interstitial; 3: full screen

import cn.itbigdata.dmp.customtrait.TagMaker
import org.apache.spark.sql.Row

object AdTypeTag extends TagMaker{
  private val adPrefix = "adtype@"

  override def make(row: Row, dic: collection.Map[String, String]): Map[String, Double] = {
    // 1 value
    val adType: Long = row.getAs[Long]("adspacetype")
    // 1: banner; 2: interstitial; 3: full screen

    // 2 Calculation label
    adType match {
      case 1 => Map(s"${adPrefix}banner" -> 1.0)
      case 2 => Map(s"${adPrefix}interstitial" -> 1.0)
      case 3 => Map(s"${adPrefix}fullscreen" -> 1.0)
      case _ => Map[String, Double]()
    }
  }
}

6.3.2 channel (ChannelTag)

Field: channelid

import cn.itbigdata.dmp.customtrait.TagMaker
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.Row

object ChannelTag extends TagMaker{
  private val channelPrefix = "channel@"

  override def make(row: Row, dic: collection.Map[String, String]=null): Map[String, Double] = {
    // 1 value
    val channelid = row.getAs[String]("channelid")
    // 2 Calculation label
    if (StringUtils.isNotBlank(channelid)){
      Map(s"${channelPrefix}channelid" -> 1.0)
    }
    else
      Map[String, Double]()
  }

  def main(args: Array[String]): Unit = {
    // Judge whether a string is not empty, the length is not 0, and is not composed of white space (space)
    if (!StringUtils.isNotBlank(null)) println("blank1 !")
    if (!StringUtils.isNotBlank("")) println("blank2 !")
    if (!StringUtils.isNotBlank("   ")) println("blank3 !")
  }
}

6.3.3 App name (AppNameTag)

Field: appid; To convert appid to appname

Look up the given dictionary table: dicapp

import cn.itbigdata.dmp.customtrait.TagMaker
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.Row

object AppNameTag extends TagMaker{
  // Get prefix
  private val appNamePrefix = "appname@"

  override def make(row: Row, dic: collection.Map[String, String]): Map[String, Double] = {
    // 1 get location information
    val appId = row.getAs[String]("appid")

    // 2 calculate and return labels
    val appName = dic.getOrElse(appId, "")
    if (StringUtils.isNotBlank(appName))
      Map(s"${appNamePrefix}$appName" -> 1.0)
    else
      Map[String, Double]()
  }
}

6.3.4 gender (SexTag)

Field: sex;

import cn.itbigdata.dmp.customtrait.TagMaker
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.Row

object SexTag extends TagMaker{
  private val sexPrefix: String = "sex@"

  override def make(row: Row, dic: collection.Map[String, String]=null): Map[String, Double] = {
    // Get tag information
    val sexid: String = row.getAs[String]("sex")
    val sex = sexid match {
      case "1" => "male"
      case "2" => "female"
      case _   => "To be filled in"
    }

    // Calculation return label
    if (StringUtils.isNotBlank(sex))
      Map(s"$sexPrefix$sex" -> 1.0)
    else
      Map[String, Double]()
  }
}

6.3.5 geographic location (GeoTag)

Fields: region, city

import cn.itbigdata.dmp.customtrait.TagMaker
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.Row

object GeoTag extends TagMaker{
  private val regionPrefix = "region@"
  private val cityPrefix = "city@"

  override def make(row: Row, dic: collection.Map[String, String]=null): Map[String, Double] = {
    // Get tag information
    val region = row.getAs[String]("region")
    val city = row.getAs[String]("city")

    // Calculate and return label information
    val regionTag = if (StringUtils.isNotBlank(region))
      Map(s"$regionPrefix$region" -> 1.0)
    else
      Map[String, Double]()

    val cityTag = if (StringUtils.isNotBlank(city))
      Map(s"$cityPrefix$city" -> 1.0)
    else
      Map[String, Double]()

    regionTag ++ cityTag
  }
}

6.3.6 equipment (DeviceTag)

Fields: client, networkmannername, ispname;

Data dictionary: dicdevice

  • client: device type (1: android 2: ios 3: wp 4: others)
  • networkmannername: name of networking mode (2G, 3G, 4G, others)
  • ispname: operator name (Telecom, China Mobile, China Unicom...)
import cn.itbigdata.dmp.customtrait.TagMaker
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.Row

object DeviceTag extends TagMaker{
  val clientPrefix = "client@"
  val networkPrefix = "network@"
  val ispPrefix = "isp@"

  override def make(row: Row, dic: collection.Map[String, String]): Map[String, Double] = {
    // Get tag information
    val clientName: String = row.getAs[Long]("client").toString
    val networkName: String = row.getAs[String]("networkmannername")
    val ispName: String = row.getAs[String]("ispname")

    // Calculate and return labels
    val clientId = dic.getOrElse(clientName, "D00010004")
    val networkId = dic.getOrElse(networkName, "D00020005")
    val ispId = dic.getOrElse(ispName, "D00030004")

    val clientTag = if (StringUtils.isNotBlank(clientId))
      Map(s"$clientPrefix$clientId" -> 1.0)
    else
      Map[String, Double]()

    val networkTag = if (StringUtils.isNotBlank(networkId))
      Map(s"$networkPrefix$networkId" -> 1.0)
    else
      Map[String, Double]()

    val ispTag = if (StringUtils.isNotBlank(ispId))
      Map(s"$ispPrefix$ispId" -> 1.0)
    else
      Map[String, Double]()

    clientTag ++ networkTag ++ ispTag
  }
}

6.3.7 keywords (KeywordTag)

Fields: keywords

import cn.itbigdata.dmp.customtrait.TagMaker
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.Row

object KeywordsTag extends TagMaker{
  private val keywordPrefix = "keyword@"

  override def make(row: Row, dic: collection.Map[String, String]): Map[String, Double] = {
    row.getAs[String]("keywords")
      .split(",")
      .filter(word => StringUtils.isNotBlank(word))
      .map(word => s"$keywordPrefix$word" -> 1.0)
      .toMap
  }
}

6.3.8 age (AgeTag)

Field: age

import cn.itbigdata.dmp.customtrait.TagMaker
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.Row

object AgeTag extends TagMaker{
  private val agePrefix = "age@"

  override def make(row: Row, dic: collection.Map[String, String]): Map[String, Double] = {
    val age = row.getAs[String]("age")

    if (StringUtils.isNotBlank(age))
      Map(s"$agePrefix$age" -> 1.0)
    else
      Map[String, Double]()
  }
}

6.3.9 main handler (TagProcessor)

import cn.itbigdata.dmp.customtrait.Processor
import cn.itbigdata.dmp.utils.DateUtils
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.SparkSession

object TagProcessor extends Processor{
  override def process(spark: SparkSession): Unit = {
    // Define parameters
    val sourceTableName = s"outputdata/maindata-${DateUtils.getYesterday}"
    val appdicFilePath = "data/dicapp"
    val deviceFilePath = "data/dicdevice"
    val sinkTableName = ""

    // 1 read data
    val sourceDF = spark.read.json(sourceTableName)

    // Read app information (file) and convert it into broadcast variable (optimization)
    val appdicMap = spark.sparkContext.textFile(appdicFilePath)
      .map(line => {
        val arr: Array[String] = line.split("##")
        (arr(0), arr(1))
      }).collectAsMap()
    val appdicBC: Broadcast[collection.Map[String, String]] = spark.sparkContext.broadcast(appdicMap)

    // Read dictionary information (file) and convert it to broadcast variables (optimization)
    val deviceMap = spark.sparkContext.textFile(deviceFilePath)
      .map(line => {
        val arr: Array[String] = line.split("##")
        (arr(0), arr(1))
      }).collectAsMap()
    val deviceBC: Broadcast[collection.Map[String, String]] = spark.sparkContext.broadcast(deviceMap)

    // 2 processing data
    sourceDF.printSchema()
    sourceDF.rdd
      .map(row => {
        // Advertising type, channel and App name
        val adTags: Map[String, Double] = AdTypeTag.make(row)
        val channelTags: Map[String, Double] = ChannelTag.make(row)
        val appNameTags: Map[String, Double] = AppNameTag.make(row, appdicBC.value)

        // Gender, geographical location, equipment type
        val sexTags = SexTag.make(row)
        val geoTags = GeoTag.make(row)
        val deviceTags = DeviceTag.make(row, deviceBC.value)

        // Keywords, age
        val keywordsTags = KeywordsTag.make(row)
        val ageTags = AgeTag.make(row)

        // Return all data into a large Map
        val tags = adTags ++ channelTags ++ appNameTags ++ sexTags ++ geoTags ++ deviceTags ++ keywordsTags ++ ageTags

        tags
      }).collect.foreach(println)

    // 3 save data

  }
}

6.4 extracting user ID

  • The log data is specific to a user's single browsing behavior

  • A user may have multiple pieces of data in a day

  • The label is for people

  • Existing problems:

  • Extract the concept of people from the data set so that one person can correspond to one piece of data

  • There is no usable user id in the log data, so we settle for second best: use device information to identify the user:
    • IMEI: the International Mobile Equipment Identity, commonly known as the phone's serial number, identifies each individual mobile phone and other mobile communication device in the mobile network, effectively serving as the phone's ID card. The IMEI is written on the motherboard and reinstalling an APP does not change it. Android 6.0 and above requires the user to grant the READ_PHONE_STATE permission; if the user refuses, it cannot be obtained;
    • IDFA: introduced in iOS 6, it allows advertising effectiveness to be measured while preventing apps from tracking the device itself. It may change, for example after a system reset or when the advertising identifier is reset in Settings. Users can also enable "Limit Ad Tracking" in Settings;
    • MAC address: a hardware identifier, including the WiFi MAC address and the Bluetooth MAC address. Access has been prohibited since iOS 7;
    • OpenUDID: when iOS 5 deprecated the UDID, developers looked for a replacement that was not controlled by Apple, and OpenUDID became the most widely used open-source alternative at the time. OpenUDID is very simple to integrate into a project and is supported by a range of advertising providers;
    • Android ID: when the device first boots, the system randomly generates a 64-bit number and stores it as a hexadecimal string; this is ANDROID_ID, and it is reset when the device is wiped;
  • The fields in the log data that can be used to identify users include:
    • imei,mac,idfa,openudid,androidid
    • imeimd5,macmd5,idfamd5,openudidmd5,androididmd5
    • imeisha1,macsha1,idfasha1,openudidsha1,androididsha1
  • What is invalid data: if all the above 15 fields are empty, this data cannot be associated with any user. This data is useless to us. It is invalid data. This data needs to be removed.
  // Filtering is required when 15 fields are empty at the same time
  lazy val filterSQL: String = idFields
      .split(",")
      .map(field => s"$field != ''")
      .mkString(" or ")

          // Extract user ID
          val userIds = getUserIds(row)

          // Return label
          (userIds.head, (userIds, tags))

    // Extract user ID
    private def getUserIds(row: Row): List[String] = {
      val userIds: List[String] = idFields.split(",")
        .map(field => (field, row.getAs[String](field)))
        .filter { case (key, value) => StringUtils.isNotBlank(value) }
        .map { case (key, value) => s"$key::$value" }.toList

      userIds
    }
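
A quick way to check the extraction logic is to build a one-row DataFrame with only some of the identifier fields populated and apply the same transformation. A hypothetical example (the field list is shortened to three fields for the demo):

import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.{Row, SparkSession}

object UserIdDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("UserIdDemo").getOrCreate()
    import spark.implicits._

    val idFields = "imei,mac,idfa"   // shortened list, for illustration only
    val row: Row = Seq(("865432011112222", "", "AAAA-BBBB")).toDF("imei", "mac", "idfa").head()

    val userIds: List[String] = idFields.split(",")
      .map(field => (field, row.getAs[String](field)))
      .filter { case (_, value) => StringUtils.isNotBlank(value) }
      .map { case (key, value) => s"$key::$value" }
      .toList

    println(userIds)   // List(imei::865432011112222, idfa::AAAA-BBBB)
    spark.stop()
  }
}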
  

6.5 user identification

  • The fifteen fields (those that are non-empty) are used jointly to identify a user
  • During data collection:
  • Each record may contain a different subset of these fields
  • Some field values may also change over time
  • So how do we recognize that different records belong to the same user?

6.6 user identification & data aggregation and consolidation

  // Unified user identification; Data aggregation and consolidation
  private def graphxAnalysis(rdd: RDD[(List[String], List[(String, Double)])]): RDD[(List[String], List[(String, Double)])] ={
    // 1 define vertices (data structure: Long, ""; algorithm: each element in the List can be used as a vertex, and the List itself can also be used as a vertex)
    val dotsRDD: RDD[(String, List[String])] = rdd.flatMap{ case (lst1, _) => lst1.map(elem => (elem, lst1)) }
    val vertexes: RDD[(Long, String)] = dotsRDD.map { case (id, ids) => (id.hashCode.toLong, "") }

    // 2 define edges (data structure: Edge(Long, Long, 0))
    val edges: RDD[Edge[Int]] = dotsRDD.map { case (id, ids) => Edge(id.hashCode.toLong, ids.mkString.hashCode.toLong, 0) }

    // 3 generation diagram
    val graph = Graph(vertexes, edges)

    // 4 connected components (vertices in the same component identify the same user)
    val idRDD: VertexRDD[VertexId] = graph.connectedComponents()
      .vertices

    // 5 definition data (ids and tags)
    val idsRDD: RDD[(VertexId, (List[String], List[(String, Double)]))] =
      rdd.map { case (ids, tags) => (ids.mkString.hashCode.toLong, (ids, tags)) }

    // 6. join the results of step 4 with the results of step 5, and classify all the data [one user, one classification]
    val joinRDD: RDD[(VertexId, (List[String], List[(String, Double)]))] = idRDD.join(idsRDD)
      .map { case (key, value) => value }

    // 7 data aggregation (data of the same user are put together)
    val aggRDD: RDD[(VertexId, (List[String], List[(String, Double)]))] = joinRDD.reduceByKey { case ((bufferIds, bufferTags), (ids, tags)) =>
      (bufferIds ++ ids, bufferTags ++ tags)
    }

    // 8. Data consolidation (de duplication for id and consolidation weight for tags)
    val resultRDD: RDD[(List[String], List[(String, Double)])] = aggRDD.map { case (key, (ids, tags)) =>
      val newTags = tags.groupBy(x => x._1)
        .mapValues(lst => lst.map { case (word, weight) => weight }.sum)
        .toList
      (ids.distinct, newTags)
    }

    resultRDD
  }
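
A hypothetical smoke test for graphxAnalysis (assuming it is accessible from the calling code and a SparkSession named spark is in scope): the first two records share the identifier imei::111, so they should collapse into one user, with the weight of the duplicated tag summed.

    val sample: RDD[(List[String], List[(String, Double)])] = spark.sparkContext.parallelize(Seq(
      (List("imei::111", "mac::aa"), List("adtype@banner" -> 1.0)),
      (List("imei::111", "idfa::xx"), List("adtype@banner" -> 1.0, "sex@male" -> 1.0)),
      (List("androidid::zz"), List("age@30" -> 1.0))
    ))
    graphxAnalysis(sample).collect().foreach(println)
    // Expected: two users. The first merges imei::111, mac::aa and idfa::xx and sums
    // the duplicated adtype@banner weight to 2.0; the second keeps age@30 -> 1.0.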


6.7 label landing

When saving data to kudu, please note:

1. Save a new table every day (a new table must be created). Table name: usertags_<date of the day>

2. Data type conversion: RDD[(List[String], List[(String, Double)])] => RDD[(String, String)] => DataFrame

  • Convert List[String] to String; pay attention to the choice of separator
  • Convert List[(String, Double)] to String; pay attention to the choice of separator
  • The separator must not clash with symbols that appear in the data, and it must be possible to add it and strip it back out.
    // 3 data landing (kudu)
    // Change the List data type to String
    import spark.implicits._
    val resultDF = mergeRDD.map{ case (ids, tags) =>
      (ids.mkString("||"), tags.map{case (key, value) => s"$key->$value"}.mkString("||"))
    }.toDF("ids", "tags")
    DBUtils.appendData(kudu, resultDF, sinkTableName, keys)
  }


6.8 tag processor

package cn.itcast.dmp.tags

import cn.itcast.dmp.Processor
import cn.itcast.dmp.utils.ConfigHolder
import org.apache.commons.lang3.StringUtils
import org.apache.kudu.spark.kudu.{KuduContext, KuduDataFrameReader}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object TagProcessor extends Processor{
  override def process(spark: SparkSession, kudu: KuduContext): Unit = {
    // Define parameters
    val sourceTableName = ConfigHolder.ADMainTableName
    val sinkTableName = ""
    val keys = ""
    val dicAppPath = ConfigHolder.appNameDic
    val dicDevicePath = ConfigHolder.deviceDic
    val tradingAreaTableName = ConfigHolder.tradingAreaTableName
    val filterSQL = ConfigHolder.filterSQL
    val idFields = ConfigHolder.idFields

    // 1 read data
    val sc = spark.sparkContext
    val sourceDF = spark.read
      .option("kudu.master", ConfigHolder.kuduMaster)
      .option("kudu.table", sourceTableName)
      .kudu

    // Read app dictionary information
    val appDic = sc.textFile(dicAppPath)
      .map(line => {
        val arr = line.split("##")
        (arr(0), arr(1))
      })
      .collect()
      .toMap
    val appDicBC = sc.broadcast(appDic)

    // Read device dictionary information
    val deviceDic = sc.textFile(dicDevicePath)
      .map(line => {
        val arr = line.split("##")
        (arr(0), arr(1))
      })
      .collect()
      .toMap
    val deviceDicBC = sc.broadcast(deviceDic)

    // Read business district information (read; filter; convert to rdd; retrieve data; collect data to driver; convert to map)
    // Restrictions: the information in the business district table should not be too large (the filtered size should be less than 20M)
    val tradingAreaDic: Map[String, String] = spark.read
      .option("kudu.master", ConfigHolder.kuduMaster)
      .option("kudu.table", tradingAreaTableName)
      .kudu
      .filter("areas!=''")
      .rdd
      .map { case Row(geohash: String, areas: String) => (geohash, areas) }
      .collect()
      .toMap
    val tradingAreaBC = sc.broadcast(tradingAreaDic)

    // 2 processing data
    // Filter out the records in which all 15 identification fields are empty
    val userTagsRDD: RDD[(List[String], List[(String, Double)])] = sourceDF.filter(filterSQL)
      .rdd
      .map(row => {
        // Advertising type, channel and App name
        val adTags = AdTypeTag.make(row)
        val channelTags = ChannelTag.make(row)
        val appNameTags = AppNameTag.make(row, appDicBC.value)

        // Gender, geographic location, device
        val sexTags = SexTag.make(row, appDicBC.value)
        val geoTags = GeoTag.make(row, appDicBC.value)
        val deviceTags = DeviceTag.make(row, deviceDicBC.value)

        // Keywords, age, business district
        val keywordTags = KeywordTag.make(row, appDicBC.value)
        val ageTags = AgeTag.make(row, appDicBC.value)
        val tradingAreaTags = tradingAreaTag.make(row, tradingAreaBC.value)

        // Label merge
        val tags = adTags ++ channelTags ++ appNameTags ++ sexTags ++ geoTags ++ deviceTags ++ keywordTags ++ ageTags ++ tradingAreaTags

        // Extract user ID
        val userIds: List[String] = idFields.split(",")
          .map(field => (field, row.getAs[String](field)))
          .filter { case (key, value) => StringUtils.isNotBlank(value) }
          .map { case (key, value) => s"$key::$value" }.toList

        // Return label
        (userIds, tags)
      })
    userTagsRDD.foreach(println)

    // 3. Unify user identification and merge data
    val mergeRDD: RDD[(List[String], List[(String, Double)])] = graphxAnalysis(userTagsRDD)
      
    // 4 data landing (kudu)
    // Change the List data type to String
    import spark.implicits._
    val resultDF = mergeRDD.map{ case (ids, tags) =>
      (ids.mkString("|||"), tags.map{case (key, value) => s"$key->$value"}.mkString("|||"))
    }.toDF("ids", "tags")
    DBUtils.createTableAndsaveData(kudu, resultDF, sinkTableName, keys)      
      
    // close resource
    sc.stop()  
  }
}
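The tag generators referenced above (AdTypeTag, ChannelTag, AppNameTag, and so on) are implemented elsewhere in the project and are not shown here. Purely to illustrate the row => List[(tag, weight)] contract they share, a hypothetical generator might look roughly like this; the column name appid and the "AppName:" prefix are assumptions borrowed from the example in section 7.3, not the project's actual definitions:

import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.Row

// Hypothetical sketch only: look up the app name by its id and emit one tag with weight 1.0
object AppNameTagSketch {
  def make(row: Row, appDic: Map[String, String]): List[(String, Double)] = {
    val appId = row.getAs[String]("appid")         // column name is an assumption
    val appName = appDic.getOrElse(appId, "")
    if (StringUtils.isNotBlank(appName)) List((s"AppName:$appName", 1.0)) else Nil
  }
}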

7. Spark GraphX

7.1 basic concepts of graph computation

A graph is a mathematical structure used to represent relationships between objects. A graph consists of vertices and edges connecting the vertices: the vertices are the objects, and the edges are the relationships between them.


A directed graph is a graph in which the edges between vertices have a direction. An example of a directed graph is the follower relationship on Twitter: user Bob follows user Carol, but Carol does not follow Bob.


In short, a graph models the relationships between different objects through points (the objects) and edges (the paths between them).

7.2 application scenarios of graph computation

1) Shortest path

In social networks there is the well-known "six degrees of separation" theory: there are at most five intermediaries between you and any stranger, i.e. you can reach any stranger through at most five people. Finding such a chain is a graph problem; in other words, the shortest path between any two people has length at most 6.
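GraphX ships a ShortestPaths utility that computes, for every vertex, the distance to a set of landmark vertices. A minimal sketch on a made-up friendship chain (edges are added in both directions so that edge direction does not matter):

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.graphx.lib.ShortestPaths
import org.apache.spark.{SparkConf, SparkContext}

object ShortestPathDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ShortestPathDemo").setMaster("local[*]"))
    sc.setLogLevel("ERROR")

    // Toy friendship chain 1-2-3-4
    val edges = sc.makeRDD(Seq(
      Edge(1L, 2L, 0), Edge(2L, 1L, 0),
      Edge(2L, 3L, 0), Edge(3L, 2L, 0),
      Edge(3L, 4L, 0), Edge(4L, 3L, 0)
    ))

    // Distance from every vertex to landmark vertex 1
    ShortestPaths.run(Graph.fromEdges(edges, 0), Seq(1L))
      .vertices
      .collect()
      .foreach(println)   // e.g. (4,Map(1 -> 3))

    sc.stop()
  }
}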

2) Community discovery

Community discovery is used to find triangles (circles) in a social network, which makes it possible to analyze which circles are more stable and closely knit; it is a way of measuring how tightly a community is coupled. The more triangles in a person's social circle, the more stable and close their social relationships are. For social networking sites such as Facebook and Twitter, community discovery is a commonly used social analysis algorithm.
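GraphX also has a built-in triangle-counting operator; a minimal sketch on a made-up friendship graph in which vertices 1, 2 and 3 form one triangle:

import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}
import org.apache.spark.{SparkConf, SparkContext}

object TriangleCountDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TriangleCountDemo").setMaster("local[*]"))
    sc.setLogLevel("ERROR")

    // (1,2,3) form a triangle; vertex 4 only hangs off vertex 3
    val edges = sc.makeRDD(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0), Edge(1L, 3L, 0), Edge(3L, 4L, 0)))

    Graph.fromEdges(edges, 0)
      .partitionBy(PartitionStrategy.RandomVertexCut)
      .triangleCount()                 // how many triangles each vertex participates in
      .vertices
      .collect()
      .foreach(println)                // e.g. (1,1) (2,1) (3,1) (4,0)

    sc.stop()
  }
}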

3) Recommendation algorithm (ALS)

ALS (Alternating Least Squares) is a matrix factorization algorithm. For example, if a shopping website wants to recommend goods to users, it needs to know which users are interested in which goods. A user-item matrix can be constructed: a cell is 1 if the user bought the item and 0 if not. The task is then to estimate which of the 0 cells are most likely to become 1.
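ALS itself lives in Spark MLlib rather than GraphX; as a rough sketch of the idea described above (column names, parameters, and data are made up for illustration):

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object ALSDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ALSDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Made-up purchase records: 1.0 means the user bought the item
    val purchases = Seq((1, 101, 1.0), (1, 102, 1.0), (2, 101, 1.0), (3, 103, 1.0))
      .toDF("userId", "itemId", "bought")

    val model = new ALS()
      .setUserCol("userId").setItemCol("itemId").setRatingCol("bought")
      .setImplicitPrefs(true)          // the 0/1 purchase matrix is implicit feedback
      .setRank(10).setMaxIter(5).setRegParam(0.1)
      .fit(purchases)

    // Estimate which "0" cells are most likely to become "1": top 3 items per user
    model.recommendForAllUsers(3).show(false)

    spark.stop()
  }
}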

GraphX extends the Spark RDD abstraction with a resilient distributed property graph.

Generally, the basic data structures in graph computation are expressed as follows (see the short sketch after this list):

  • Graph = (Vertex, Edge)
  • Vertex (vertex / node): (VertexId: Long, info: Any)
  • Edge: Edge(srcId: VertexId, dstId: VertexId, attr) [attr is the edge attribute, e.g. a weight]
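To make this concrete, a minimal sketch that builds a two-vertex graph and inspects its vertices, edges, and triplets (all values are made up):

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object GraphBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GraphBasics").setMaster("local[*]"))
    sc.setLogLevel("ERROR")

    // Vertex: (VertexId, info); Edge: (srcId, dstId, attr)
    val vertices: RDD[(VertexId, String)] = sc.makeRDD(Seq((1L, "Bob"), (2L, "Carol")))
    val edges: RDD[Edge[String]] = sc.makeRDD(Seq(Edge(1L, 2L, "follows")))

    val graph = Graph(vertices, edges)
    graph.vertices.collect().foreach(println)   // e.g. (1,Bob), (2,Carol)
    graph.edges.collect().foreach(println)      // e.g. Edge(1,2,follows)
    graph.triplets.collect().foreach(println)   // e.g. ((1,Bob),(2,Carol),follows)

    sc.stop()
  }
}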

7.3 Spark GraphX example 1 (connected components)



Input log data (ID, keywords, AppName):

ID | Keywords                                       | AppName
1  | Carola                                         | Group car
2  | Bali, Indonesia                                | Where to travel
3  | Master Shandao                                 | Know
4  | The king's woman, Beauty without tears         | Youku
5  | World Cup                                      | Sohu
6  | Carina Lau, Hong Kong and Taiwan Entertainment | Phoenix net
7  | Japan and South Korea Entertainment            | Pepper direct seeding
9  | AK47                                           | Jedi survival: stimulating the battlefield
10 | Funny                                          | YY live broadcast
11 | Literature, current politics                   | Know

Mapping of each ID to the shared center ID (IDS):

ID | IDS
1  | 43125
2  | 43125
3  | 43125
4  | 43125
5  | 43125
4  | 4567
5  | 4567
6  | 4567
7  | 4567
9  | 91011
10 | 91011
11 | 91011

Connected Components algorithm:

1. Define vertex

2. Define edge

3. Generate graph

4. Each connected component in the graph is labeled, and the smallest vertex id in the component is used as the id of that component

Task:

  1. Define vertex
  2. Define edge
  3. Generate graph
  4. Compute the connected components
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object GraphXDemo1 {
  def main(args: Array[String]): Unit = {
    // 1. Initialize SparkContext
    val conf = new SparkConf()
      .setAppName("GraphXDemo1")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")

    // 2. Define vertices (VertexId: Long, info: Any)
    val vertexes: RDD[(VertexId, Map[String, Double])] = sc.makeRDD(List(
      (1L, Map("keyword:Carola" -> 1.0, "AppName:Group car" -> 1.0)),
      (2L, Map("keyword:Indonesia" -> 1.0, "keyword:Bali" -> 1.0, "AppName:Where to travel" -> 1.0)),
      (3L, Map("keyword:Master Shandao" -> 1.0, "AppName:Know" -> 1.0)),
      (4L, Map("keyword:The king's Woman" -> 1.0, "keyword:Beauty without tears" -> 1.0, "AppName:Youku" -> 1.0)),
      (5L, Map("keyword:World Cup" -> 1.0, "AppName:Sohu" -> 1.0)),
      (6L, Map("keyword:Carina Lau" -> 1.0, "keyword:Hong Kong and Taiwan Entertainment" -> 1.0, "AppName:Phoenix net" -> 1.0)),
      (7L, Map("keyword:Japan and South Korea Entertainment" -> 1.0, "AppName:Pepper direct seeding" -> 1.0)),
      (9L, Map("keyword:AK47" -> 1.0, "AppName:Jedi survival:Stimulate the battlefield" -> 1.0)),
      (10L, Map("keyword:Funny" -> 1.0, "AppName:YY live broadcast" -> 1.0)),
      (11L, Map("keyword:literature" -> 1.0, "keyword:Current politics" -> 1.0, "AppName:Know" -> 1.0))
    ))

    // 3. Define Edge(srcId: VertexId, dstId: VertexId, attr)
    val edges: RDD[Edge[Int]] = sc.makeRDD(List(
      Edge(1L, 43125L, 0),
      Edge(2L, 43125L, 0),
      Edge(3L, 43125L, 0),
      Edge(4L, 43125L, 0),
      Edge(5L, 43125L, 0),
      Edge(4L, 4567L, 0),
      Edge(5L, 4567L, 0),
      Edge(6L, 4567L, 0),
      Edge(7L, 4567L, 0),
      Edge(9L, 91011L, 0),
      Edge(10L, 91011L, 0),
      Edge(11L, 91011L, 0)
    ))

    // 4. Build the graph and compute the connected components
    Graph(vertexes, edges)
      .connectedComponents()
      .vertices
      .sortBy(_._2)
      .collect()
      .foreach(println)

    // 5. Resource release
    sc.stop()
  }
}

7.4 Spark GraphX example 2 (user identification & data merging)

From the previous example we already know how to identify users by rule: ids 1-7 fall into one connected component (they are tied together through the shared center ids) and ids 9-11 into another, i.e. the records describe only two distinct users. How does the program handle the merging of their data?

Definition of data:

remarks:

1. The data format defined here is completely consistent with the data format in our program

2. Each RDD element is a tuple: the first element holds the user's various IDs; the second element holds the user's tags

Task:

1. How many users do the six pieces of data represent?

2. Merge data of the same user

    val dataRDD: RDD[(List[String], List[(String, Double)])] = sc.makeRDD(List(
      (List("a1", "b1", "c1"), List("keyword$Beijing" -> 1.0, "keyword$Shanghai" -> 1.0, "area$Zhongguancun" -> 1.0)),
      (List("b1", "c2", "d1"), List("keyword$Shanghai" -> 1.0, "keyword$Tianjin" -> 1.0, "area$Hui Longguan" -> 1.0)),
      (List("d1"), List("keyword$Tianjin" -> 1.0, "area$Zhongguancun" -> 1.0)),
      (List("a2", "b2", "c3"), List("keyword$big data" -> 1.0, "keyword$spark" -> 1.0, "area$Xierqi" -> 1.0)),
      (List("b2", "c4", "d4"), List("keyword$spark" -> 1.0, "area$Wudaokou" -> 1.0)),
      (List("c3", "e3"), List("keyword$hive" -> 1.0, "keyword$spark" -> 1.0, "area$Xierqi" -> 1.0))
    ))

Complete processing steps:

  1. Define vertex
  2. Define edge
  3. Generate graph
  4. Compute the connected components
  5. Find data to merge
  6. Data aggregation
  7. Data merging

Handler:

import org.apache.spark.graphx.{Edge, Graph, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object GraphXDemo {
  def main(args: Array[String]): Unit = {
    // initialization
    val conf: SparkConf = new SparkConf().setAppName("GraphXDemo").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("error")

    // Define data
    val dataRDD: RDD[(List[String], List[(String, Double)])] = sc.makeRDD(List(
      (List("a1", "b1", "c1"), List("kw$Beijing" -> 1.0, "kw$Shanghai" -> 1.0, "area$Zhongguancun" -> 1.0)),
      (List("b1", "c2", "d1"), List("kw$Shanghai" -> 1.0, "kw$Tianjin" -> 1.0, "area$Hui Longguan" -> 1.0)),
      (List("d1"), List("kw$Tianjin" -> 1.0, "area$Zhongguancun" -> 1.0)),
      (List("a2", "b2", "c3"), List("kw$big data" -> 1.0, "kw$spark" -> 1.0, "area$Xierqi" -> 1.0)),
      (List("b2", "c4", "d4"), List("kw$spark" -> 1.0, "area$Wudaokou" -> 1.0)),
      (List("c3", "e3"), List("kw$hive" -> 1.0, "kw$spark" -> 1.0, "area$Xierqi" -> 1.0))
    ))


    // 1 extract each element in the identification information as an id
    // Note 1. flatMap is used here to flatten the elements;
    // Note 2. The label information is lost here, because this RDD is mainly used to construct vertices and edges, and the tags information is not used
    // Note 3. Vertex and edge data require Long, so data type conversion is done here
    val dotRDD: RDD[(VertexId, VertexId)] = dataRDD.flatMap { case (allids, tags) =>
      // Method 1: easy to understand, not easy to write
      //      for (id <- allids) yield {
      //        (id.hashCode.toLong, allids.mkString.hashCode.toLong)
      //      }

      // Method 2: not easy to understand, easy to write. Equivalence of two methods
      allids.map(id => (id.hashCode.toLong, allids.mkString.hashCode.toLong))
    }

    // 2 define vertices
    val vertexesRDD: RDD[(VertexId, String)] = dotRDD.map { case (id, ids) => (id, "") }

    // 3 define edges (id: single identification information; ids: all identification information)
    val edgesRDD: RDD[Edge[Int]] = dotRDD.map { case (id, ids) => Edge(id, ids, 0) }

    // 4 generation diagram
    val graph = Graph(vertexesRDD, edgesRDD)

    // 5 find the strong connectome
    val connectRDD: VertexRDD[VertexId] = graph.connectedComponents()
      .vertices

    // 6. Define the center-point data: (center id, (all ids, tags))
    val centerVertexRDD: RDD[(VertexId, (List[String], List[(String, Double)]))] = dataRDD.map { case (allids, tags) =>
      (allids.mkString.hashCode.toLong, (allids, tags))
    }

    // 7. join the data in steps 5 and 6 to obtain the data to be merged
    val allInfoRDD = connectRDD.join(centerVertexRDD)
      .map { case (id1, (id2, (allIds, tags))) => (id2, (allIds, tags)) }

    // 8 data aggregation (i.e. putting the identity and label of the same user together)
    val mergeInfoRDD: RDD[(VertexId, (List[String], List[(String, Double)]))] = allInfoRDD.reduceByKey { case ((bufferList, bufferMap), (allIds, tags)) =>
      val newList = bufferList ++ allIds
      // Concatenate the tag lists (identical tags are summed in step 9)
      val newMap = bufferMap ++ tags
      (newList, newMap)
    }

    // 9. Data consolidation (allIds: de-duplicate; tags: merge the weights)
    val resultRDD: RDD[(List[String], Map[String, Double])] = mergeInfoRDD.map { case (key, (allIds, tags)) =>
      val newIds = allIds.distinct
      // Aggregate according to key; Then the second element of lst obtained by aggregation is accumulated
      val newTags = tags.groupBy(x => x._1).mapValues(lst => lst.map(x => x._2).sum)
      (newIds, newTags)
    }
    resultRDD.foreach(println)

    sc.stop()
  }

  //  def main(args: Array[String]): Unit = {
  //    val lst = List(
  //      ("kw$big data", 1.0),
  //      ("kw$spark", 1.0),
  //      ("area$Xierqi", 1.0),
  //      ("kw$spark", 1.0),
  //      ("area$Wudaokou", 1.0),
  //      ("kw$hive", 1.0),
  //      ("kw$spark", 1.0),
  //      ("area$Xierqi", 1.0)
  //    )
  //
  //    lst.groupBy(x => x._1).map { case (key, value) => (key, value.map(x => x._2).sum) }.foreach(println)
  //    println("************************************************************")
  //
  //    lst.groupBy(x => x._1).mapValues(lst => lst.map(x => x._2).sum).foreach(println)
  //    println("************************************************************")
  //  }
}

8. Project summary
