Refer to the video tutorial:
**Big Data DMP User Portrait System**
Transcribed from a course by Teacher Wang Su of Qianfeng
1. Introduction to user portrait project
1.1 what is user portrait
A user portrait labels the user: it describes what kind of person the user is.
Specifically, after users are labeled, they are divided into different types according to differences in their goals, behaviors and views; typical features are then extracted from each type and given a name, a photo, some demographic elements, scenarios and other descriptions to form a persona.
The process abstracts customer information into a user portrait, and then turns that portrait into an understanding of the customer.
1.2 main dimensions of user portrait
Demographic attributes: who the user is (basic information such as gender, age, occupation)
Consumption demand: consumption habits and preferences
Purchasing power: income, purchasing power, purchase frequency and channels
Hobbies: brand preferences, personal interests
Social attributes: the scenarios where the user is active (social media, etc.)
1.3 data type of user portrait
Data includes dynamic data and static data. Static data, such as gender and age, cannot change in a short time; dynamic data is related to short-term behavior. For example, today I want to buy a skirt and tomorrow I will look at trousers; this kind of feature is relatively changeable.
1.4 purpose of user portrait
Price discrimination against existing customers ("big-data price gouging"), recommendation (very common) [the user portrait is an important data source for recommendation systems], marketing, customer service.
A win-win for users and enterprises: users quickly find the goods they want, and enterprises find the people who will pay for their products.
(1) Micro level
In product design, users' needs are described through user portraits.
In data application, it can be used for recommendation, search and risk control
Combine qualitative analysis and quantitative analysis to conduct data operation and user analysis
Precision marketing
(2) Macro level
Determine the strategic and tactical direction of development
Market segmentation and user clustering, market-oriented
(3) Portrait modeling and prediction
Subdivide population attributes: clarify who, what and why
Purchase Behavior Segmentation: provide key information such as market opportunities and market scale
Product demand segmentation: provide more differentiated and competitive product specifications and business value
Interest and attitude segmentation: provide group category portraits: channel strategy, pricing strategy, product strategy and brand strategy
1.5 steps of user portrait
(1) Determine the target of the portrait
In different product life cycles, or with different usage scenarios, the goals and needs for a portrait differ, so the goals and needs must be clarified before building the portrait.
(2) Determine the dimensions of the desired user portrait
Determine the dimensions required by the user portrait according to the objectives. For example, to recommend goods, you need the factors that can affect the user's choice of goods as portrait dimensions: the user dimension (the user's age and gender affect the choice), the asset dimension (the user's income and other factors affect the price range), the behavior dimension (what the user has viewed recently is probably what he wants to buy), and so on.
(3) Determine the level of the portrait
The more user portrait levels, the smaller the portrait granularity and the clearer the understanding of users. For example, the user dimension can be divided into new users and old users, and then into users' gender, age, etc. This needs to be divided according to the target requirements.
(4) Through the original data, machine learning algorithm is used to label users
Because the original data we get is disorderly, an algorithm is needed to label users based on certain features.
(5) The tag is transformed into the output of the business through the machine learning algorithm
Each user will have many tags, which need to be further transformed into an understanding of the user. Different weights need to be assigned to different labels to obtain business output, for example, which kind of products users with certain labels will like.
(6) The business generates data, which feeds back into the business, keeping a closed loop
1.6 common user portrait labels
2. System architecture
2.1 overall architecture (offline projects)
2.2 data processing flow (what to do)
ETL (Extract Transform Load) describes the process of extracting, transforming and loading data from the source to the destination;
ODS (Operational Data Store): the data in this layer is not changed; it directly follows the data structure and data of the peripheral systems and is not open to the outside. It is a staging layer, a temporary storage area for interface data, in preparation for the next processing step.
Main steps to be realized: ETL, report statistics (data analysis), generation of business district library, data tagging (core)
2.3 description of main data set (log data file to be analyzed)
- For the integrated log data, one copy per day in json format (offline processing)
- This data set integrates internal and external data, as well as bidding information (related to advertising)
- There are many columns of data, close to 100
Field | Explanation |
---|---|
ip | Real IP of the device |
sessionid | Session ID |
advertisersid | Advertiser ID |
adorderid | AD ID |
adcreativeid | Advertising creative ID (> = 200000: DSP, < 200000 OSS) |
adplatformproviderid | Advertising platform provider ID (> = 100000: RTB, < 100000: API) |
sdkversionnumber | SDK version number |
adplatformkey | Platform provider key |
putinmodeltype | For advertisers' delivery mode, 1: display volume delivery, 2: click volume delivery |
requestmode | Data request method (1: request, 2: display, 3: click) |
adprice | Advertising price |
adppprice | Platform provider price |
requestdate | Request time, format: yyyy-m-dd hh:mm:ss |
appid | Application id |
appname | Application name |
uuid | Unique identifier of the device, such as the IMEI or Android ID |
device | Device model, such as htc, iphone |
client | Device type (1: android 2: ios 3: wp) |
osversion | Device operating system version, such as 4.0 |
density | Device screen density: 0.75, 1 or 1.5 for Android; 1 or 2 for iOS |
pw | Device screen width |
ph | Device screen height |
provincename | Name of the province where the device is located |
cityname | Name of the city where the device is located |
ispid | Operator id |
ispname | Operator name |
networkmannerid | Networking mode id |
networkmannername | Networking mode name |
iseffective | Valid ID (valid refers to those that can be charged normally) (0: invalid 1: valid) |
isbilling | Charge or not (0: not charged 1: charged) |
adspacetype | Ad slot type (1: banner 2: interstitial 3: full screen) |
adspacetypename | Ad slot type name (banner, interstitial, full screen) |
devicetype | Device type (1: mobile phone 2: tablet) |
processnode | Process node (1: request quantity KPI 2: valid request 3: ad request) |
apptype | Application type id |
district | Name of the county/district where the device is located |
paymode | For the payment mode of platform providers, 1: display volume delivery (CPM) 2: click volume delivery (CPC) |
isbid | Whether it participated in RTB bidding |
bidprice | rtb bidding price |
winprice | Successful rtb bidding price |
iswin | Whether the bid was won |
cur | Currency; values: usd, rmb, etc. |
rate | exchange rate |
cnywinprice | rtb bidding successfully converted to RMB price |
imei | Mobile phone string code |
mac | Mobile phone MAC code |
idfa | Advertising code of mobile APP |
openudid | Apple device ID |
androidid | Android device ID |
rtbprovince | rtb Province |
rtbcity | rtb City |
rtbdistrict | rtb area |
rtbstreet | rtb Street |
storeurl | app market download address |
realip | Real ip |
isqualityapp | Preferred identification |
bidfloor | floor price |
aw | Width of advertising space |
ah | Height of advertising space |
imeimd5 | imei_md5 |
macmd5 | mac_md5 |
idfamd5 | idfa_md5 |
openudidmd5 | openudid_md5 |
androididmd5 | androidid_md5 |
imeisha1 | imei_sha1 |
macsha1 | mac_sha1 |
idfasha1 | idfa_sha1 |
openudidsha1 | openudid_sha1 |
androididsha1 | androidid_sha1 |
uuidunknow | Unknown UUID ciphertext |
userid | Platform user id |
iptype | Indicates the ip library type. 1 is the point media ip library and 2 is the ip geographic information standard library of the Advertising Association. The default is 1 |
initbidprice | Initial bid |
adpayment | Advertising consumption after conversion (6 decimal places reserved) |
agentrate | Agent profit margin |
lomarkrate | Agency profit margin |
adxrate | Media profit margin |
title | title |
keywords | keyword |
tagid | Advertising space identification (the value is the video ID number when the video traffic occurs) |
callbackdate | Callback time format: YYYY/mm/dd hh:mm:ss |
channelid | Channel ID |
mediatype | media type |
email | User email |
tel | User telephone number |
sex | User gender |
age | User age |
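For orientation, a minimal sketch (not from the original project) of peeking at a few of these columns with Spark SQL, assuming the daily JSON log sits at data/data.json, the local path used later in the ETL section:

```scala
import org.apache.spark.sql.SparkSession

object DataSetPeek {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataSetPeek").master("local[*]").getOrCreate()
    // Each line of the daily log file is one JSON record with the ~100 fields listed above
    val df = spark.read.json("data/data.json")
    df.printSchema()
    df.select("ip", "appname", "requestmode", "adspacetype", "provincename", "cityname")
      .show(5, truncate = false)
    spark.stop()
  }
}
```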
3. Create the project [ key points of the second day ]
3.1 steps
- Create a Maven project to process data
Maven manages all jars used in the project
- Modify the pom.xml file and add:
- Define dependent version
- Import dependency
- Define profile
- Create scala directories (in src and test respectively)
- Create the package cn.itbigdata.dmp in the scala directory
- Write the main program skeleton (DmpApp)
- Add the configuration file dev/application.conf
- Add the utils directory and the parameter parsing class ConfigHolder
3.2 modify pom.xml file [ key ]
- Set dependent version information
    <properties>
      <scala.version>2.11.8</scala.version>
      <scala.version.simple>2.11</scala.version.simple>
      <hadoop.version>2.6.1</hadoop.version>
      <spark.version>2.3.3</spark.version>
      <hive.version>1.1.0</hive.version>
      <fastjson.version>1.2.44</fastjson.version>
      <geoip.version>1.3.0</geoip.version>
      <geoip2.version>2.12.0</geoip2.version>
      <config.version>1.2.1</config.version>
    </properties>
- Import dependencies of the computing engine
    <!-- scala -->
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-xml</artifactId>
      <version>2.11.0-M4</version>
    </dependency>
    <!-- hadoop -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <!-- spark core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.version.simple}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <!-- spark sql -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.version.simple}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <!-- spark graphx -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-graphx_${scala.version.simple}</artifactId>
      <version>${spark.version}</version>
    </dependency>
- <font color=red>Import the storage engine dependency (can be omitted)</font>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-jdbc</artifactId>
      <exclusions>
        <exclusion>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-service-rpc</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-service</artifactId>
        </exclusion>
      </exclusions>
      <version>${hive.version}</version>
    </dependency>
- Import tool dependency
    <!-- used for IP address translation (longitude, latitude) -->
    <dependency>
      <groupId>com.maxmind.geoip</groupId>
      <artifactId>geoip-api</artifactId>
      <version>${geoip.version}</version>
    </dependency>
    <dependency>
      <groupId>com.maxmind.geoip2</groupId>
      <artifactId>geoip2</artifactId>
      <version>${geoip2.version}</version>
    </dependency>
    <!-- convert longitude/latitude to a geohash code -->
    <dependency>
      <groupId>ch.hsr</groupId>
      <artifactId>geohash</artifactId>
      <version>${geoip.version}</version>
    </dependency>
    <!-- scala json parsing -->
    <dependency>
      <groupId>org.json4s</groupId>
      <artifactId>json4s-jackson_${scala.version.simple}</artifactId>
      <version>3.6.5</version>
    </dependency>
    <!-- manage configuration files -->
    <dependency>
      <groupId>com.typesafe</groupId>
      <artifactId>config</artifactId>
      <version>${config.version}</version>
    </dependency>
- <font color=red>Import the build configuration (can be omitted)</font>
    <build>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <configuration>
            <source>1.8</source>
            <target>1.8</target>
          </configuration>
        </plugin>
        <plugin>
          <artifactId>maven-assembly-plugin</artifactId>
          <executions>
            <execution>
              <phase>package</phase>
              <goals>
                <goal>single</goal>
              </goals>
            </execution>
          </executions>
          <configuration>
            <archive>
              <manifest>
                <addClasspath>true</addClasspath>
                <mainClass>cn.itcast.dmp.processing.App</mainClass>
              </manifest>
            </archive>
            <descriptorRefs>
              <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
          </configuration>
        </plugin>
      </plugins>
    </build>
3.3 creating scala code packages
- Create a scala code package under src/main/
- Create the cn.itbigdata.dmp package in the scala package
- Create the following packages in cn.itbigdata.dmp:
  - beans (class definitions)
  - etl (ETL-related processing)
  - report (report processing)
  - tradingarea (business district library)
  - tags (tag processing)
  - customtrait (interface/trait definitions)
  - utils (utility classes)
- Create a data directory in the project root to store the data to be processed
3.4 DmpApp main program (initialization part)
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // The main program of the project, where the related tasks are orchestrated
    object DmpApp {
      def main(args: Array[String]): Unit = {
        // initialization
        val conf: SparkConf = new SparkConf()
          .setMaster("local")
          .setAppName("DmpApp")
          .set("spark.worker.timeout", "600s")
          .set("spark.cores.max", "10")
          .set("spark.rpc.askTimeout", "600s")
          .set("spark.network.timeout", "600s")
          .set("spark.task.maxFailures", "5")
          .set("spark.speculation", "true")
          .set("spark.driver.allowMultipleContexts", "true")
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .set("spark.buffer.pageSize", "8m")
          .set("spark.debug.maxToStringFields", "200")
        val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()

        // close resources
        spark.close()
      }
    }
3.5 explanation of spark related parameters
Parameter name | Default value | Configured value |
---|---|---|
spark.worker.timeout | 60 | 500 |
spark.network.timeout | 120s | 600s |
spark.rpc.askTimeout | spark.network.timeout | 600s |
spark.cores.max | 10 | |
spark.task.maxFailures | 4 | 5 |
spark.speculation | false | true |
spark.driver.allowMultipleContexts | false | true |
spark.serializer | org.apache.spark.serializer.JavaSerializer | org.apache.spark.serializer.KryoSerializer |
spark.buffer.pageSize | 1M - 64M, system calculation | 8M |
- spark.worker.timeout: if, because of a network failure, a worker does not report a heartbeat to the master for a long time, after spark.worker.timeout (seconds) the master detects the worker as abnormal, marks it as DEAD and removes the worker information together with the executor information on it;
- spark.network.timeout: the default timeout for all network interactions. If, because of the network or GC, the worker or executor receives no heartbeat feedback from the executor or task, increase spark.network.timeout to 300 (5 min) or higher as appropriate;
- spark.rpc.askTimeout: timeout of RPC calls;
- spark.cores.max: the maximum number of CPU cores each application can request;
- spark.task.maxFailures: a failed task does not directly bring down the whole application; the application only fails after a task has failed spark.task.maxFailures retries;
- spark.speculation: speculative execution means that for a slow-running task in a stage, the task is started again on the executors of other nodes. If one task instance finishes successfully, the result of the first completed task is taken as the final result and the instances running on other executors are killed, which speeds up the job;
- spark.driver.allowMultipleContexts: by default there is only one SparkContext instance; if set to true, multiple instances are allowed;
- spark.serializer: in the Spark architecture, objects transferred over the network or cached in memory and on disk need to be serialized (data sent to tasks on the executors; RDDs to be cached, provided serialized caching is used; broadcast variables; data cached during shuffle, etc.). The default Java serialization is not very fast and produces large serialized output; the Kryo serialization library is officially recommended, and according to the official documentation Kryo is roughly 10 times faster than Java serialization;
- spark.buffer.pageSize: the unit of Spark memory allocation. There is no default value; the size is between 1M and 64M and is calculated by Spark from the JVM heap size. Too small a value makes memory allocation inefficient; too large a value wastes memory;
3.6 development environment parameter configuration file
application.conf
    # Development environment parameter profile

    # App information
    spark.appname="dmpApp"

    # spark information
    spark.master="local[*]"
    spark.worker.timeout="120"
    spark.cores.max="10"
    spark.rpc.askTimeout="600s"
    spark.network.timeout="600s"
    spark.task.maxFailures="5"
    spark.speculation="true"
    spark.driver.allowMultipleContexts="true"
    spark.serializer="org.apache.spark.serializer.KryoSerializer"
    spark.buffer.pageSize="8m"

    # kudu information
    kudu.master="node1:7051,node2:7051,node3:7051"

    # Input data information
    addata.path="data/dataset_main.json"
    ipdata.geo.path="data/dataset_geoLiteCity.dat"
    qqwrydat.path="data/dataset_qqwry.dat"
    installDir.path="data"

    # Corresponding ETL output information
    ods.prefix="ods"
    ad.data.tablename="adinfo"

    # The output reports correspond to 7 analyses: region stat, ad region, APP, device, network, ISP and channel
    report.region.stat.tablename="RegionStatAnalysis"
    report.region.tablename="AdRegionAnalysis"
    report.app.tablename="AppAnalysis"
    report.device.tablename="DeviceAnalysis"
    report.network.tablename="NetworkAnalysis"
    report.isp.tablename="IspAnalysis"
    report.channel.tablename="ChannelAnalysis"

    # Gaode (AMap) API
    gaoDe.app.key="a94274923065a14222172c9b933f4a4e"
    gaoDe.url="https://restapi.amap.com/v3/geocode/regeo?"

    # Geohash (length of key)
    geohash.key.length=10

    # Business district library
    trading.area.tablename="tradingArea"

    # tags
    non.empty.field="imei,mac,idfa,openudid,androidid,imeimd5,macmd5,idfamd5,openudidmd5,androididmd5,imeisha1,macsha1,idfasha1,openudidsha1,androididsha1"
    appname.dic.path="data/dic_app"
    device.dic.path="data/dic_device"
    tags.table.name.prefix="tags"
    # Tag attenuation factor
    tag.coeff="0.92"

    # es related parameters
    es.cluster.name="cluster_es"
    es.index.auto.create="true"
    es.Nodes="192.168.40.164"
    es.port="9200"
    es.index.reads.missing.as.empty="true"
    es.nodes.discovery="false"
    es.nodes.wan.only="true"
    es.http.timeout="2000000"
3.7 configuration file parsing class
    // Helper class for parsing the parameter file
    import com.typesafe.config.ConfigFactory

    object ConfigHolder {
      private val config = ConfigFactory.load()

      // App Info
      lazy val sparkAppName: String = config.getString("spark.appname")

      // Spark parameters
      lazy val sparkMaster: String = config.getString("spark.master")
      lazy val sparkParameters: List[(String, String)] = List(
        ("spark.worker.timeout", config.getString("spark.worker.timeout")),
        ("spark.cores.max", config.getString("spark.cores.max")),
        ("spark.rpc.askTimeout", config.getString("spark.rpc.askTimeout")),
        ("spark.network.timeout", config.getString("spark.network.timeout")),
        ("spark.task.maxFailures", config.getString("spark.task.maxFailures")),
        ("spark.speculation", config.getString("spark.speculation")),
        ("spark.driver.allowMultipleContexts", config.getString("spark.driver.allowMultipleContexts")),
        ("spark.serializer", config.getString("spark.serializer")),
        ("spark.buffer.pageSize", config.getString("spark.buffer.pageSize"))
      )

      // kudu parameters
      lazy val kuduMaster: String = config.getString("kudu.master")

      // input dataset
      lazy val adDataPath: String = config.getString("addata.path")
      lazy val ipsDataPath: String = config.getString("ipdata.geo.path")
      def ipToRegionFilePath: String = config.getString("qqwrydat.path")
      def installDir: String = config.getString("installDir.path")

      // output dataset
      private lazy val delimiter = "_"
      private lazy val odsPrefix: String = config.getString("ods.prefix")
      private lazy val adInfoTableName: String = config.getString("ad.data.tablename")
      // lazy val ADMainTableName = s"$odsPrefix$delimiter$adInfoTableName$delimiter${DateUtils.getTodayDate()}"

      // report
      lazy val Report1RegionStatTableName: String = config.getString("report.region.stat.tablename")
      lazy val ReportRegionTableName: String = config.getString("report.region.tablename")
      lazy val ReportAppTableName: String = config.getString("report.app.tablename")
      lazy val ReportDeviceTableName: String = config.getString("report.device.tablename")
      lazy val ReportNetworkTableName: String = config.getString("report.network.tablename")
      lazy val ReportIspTableName: String = config.getString("report.isp.tablename")
      lazy val ReportChannelTableName: String = config.getString("report.channel.tablename")

      // GaoDe API
      private lazy val gaoDeKey: String = config.getString("gaoDe.app.key")
      private lazy val gaoDeTempUrl: String = config.getString("gaoDe.url")
      lazy val gaoDeUrl: String = s"$gaoDeTempUrl&key=$gaoDeKey"

      // GeoHash
      lazy val keyLength: Int = config.getInt("geohash.key.length")

      // Business district library
      lazy val tradingAreaTableName: String = config.getString("trading.area.tablename")

      // tags
      lazy val idFields: String = config.getString("non.empty.field")
      lazy val filterSQL: String = idFields
        .split(",")
        .map(field => s"$field is not null ")
        .mkString(" or ")
      lazy val appNameDic: String = config.getString("appname.dic.path")
      lazy val deviceDic: String = config.getString("device.dic.path")
      lazy val tagsTableNamePrefix: String = config.getString("tags.table.name.prefix") + delimiter
      lazy val tagCoeff: Double = config.getDouble("tag.coeff")

      // Load elasticsearch related parameters
      lazy val ESSparkParameters = List(
        ("cluster.name", config.getString("es.cluster.name")),
        ("es.index.auto.create", config.getString("es.index.auto.create")),
        ("es.nodes", config.getString("es.Nodes")),
        ("es.port", config.getString("es.port")),
        ("es.index.reads.missing.as.empty", config.getString("es.index.reads.missing.as.empty")),
        ("es.nodes.discovery", config.getString("es.nodes.discovery")),
        ("es.nodes.wan.only", config.getString("es.nodes.wan.only")),
        ("es.http.timeout", config.getString("es.http.timeout"))
      )

      def main(args: Array[String]): Unit = {
        println(ConfigHolder.sparkParameters)
        println(ConfigHolder.installDir)
      }
    }
3.8 DmpApp main program (using configuration file)
    import cn.itbigdata.dmp.utils.ConfigHolder
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    object DmpApp {
      def main(args: Array[String]): Unit = {
        // 1. Initialization (SparkConf, SparkSession)
        val conf = new SparkConf()
          .setAppName(ConfigHolder.sparkAppName)
          .setMaster(ConfigHolder.sparkMaster)
          .setAll(ConfigHolder.sparkParameters)
        val spark: SparkSession = SparkSession.builder()
          .config(conf)
          .getOrCreate()
        spark.sparkContext.setLogLevel("warn")
        println("OK!")

        // 1. ETL
        // 2. Reports
        // 3. Generate the business district library
        // 4. Tagging

        // close resources
        spark.close()
      }
    }
4. ETL development
Requirements:
- Convert the ip address in each line of the data file into longitude, latitude, province and city information;
IP => longitude, latitude, province, city
- Save the converted data file (one file per day)
Processing steps:
- Read data
- data processing
- Find the ip address in each row of data
- According to the ip address, calculate the corresponding province, city, longitude and latitude, and add them to the tail of each line of data
- Save data
- Other requirements: the data is loaded once a day, and the daily data is stored in a separate file
Difficult problem: data processing (how to convert IP address into province, city, longitude and latitude)
4.1 build ETL architecture
Create a new trait (Processor) to provide a unified interface class for data processing
    import org.apache.spark.sql.SparkSession

    // Data processing interface
    // SparkSession is used for data loading and processing
    // KuduContext is used to save data
    trait Processor {
      def process(spark: SparkSession): Unit
    }
Create a new ETLProcessor to be responsible for ETL processing
    import cn.itbigdata.dmp.customtrait.Processor
    import cn.itbigdata.dmp.utils.ConfigHolder
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object ETLProcessor extends Processor {
      override def process(spark: SparkSession): Unit = {
        // Define parameters
        val sourceDataFile = ConfigHolder.adDataPath
        val sinkDataPath = ""

        // 1 read data
        val sourceDF: DataFrame = spark.read.json(sourceDataFile)

        // 2 process data
        // 2.1 extract the ip
        // 2.2 convert the ip to province, city, longitude and latitude
        val rdd = sourceDF.rdd
          .map(row => {
            val ip: String = row.getAs[String]("ip")
            ip
          })
        // 2.3 append the province, city, longitude and latitude to the end of each line

        // 3 save data
      }
    }
4.2 IP address conversion to latitude and longitude
- Use GeoIP to convert the ip address to latitude and longitude
- GeoIP is a set of software tools with IP database
- GeoIP locates the IP's longitude/latitude, country/region, province, city, street and other location information according to the visitor's IP
- GeoIP is available in two versions, a free version and a paid version
- The accuracy of the charged version is higher and the update frequency is more frequent
- Because GeoIP reads the local binary IP database, it is very efficient
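As a quick illustration, a minimal sketch (not from the original project) of looking up longitude/latitude with the legacy GeoIP API; it assumes the free GeoLiteCity.dat file has been downloaded to data/, and the sample IP is a placeholder:

```scala
import com.maxmind.geoip.{Location, LookupService}

object GeoIpQuickCheck {
  def main(args: Array[String]): Unit = {
    // GEOIP_MEMORY_CACHE loads the whole binary database into memory for fast lookups
    val service = new LookupService("data/geoLiteCity.dat", LookupService.GEOIP_MEMORY_CACHE)
    val loc: Location = service.getLocation("123.125.71.38") // sample IP, for illustration only
    if (loc != null) {
      println(s"country=${loc.countryName}, city=${loc.city}, " +
        s"longitude=${loc.longitude}, latitude=${loc.latitude}")
    }
    service.close()
  }
}
```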
4.3 IP address conversion
- Chunzhen (QQWry) IP database: converts an IP into province and city
- The Chunzhen database collects IP address data of ISPs including China Telecom, China Mobile, China Unicom, Great Wall Broadband and Juyou Broadband
- The Chunzhen database is a binary file with open-source Java access code, which can be slightly modified and then called
    case class Location(ip: String, region: String, city: String, longitude: Float, latitude: Float)

    private def ipToLocation(ip: String): Location = {
      // 1 get the GeoIP lookup service
      val service = new LookupService("data/geoLiteCity.dat")
      // 2 get the GeoIP location
      val longAndLatLocation = service.getLocation(ip)
      // 3 obtain longitude and latitude
      val longitude = longAndLatLocation.longitude
      val latitude = longAndLatLocation.latitude
      // 4 use the Chunzhen (QQWry) database to obtain province and city information
      val ipService = new IPAddressUtils
      val regionLocation: IPLocation = ipService.getregion(ip)
      val region = regionLocation.getRegion
      val city = regionLocation.getCity

      Location(ip, region, city, longitude, latitude)
    }
Helper class to implement:
- Calculate the date of the day
    import java.util.{Calendar, Date}
    import org.apache.commons.lang.time.FastDateFormat

    object DateUtils {
      def getToday: String = {
        val now = new Date
        FastDateFormat.getInstance("yyyyMMdd").format(now)
      }

      def getYesterday: String = {
        val calendar: Calendar = Calendar.getInstance
        // go back one day
        calendar.add(Calendar.DATE, -1)
        FastDateFormat.getInstance("yyyyMMdd").format(calendar.getTime)
      }

      def main(args: Array[String]): Unit = {
        println(getToday)
        println(getYesterday)
      }
    }
4.4 complete realization of ETL
    import java.util.Calendar

    import cn.itbigdata.dmp.customtrait.Processor
    import cn.itbigdata.dmp.util.iplocation.{IPAddressUtils, IPLocation}
    import com.maxmind.geoip.LookupService
    import org.apache.commons.lang3.time.FastDateFormat
    import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

    object ETLProcessor extends Processor {
      // Define parameters
      private val sourceDataFile: String = "data/data.json"
      // use the same "maindata-<date>" naming that the report processors read
      private val sinkDataPath: String = s"outputdata/maindata-$getYesterday"
      private val geoFilePath: String = "data/geoLiteCity.dat"

      override def process(spark: SparkSession): Unit = {
        // 1 read data
        val sourceDF: DataFrame = spark.read.json(sourceDataFile)

        // 2 process data
        // 2.1 extract the ip
        // 2.2 convert the ip to province, city, longitude and latitude
        import spark.implicits._
        val ipDF: DataFrame = sourceDF.rdd
          .map { row =>
            val ip = row.getAs[String]("ip")
            // Convert ip to province, city, longitude and latitude
            ip2Location(ip)
          }.toDF

        // 2.3 join ipDF with sourceDF to add province, city, longitude and latitude to each line
        val sinkDF: DataFrame = sourceDF.join(ipDF, Seq("ip"), "inner")

        // 3 save data
        sinkDF.write.mode(SaveMode.Overwrite).json(sinkDataPath)
      }

      case class Location(ip: String, region: String, city: String, longitude: Float, latitude: Float)

      private def ip2Location(ip: String): Location = {
        // 1 get the GeoIP lookup service
        val service = new LookupService(geoFilePath)
        // 2 get the GeoIP location
        val longAndLatLocation = service.getLocation(ip)
        // 3 obtain longitude and latitude
        val longitude = longAndLatLocation.longitude
        val latitude = longAndLatLocation.latitude
        // 4 use the Chunzhen (QQWry) database to obtain province and city information
        val ipService = new IPAddressUtils
        val regionLocation: IPLocation = ipService.getregion(ip)
        val region = regionLocation.getRegion
        val city = regionLocation.getCity

        Location(ip, region, city, longitude, latitude)
      }

      private def getYesterday: String = {
        val calendar: Calendar = Calendar.getInstance
        // go back one day
        calendar.add(Calendar.DATE, -1)
        FastDateFormat.getInstance("yyyyMMdd").format(calendar.getTime)
      }
    }
5. Report development (data analysis – SparkSQL)
Reports to be processed
- Count the quantity distribution of each region (RegionStatProcessor)
- Statistics of the regional distribution of advertising (RegionAnalysisProcessor)
- APP distribution statistics (AppAnalysisProcessor)
- Mobile device type distribution statistics (DeviceAnalysisProcessor)
- Network type distribution statistics (NetworkAnalysisProcessor)
- Distribution statistics of network operators (IspAnalysisProcessor)
- Channel distribution statistics (ChannelAnalysisProcessor)
5.1 geographical distribution of data
- Steps of report processing
- Understand the business requirement: find the distribution of data volume grouped by province and city
- Source data: daily log data, i.e. ETL result data;
- Target data: saved in a local file, and each report corresponds to a directory;
- Write SQL and test
- code implementation
- Define RegionStatProcessor, which inherits from Processor and implements process method. The specific implementation steps are as follows:
    import cn.itbigdata.dmp.customtrait.Processor
    import cn.itbigdata.dmp.utils.{ConfigHolder, DateUtils}
    import org.apache.spark.sql.{SaveMode, SparkSession}

    object RegionStatProcessor extends Processor {
      override def process(spark: SparkSession): Unit = {
        // Define parameters
        val sourceDataPath = s"outputdata/maindata-${DateUtils.getYesterday}"
        val sinkDataPath = "output/regionstat"

        // read file
        val sourceDF = spark.read.json(sourceDataPath)
        sourceDF.createOrReplaceTempView("adinfo")

        // process data
        val RegionSQL1 =
          """
            |select to_date(now()) as statdate, region, city, count(*) as infocount
            |  from adinfo
            | group by region, city
            |""".stripMargin
        val sinkDF = spark.sql(RegionSQL1)
        sinkDF.show()

        // write file
        sinkDF.coalesce(1).write.mode(SaveMode.Append).json(sinkDataPath)
      }
    }
5.2 regional distribution of advertising
Complete the report as required in the way shown below
Note: three rates are required: bidding success rate, advertiser click-through rate and media click-through rate
Index calculation logic
Index | Explanation | adplatformproviderid | requestmode | processnode | iseffective | isbilling | isbid | iswin | adorderid | adcreativeid |
---|---|---|---|---|---|---|---|---|---|---|
Original requests | Number of all original requests sent | | 1 | >=1 | | | | | | |
Valid requests | Number of valid requests | | 1 | >=2 | | | | | | |
Ad requests | Number of requests that qualify as ad requests | | 1 | 3 | | | | | | |
Bids participated | Number of times bidding was participated in | >=100000 | | | 1 | 1 | 1 | | !=0 | |
Bids won | Number of successful bids | >=100000 | | | 1 | 1 | | 1 | | |
(Advertiser) impressions | For advertisers: the number of ads finally displayed on the terminal | | 2 | | 1 | | | | | |
(Advertiser) clicks | For advertisers: the number of displayed ads actually clicked | | 3 | | 1 | | | | | |
(Media) impressions | For media: the number of ads displayed on the terminal | | 2 | | 1 | 1 | | | | |
(Media) clicks | For media: the number of displayed ads actually clicked | | 3 | | 1 | 1 | | | | |
DSP advertising consumption | winprice/1000 | >=100000 | | | 1 | 1 | | 1 | >200000 | >200000 |
DSP advertising cost | adpayment/1000 | >=100000 | | | 1 | 1 | | 1 | >200000 | >200000 |
DSP advertising consumption = DSP RTB money
DSP advertising cost = money paid by advertisers to DSP
DSP's profit = DSP advertising cost - DSP advertising consumption
Note: corresponding fields:
OriginalRequest,ValidRequest,adRequest
bidsNum,bidsSus,bidRate
adDisplayNum,adClickNum,adClickRate
MediumDisplayNum,MediumClickNum,MediumClickRate
adconsume,adcost
code implementation
    import cn.itbigdata.dmp.customtrait.Processor
    import cn.itbigdata.dmp.utils.DateUtils
    import org.apache.spark.sql.{SaveMode, SparkSession}

    object RegionAnalysisProcessor extends Processor {
      override def process(spark: SparkSession): Unit = {
        // Define parameters
        val sourceDataPath = s"outputdata/maindata-${DateUtils.getYesterday}"
        val sinkDataPath = "outputdata/regionanalysis"

        // read file
        val sourceDF = spark.read.json(sourceDataPath)
        sourceDF.createOrReplaceTempView("adinfo")

        // process data
        val RegionSQL1 =
          """
            |select to_date(now()) statdate, region, city,
            |       sum(case when requestmode=1 and processnode>=1 then 1 else 0 end) as OriginalRequest,
            |       sum(case when requestmode=1 and processnode>=2 then 1 else 0 end) as ValidRequest,
            |       sum(case when requestmode=1 and processnode=3  then 1 else 0 end) as adRequest,
            |       sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1 and isbid=1 and adorderid!=0
            |                then 1 else 0 end) as bidsNum,
            |       sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1 and iswin=1
            |                then 1 else 0 end) as bidsSus,
            |       sum(case when requestmode=2 and iseffective=1 then 1 else 0 end) as adDisplayNum,
            |       sum(case when requestmode=3 and iseffective=1 then 1 else 0 end) as adClickNum,
            |       sum(case when requestmode=2 and iseffective=1 and isbilling=1 then 1 else 0 end) as MediumDisplayNum,
            |       sum(case when requestmode=3 and iseffective=1 and isbilling=1 then 1 else 0 end) as MediumClickNum,
            |       sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1
            |                 and iswin=1 and adorderid>200000 and adcreativeid>200000
            |                then winprice/1000 else 0 end) as adconsume,
            |       sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1
            |                 and iswin=1 and adorderid>200000 and adcreativeid>200000
            |                then adpayment/1000 else 0 end) as adcost
            |  from adinfo
            | group by region, city
          """.stripMargin
        spark.sql(RegionSQL1).createOrReplaceTempView("tabtemp")

        val RegionSQL2 =
          """
            |select statdate, region, city,
            |       OriginalRequest, ValidRequest, adRequest,
            |       bidsNum, bidsSus, bidsSus/bidsNum as bidRate,
            |       adDisplayNum, adClickNum, adClickNum/adDisplayNum as adClickRate,
            |       MediumDisplayNum, MediumClickNum, MediumClickNum/MediumDisplayNum as mediumClickRate,
            |       adconsume, adcost
            |  from tabtemp
          """.stripMargin
        val sinkDF = spark.sql(RegionSQL2)

        // write file
        sinkDF.coalesce(1).write.mode(SaveMode.Append).json(sinkDataPath)
      }
    }
6. Data tagging
6.1 what is data tagging
- Why label data
- Analyze the data requirements
  - The user's demand for data search must support conditional screening of target people, for example:
    - Region, even business district
    - Gender
    - Age
    - Interests
    - Device
- Data format
  - Target data: (user id, all tags). The tags look like:
    (CH@123485 -> 1.0, KW@word -> 1.0, CT@Beijing -> 1.0, GD@female -> 1.0, AGE@40 -> 1.0, TA@Beihai -> 1.0, TA@Beach -> 1.0)
  - Tag data is organized as Map[String, Double]
  - Prefix + label; 1.0 is the weight (see the sketch after this list)
- Tags to be produced
  - Advertising type
  - Channel
  - App name
  - Gender
  - Geographic location
  - Device
  - Keyword
  - Age
  - Business district (temporarily ignored)
- Tagging of the log data
  - Compute the tags (ad type, channel, AppName, gender, ...)
  - Extract the user IDs
  - Unified user identification
  - Tag data landing (persistence)
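A minimal sketch (not from the original project) of the tag structure just described: one user's tags held as a Map[String, Double], where the key is prefix + label and the value is the weight; the ids and label values below are made-up examples.

```scala
object TagFormatDemo {
  def main(args: Array[String]): Unit = {
    // One user's tags: prefix + label -> weight
    val tags: Map[String, Double] = Map(
      "adtype@banner"  -> 1.0,
      "channel@123485" -> 1.0,
      "appname@Zhihu"  -> 1.0,
      "sex@female"     -> 1.0,
      "age@40"         -> 1.0,
      "keyword@beach"  -> 1.0
    )
    // Merging two tag maps sums the weights of identical keys,
    // which is what the later aggregation step relies on
    val more = Map("keyword@beach" -> 1.0, "region@Beijing" -> 1.0)
    val merged = (tags.toList ++ more.toList)
      .groupBy(_._1)
      .mapValues(_.map(_._2).sum)
    merged.foreach(println)
  }
}
```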
6.2 framing
    import cn.itbigdata.dmp.customtrait.Processor
    import cn.itbigdata.dmp.utils.ConfigHolder
    import org.apache.kudu.spark.kudu._
    import org.apache.spark.sql.SparkSession

    object TagProcessor extends Processor {
      override def process(spark: SparkSession, kudu: KuduContext): Unit = {
        // Define parameters
        val sourceTableName = ConfigHolder.ADMainTableName
        val sinkTableName = ""
        val keys = ""

        // 1 read data
        val sourceDF = spark.read
          .option("kudu.master", ConfigHolder.kuduMaster)
          .option("kudu.table", sourceTableName)
          .kudu

        // 2 process data
        sourceDF.rdd
          .map(row => {
            // Advertising type, channel and App name
            val adTags = AdTypeTag.make(row)
            // Gender, geographical location, device
            // Keywords, age, business district
          })

        // 3 save data
      }
    }
Define interface classes:
    import org.apache.spark.sql.Row

    trait TagMaker {
      def make(row: Row, dic: collection.Map[String, String] = null): Map[String, Double]
    }
6.3 labeling
6.3.1 ad type (AdTypeTag)
Field adspacetype: 1: banner; 2: interstitial; 3: full screen
    import cn.itbigdata.dmp.customtrait.TagMaker
    import org.apache.spark.sql.Row

    object AdTypeTag extends TagMaker {
      private val adPrefix = "adtype@"

      override def make(row: Row, dic: collection.Map[String, String]): Map[String, Double] = {
        // 1 get the value: 1: banner, 2: interstitial, 3: full screen
        val adType: Long = row.getAs[Long]("adspacetype")
        // 2 compute the tag
        adType match {
          case 1 => Map(s"${adPrefix}banner" -> 1.0)
          case 2 => Map(s"${adPrefix}interstitial" -> 1.0)
          case 3 => Map(s"${adPrefix}fullscreen" -> 1.0)
          case _ => Map[String, Double]()
        }
      }
    }
6.3.2 channel (ChannelTag)
Field: channelid
    import cn.itbigdata.dmp.customtrait.TagMaker
    import org.apache.commons.lang3.StringUtils
    import org.apache.spark.sql.Row

    object ChannelTag extends TagMaker {
      private val channelPrefix = "channel@"

      override def make(row: Row, dic: collection.Map[String, String] = null): Map[String, Double] = {
        // 1 get the value
        val channelid = row.getAs[String]("channelid")
        // 2 compute the tag (use the actual channel id, not the literal string "channelid")
        if (StringUtils.isNotBlank(channelid)) {
          Map(s"$channelPrefix$channelid" -> 1.0)
        } else {
          Map[String, Double]()
        }
      }

      def main(args: Array[String]): Unit = {
        // isNotBlank: the string is not null, not empty and not made up only of whitespace
        if (!StringUtils.isNotBlank(null)) println("blank1 !")
        if (!StringUtils.isNotBlank("")) println("blank2 !")
        if (!StringUtils.isNotBlank(" ")) println("blank3 !")
      }
    }
6.3.3 App name (AppNameTag)
Field: appid; To convert appid to appname
Look up the given dictionary table: dicapp
    import cn.itbigdata.dmp.customtrait.TagMaker
    import org.apache.commons.lang3.StringUtils
    import org.apache.spark.sql.Row

    object AppNameTag extends TagMaker {
      // tag prefix
      private val appNamePrefix = "appname@"

      override def make(row: Row, dic: collection.Map[String, String]): Map[String, Double] = {
        // 1 get the appid
        val appId = row.getAs[String]("appid")
        // 2 look up the app name in the dictionary and return the tag
        val appName = dic.getOrElse(appId, "")
        if (StringUtils.isNotBlank(appName))
          Map(s"$appNamePrefix$appName" -> 1.0)
        else
          Map[String, Double]()
      }
    }
6.3.4 gender (SexTag)
Field: sex;
    import cn.itbigdata.dmp.customtrait.TagMaker
    import org.apache.commons.lang3.StringUtils
    import org.apache.spark.sql.Row

    object SexTag extends TagMaker {
      private val sexPrefix: String = "sex@"

      override def make(row: Row, dic: collection.Map[String, String] = null): Map[String, Double] = {
        // get the tag information
        val sexid: String = row.getAs[String]("sex")
        val sex = sexid match {
          case "1" => "male"
          case "2" => "female"
          case _   => "unknown"
        }
        // compute and return the tag
        if (StringUtils.isNotBlank(sex))
          Map(s"$sexPrefix$sex" -> 1.0)
        else
          Map[String, Double]()
      }
    }
6.3.5 geographic location (GeoTag)
Fields: region, city
    import cn.itbigdata.dmp.customtrait.TagMaker
    import org.apache.commons.lang3.StringUtils
    import org.apache.spark.sql.Row

    object GeoTag extends TagMaker {
      private val regionPrefix = "region@"
      private val cityPrefix = "city@"

      override def make(row: Row, dic: collection.Map[String, String] = null): Map[String, Double] = {
        // get the tag information
        val region = row.getAs[String]("region")
        val city = row.getAs[String]("city")
        // compute and return the tag information
        val regionTag =
          if (StringUtils.isNotBlank(region)) Map(s"$regionPrefix$region" -> 1.0)
          else Map[String, Double]()
        val cityTag =
          if (StringUtils.isNotBlank(city)) Map(s"$cityPrefix$city" -> 1.0)
          else Map[String, Double]()

        regionTag ++ cityTag
      }
    }
6.3.6 Device (DeviceTag)
Fields: client, networkmannername, ispname;
Data dictionary: dicdevice
- client: device type (1: android 2: ios 3: wp 4: others)
- networkmannername: name of networking mode (2G, 3G, 4G, others)
- ispname: operator name (Telecom, China Mobile, China Unicom...)
    import cn.itbigdata.dmp.customtrait.TagMaker
    import org.apache.commons.lang3.StringUtils
    import org.apache.spark.sql.Row

    object DeviceTag extends TagMaker {
      val clientPrefix = "client@"
      val networkPrefix = "network@"
      val ispPrefix = "isp@"

      override def make(row: Row, dic: collection.Map[String, String]): Map[String, Double] = {
        // get the tag information (client is numeric; the two names are strings)
        val clientName: String = row.getAs[Long]("client").toString
        val networkName: String = row.getAs[String]("networkmannername")
        val ispName: String = row.getAs[String]("ispname")

        // look up the dictionary ids and return the tags
        val clientId = dic.getOrElse(clientName, "D00010004")
        val networkId = dic.getOrElse(networkName, "D00020005")
        val ispId = dic.getOrElse(ispName, "D00030004")

        val clientTag =
          if (StringUtils.isNotBlank(clientId)) Map(s"$clientPrefix$clientId" -> 1.0)
          else Map[String, Double]()
        val networkTag =
          if (StringUtils.isNotBlank(networkId)) Map(s"$networkPrefix$networkId" -> 1.0)
          else Map[String, Double]()
        val ispTag =
          if (StringUtils.isNotBlank(ispId)) Map(s"$ispPrefix$ispId" -> 1.0)
          else Map[String, Double]()

        clientTag ++ networkTag ++ ispTag
      }
    }
6.3.7 keywords (KeywordTag)
Fields: keywords
    import cn.itbigdata.dmp.customtrait.TagMaker
    import org.apache.commons.lang3.StringUtils
    import org.apache.spark.sql.Row

    object KeywordsTag extends TagMaker {
      private val keywordPrefix = "keyword@"

      override def make(row: Row, dic: collection.Map[String, String]): Map[String, Double] = {
        // guard against a null keywords field before splitting
        Option(row.getAs[String]("keywords")).getOrElse("")
          .split(",")
          .filter(word => StringUtils.isNotBlank(word))
          .map(word => s"$keywordPrefix$word" -> 1.0)
          .toMap
      }
    }
6.3.8 age (AgeTag)
Field: age
    import cn.itbigdata.dmp.customtrait.TagMaker
    import org.apache.commons.lang3.StringUtils
    import org.apache.spark.sql.Row

    object AgeTag extends TagMaker {
      private val agePrefix = "age@"

      override def make(row: Row, dic: collection.Map[String, String]): Map[String, Double] = {
        val age = row.getAs[String]("age")
        if (StringUtils.isNotBlank(age))
          Map(s"$agePrefix$age" -> 1.0)
        else
          Map[String, Double]()
      }
    }
6.3.9 main handler (TagProcessor)
    import cn.itbigdata.dmp.customtrait.Processor
    import cn.itbigdata.dmp.utils.DateUtils
    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.sql.SparkSession

    object TagProcessor extends Processor {
      override def process(spark: SparkSession): Unit = {
        // Define parameters
        val sourceTableName = s"outputdata/maindata-${DateUtils.getYesterday}"
        val appdicFilePath = "data/dicapp"
        val deviceFilePath = "data/dicdevice"
        val sinkTableName = ""

        // 1 read data
        val sourceDF = spark.read.json(sourceTableName)

        // Read app information (file) and convert it into a broadcast variable (optimization)
        val appdicMap = spark.sparkContext.textFile(appdicFilePath)
          .map(line => {
            val arr: Array[String] = line.split("##")
            (arr(0), arr(1))
          }).collectAsMap()
        val appdicBC: Broadcast[collection.Map[String, String]] = spark.sparkContext.broadcast(appdicMap)

        // Read device dictionary information (file) and convert it into a broadcast variable (optimization)
        val deviceMap = spark.sparkContext.textFile(deviceFilePath)
          .map(line => {
            val arr: Array[String] = line.split("##")
            (arr(0), arr(1))
          }).collectAsMap()
        val deviceBC: Broadcast[collection.Map[String, String]] = spark.sparkContext.broadcast(deviceMap)

        // 2 process data
        sourceDF.printSchema()
        sourceDF.rdd
          .map(row => {
            // Advertising type, channel and App name
            val adTags: Map[String, Double] = AdTypeTag.make(row)
            val channelTags: Map[String, Double] = ChannelTag.make(row)
            val appNameTags: Map[String, Double] = AppNameTag.make(row, appdicBC.value)
            // Gender, geographical location, device type
            val sexTags = SexTag.make(row)
            val geoTags = GeoTag.make(row)
            val deviceTags = DeviceTag.make(row, deviceBC.value)
            // Keywords, age
            val keywordsTags = KeywordsTag.make(row)
            val ageTags = AgeTag.make(row)
            // Merge everything into one big Map
            val tags = adTags ++ channelTags ++ appNameTags ++ sexTags ++ geoTags ++
              deviceTags ++ keywordsTags ++ ageTags
            tags
          }).collect.foreach(println)

        // 3 save data
      }
    }
6.4 extracting user ID
- The log data records a user's single browsing behavior
- A user may have multiple records in one day
- The tags, however, are attached to people
- Problem: we need to extract the concept of a "person" from the data set, so that one person corresponds to one record
- No usable user id can be found in the log information, so we settle for second best: use device information to identify the user:
  - IMEI: International Mobile Equipment Identity, commonly known as the mobile phone serial number, identifies each independent mobile phone or other mobile communication device in the mobile network; it is effectively the phone's ID card. The IMEI is written on the motherboard, and reinstalling the APP does not change it. Android 6.0 and above require the user to grant the READ_PHONE_STATE permission; if the user refuses, it cannot be obtained;
  - IDFA: introduced in iOS 6, it allows advertising effectiveness to be measured while keeping the user's device from being tracked by the APP. It may change, for example after a system reset or when the advertising identifier is reset in Settings. Users can also enable "Limit Ad Tracking" in Settings;
  - MAC address: hardware identifier, including the WiFi MAC address and the Bluetooth MAC address. Access has been prohibited since iOS 7;
  - OpenUDID: when iOS 5 was released, UDID was deprecated, which pushed developers to look for a replacement that was not controlled by Apple. OpenUDID became the most widely used open-source UDID alternative at the time. It is very simple to integrate into a project and is supported by a number of advertising providers;
  - Android ID: when the device is started for the first time, the system randomly generates a 64-bit number and saves it as a hexadecimal string; this is ANDROID_ID, and it is reset when the device is wiped;
- The fields in the log data that can be used to identify users are:
  - imei, mac, idfa, openudid, androidid
  - imeimd5, macmd5, idfamd5, openudidmd5, androididmd5
  - imeisha1, macsha1, idfasha1, openudidsha1, androididsha1
- What is invalid data: if all 15 fields above are empty, the record cannot be associated with any user; it is useless to us, it is invalid data, and it needs to be removed.
    // Filter out rows where all 15 identification fields are empty
    lazy val filterSQL: String = idFields
      .split(",")
      .map(field => s"$field != ''")
      .mkString(" or ")

    // Extract the user IDs
    val userIds = getUserIds(row)
    // Return the tags keyed by the first user id
    (userIds.head, (userIds, tags))

    // Extract the user IDs
    private def getUserIds(row: Row): List[String] = {
      val userIds: List[String] = idFields.split(",")
        .map(field => (field, row.getAs[String](field)))
        .filter { case (key, value) => StringUtils.isNotBlank(value) }
        .map { case (key, value) => s"$key::$value" }
        .toList
      userIds
    }
6.5 user identification
- Fifteen (non-empty) fields are used jointly to identify a user
- During data collection:
  - the fields collected each time may differ
  - some fields may also change
- How, then, do we recognize that several records belong to the same user?
6.6 user identification & data aggregation and consolidation
    // Unified user identification; data aggregation and consolidation
    private def graphxAnalysis(rdd: RDD[(List[String], List[(String, Double)])]): RDD[(List[String], List[(String, Double)])] = {
      // 1 define vertices (data structure: (Long, "")); each element of the List is a vertex, and the List itself is also a vertex
      val dotsRDD: RDD[(String, List[String])] = rdd.flatMap { case (lst1, _) => lst1.map(elem => (elem, lst1)) }
      val vertexes: RDD[(Long, String)] = dotsRDD.map { case (id, ids) => (id.hashCode.toLong, "") }

      // 2 define edges (data structure: Edge(Long, Long, 0))
      val edges: RDD[Edge[Int]] = dotsRDD.map { case (id, ids) => Edge(id.hashCode.toLong, ids.mkString.hashCode.toLong, 0) }

      // 3 build the graph
      val graph = Graph(vertexes, edges)

      // 4 connected components (each component corresponds to one user)
      val idRDD: VertexRDD[VertexId] = graph.connectedComponents()
        .vertices

      // 5 define the data (ids and tags), keyed the same way as the List vertex
      val idsRDD: RDD[(VertexId, (List[String], List[(String, Double)]))] =
        rdd.map { case (ids, tags) => (ids.mkString.hashCode.toLong, (ids, tags)) }

      // 6 join the results of step 4 with step 5, so that all data of one user falls into one component
      val joinRDD: RDD[(VertexId, (List[String], List[(String, Double)]))] = idRDD.join(idsRDD)
        .map { case (key, value) => value }

      // 7 aggregate the data (all data of the same user are put together)
      val aggRDD: RDD[(VertexId, (List[String], List[(String, Double)]))] =
        joinRDD.reduceByKey { case ((bufferIds, bufferTags), (ids, tags)) => (bufferIds ++ ids, bufferTags ++ tags) }

      // 8 consolidate the data (deduplicate the ids and merge the tag weights)
      val resultRDD: RDD[(List[String], List[(String, Double)])] = aggRDD.map { case (key, (ids, tags)) =>
        val newTags = tags.groupBy(x => x._1)
          .mapValues(lst => lst.map { case (word, weight) => weight }.sum)
          .toList
        (ids.distinct, newTags)
      }
      resultRDD
    }
6.7 label landing
When saving the data to kudu, note:
1. A new table is saved every day (it needs to be created). Table name: usertags_<date of the day>
2. Data type conversion: RDD[(List[String], List[(String, Double)])] => RDD[(String, String)] => DataFrame
- Convert List[String] to String; pay attention to the choice of separator
- Convert List[(String, Double)] to String; pay attention to the choice of separator
- Separator: it must not clash with symbols in the data, and it must be possible to add and strip it cleanly.
    // 3 data landing (kudu)
    // Convert the List data types to String
    import spark.implicits._
    val resultDF = mergeRDD.map { case (ids, tags) =>
      (ids.mkString("||"), tags.map { case (key, value) => s"$key->$value" }.mkString("||"))
    }.toDF("ids", "tags")

    DBUtils.appendData(kudu, resultDF, sinkTableName, keys)
6.8 tag processor
    package cn.itcast.dmp.tags

    import cn.itcast.dmp.Processor
    import cn.itcast.dmp.utils.ConfigHolder
    import org.apache.commons.lang3.StringUtils
    import org.apache.kudu.spark.kudu.{KuduContext, KuduDataFrameReader}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SparkSession}

    object TagProcessor extends Processor {
      override def process(spark: SparkSession, kudu: KuduContext): Unit = {
        // Define parameters
        val sourceTableName = ConfigHolder.ADMainTableName
        val sinkTableName = ""
        val keys = ""
        val dicAppPath = ConfigHolder.appNameDic
        val dicDevicePath = ConfigHolder.deviceDic
        val tradingAreaTableName = ConfigHolder.tradingAreaTableName
        val filterSQL = ConfigHolder.filterSQL
        val idFields = ConfigHolder.idFields

        // 1 read data
        val sc = spark.sparkContext
        val sourceDF = spark.read
          .option("kudu.master", ConfigHolder.kuduMaster)
          .option("kudu.table", sourceTableName)
          .kudu

        // Read app dictionary information
        val appDic = sc.textFile(dicAppPath)
          .map(line => {
            val arr = line.split("##")
            (arr(0), arr(1))
          })
          .collect()
          .toMap
        val appDicBC = sc.broadcast(appDic)

        // Read device dictionary information
        val deviceDic = sc.textFile(dicDevicePath)
          .map(line => {
            val arr = line.split("##")
            (arr(0), arr(1))
          })
          .collect()
          .toMap
        val deviceDicBC = sc.broadcast(deviceDic)

        // Read business district information (read; filter; convert to rdd; extract data; collect to driver; convert to map)
        // Restriction: the business district table should not be too large (the filtered size should be less than 20M)
        val tradingAreaDic: Map[String, String] = spark.read
          .option("kudu.master", ConfigHolder.kuduMaster)
          .option("kudu.table", tradingAreaTableName)
          .kudu
          .filter("areas!=''")
          .rdd
          .map { case Row(geohash: String, areas: String) => (geohash, areas) }
          .collect()
          .toMap
        val tradingAreaBC = sc.broadcast(tradingAreaDic)

        // 2 process data
        // Filter out rows whose 15 identification fields are all empty
        val userTagsRDD: RDD[(List[String], List[(String, Double)])] = sourceDF.filter(filterSQL)
          .rdd
          .map(row => {
            // Advertising type, channel and App name
            val adTags = AdTypeTag.make(row)
            val channelTags = ChannelTag.make(row)
            val appNameTags = AppNameTag.make(row, appDicBC.value)
            // Gender, geographical location, device
            val sexTags = SexTag.make(row)
            val geoTags = GeoTag.make(row)
            val deviceTags = DeviceTag.make(row, deviceDicBC.value)
            // Keywords, age, business district
            val keywordTags = KeywordsTag.make(row)
            val ageTags = AgeTag.make(row)
            val tradingAreaTags = tradingAreaTag.make(row, tradingAreaBC.value)
            // Merge the tags
            val tags = adTags ++ channelTags ++ appNameTags ++ sexTags ++ geoTags ++
              deviceTags ++ keywordTags ++ ageTags ++ tradingAreaTags

            // Extract the user IDs
            val userIds: List[String] = idFields.split(",")
              .map(field => (field, row.getAs[String](field)))
              .filter { case (key, value) => StringUtils.isNotBlank(value) }
              .map { case (key, value) => s"$key::$value" }
              .toList

            // Return (ids, tags)
            (userIds, tags)
          })
        userTagsRDD.foreach(println)

        // 3 unify the user identification and merge the data
        val mergeRDD: RDD[(List[String], List[(String, Double)])] = graphxAnalysis(userTagsRDD)

        // 4 data landing (kudu)
        // Convert the List data types to String
        import spark.implicits._
        val resultDF = mergeRDD.map { case (ids, tags) =>
          (ids.mkString("|||"), tags.map { case (key, value) => s"$key->$value" }.mkString("|||"))
        }.toDF("ids", "tags")

        DBUtils.createTableAndsaveData(kudu, resultDF, sinkTableName, keys)

        // close resources
        sc.stop()
      }
    }
7. Spark GraphX
7.1 Basic concepts of graph computing
Graph is a mathematical structure used to represent the model relationship between objects. A graph consists of vertices and edges connecting vertices. Vertices are objects and edges are relationships between objects.
A directed graph is a graph in which the edges between vertices have a direction. An example of a directed graph is followers on Twitter: user Bob follows user Carol, but Carol does not follow Bob.
A graph is a graph that forms the relationship between different objects through points (objects) and edges (paths)
7.2 Application scenarios of graph computing
1) Shortest path
Shortest path: in social networks there is the "six degrees of separation" theory, which says that there are no more than five people between you and any stranger, that is, you can reach any stranger through at most five intermediaries. This is also a graph algorithm: the shortest path between any two people is less than or equal to 6.
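A minimal GraphX sketch (not from the original project) of the shortest-path idea; the tiny "who knows whom" graph below is made up, and ShortestPaths.run returns, for every vertex, the hop distance to the given landmark vertices:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.graphx.lib.ShortestPaths
import org.apache.spark.{SparkConf, SparkContext}

object ShortestPathDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ShortestPathDemo").setMaster("local[*]"))
    sc.setLogLevel("ERROR")
    // A small "who knows whom" graph: 1 -> 2 -> 3 -> 4 -> 5
    val edges = sc.makeRDD(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0), Edge(3L, 4L, 0), Edge(4L, 5L, 0)))
    val graph = Graph.fromEdges(edges, defaultValue = 0)
    // Hop count from every vertex to landmark vertex 5 (following edge direction)
    ShortestPaths.run(graph, Seq(5L)).vertices.collect().foreach(println)
    sc.stop()
  }
}
```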
2) Community discovery
Community discovery is used to find the number of triangles (circles) in a social network. It can reveal which circles are more stable and closer, and is used to measure the tightness of community coupling. The more triangles in a person's social circle, the more stable and close his social relationships are. For social networking sites such as Facebook and Twitter, community discovery is a commonly used social analysis algorithm.
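A minimal GraphX sketch (not from the original project) of counting triangles; the small friendship graph below is made up, and triangleCount returns, for each vertex, the number of triangles it participates in:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object TriangleDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TriangleDemo").setMaster("local[*]"))
    sc.setLogLevel("ERROR")
    // Friendships: vertices 1, 2, 3 form a triangle; vertex 4 only connects to 1
    val edges = sc.makeRDD(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0), Edge(1L, 3L, 0), Edge(1L, 4L, 0)))
    val graph = Graph.fromEdges(edges, defaultValue = 0)
    // Number of triangles each vertex belongs to: 1, 2, 3 -> 1; 4 -> 0
    graph.triangleCount().vertices.collect().foreach(println)
    sc.stop()
  }
}
```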
3) Recommended algorithm (ALS)
ALS is a matrix factorization algorithm. For example, if a shopping website wants to recommend goods to users, it needs to know which users are interested in which goods. A matrix can be built with ALS: in this matrix, goods purchased by a user are 1 and goods not purchased are 0; the task is then to work out which zeros are likely to become 1.
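A minimal Spark MLlib sketch (not from the original project) of the ALS idea just described; the user/item ratings are made-up implicit-feedback data where 1.0 means "purchased", and the column names and parameter values are illustrative choices:

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object AlsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AlsDemo").master("local[*]").getOrCreate()
    import spark.implicits._
    // (userId, itemId, rating): 1.0 = purchased; the missing pairs are the "zeros" ALS scores
    val ratings = Seq((1, 10, 1.0), (1, 11, 1.0), (2, 10, 1.0), (2, 12, 1.0), (3, 11, 1.0))
      .toDF("userId", "itemId", "rating")
    val model = new ALS()
      .setImplicitPrefs(true) // treat ratings as implicit feedback (purchase / no purchase)
      .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
      .setRank(5).setMaxIter(5).setRegParam(0.1)
      .fit(ratings)
    // Recommend 2 items per user: high scores mark the "zeros likely to become ones"
    model.recommendForAllUsers(2).show(truncate = false)
    spark.stop()
  }
}
```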
GraphX extends the Spark RDD with a resilient distributed property graph.
In graph computing, the basic data structures are generally expressed as follows (a minimal construction sketch follows this list):
- Graph = (Vertex,Edge)
- Vertex (vertex / node) (VertexId: Long, info: Any)
- Edge Edge(srcId: VertexId, dstId: VertexId, attr) [attr weight]
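A minimal construction sketch of a property graph from a vertex RDD and an edge RDD; the ids, names and the single weighted edge are made up for illustration.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object GraphBuildDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GraphBuildDemo").setMaster("local[*]"))
    sc.setLogLevel("ERROR")

    // Vertex: (VertexId, info)
    val vertices: RDD[(VertexId, String)] = sc.makeRDD(Seq((1L, "Bob"), (2L, "Carol")))
    // Edge: Edge(srcId, dstId, attr) -- attr can carry a weight
    val edges: RDD[Edge[Double]] = sc.makeRDD(Seq(Edge(1L, 2L, 1.0)))

    val graph: Graph[String, Double] = Graph(vertices, edges)
    println(s"vertices=${graph.numVertices}, edges=${graph.numEdges}")

    sc.stop()
  }
}
```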
7.3 Spark GraphX example 1 (connected components)
ID | Keyword | AppName |
---|---|---|
1 | Carola | Group car |
2 | Bali, Indonesia | Where to travel |
3 | Master Shandao | Know |
4 | The king's woman, the beauty has no tears | Youku |
5 | World Cup | Sohu |
6 | Carina Lau, Hong Kong and Taiwan Entertainment | Phoenix net |
7 | Japan and South Korea Entertainment | Pepper direct seeding |
9 | AK47 | Jedi survival: stimulating the battlefield |
10 | Funny | YY live broadcast |
11 | Literature, current politics | Know |
ID | IDS |
---|---|
1 | 43125 |
2 | 43125 |
3 | 43125 |
4 | 43125 |
5 | 43125 |
4 | 4567 |
5 | 4567 |
6 | 4567 |
7 | 4567 |
9 | 91011 |
10 | 91011 |
11 | 91011 |
Connected Components algorithm:
1. Define vertex
2. Define edge
3. Generate graph
4. Label each connected component; the smallest vertex id in a component is used as the id of that component
Task:
- Define vertex
- Define edge
- Generate graph
- Compute the connected components
```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object GraphXDemo1 {
  def main(args: Array[String]): Unit = {
    // 1. Initialize the SparkContext
    val conf = new SparkConf()
      .setAppName("GraphXDemo1")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")

    // 2. Define vertices: (VertexId: Long, info: Any)
    val vertexes: RDD[(VertexId, Map[String, Double])] = sc.makeRDD(List(
      (1L, Map("keyword:Carola" -> 1.0, "AppName:Group car" -> 1.0)),
      (2L, Map("keyword:Indonesia" -> 1.0, "keyword:Bali" -> 1.0, "AppName:Where to travel" -> 1.0)),
      (3L, Map("keyword:Master Shandao" -> 1.0, "AppName:Know" -> 1.0)),
      (4L, Map("keyword:The king's Woman" -> 1.0, "keyword:Beauty without tears" -> 1.0, "AppName:Youku" -> 1.0)),
      (5L, Map("keyword:World Cup" -> 1.0, "AppName:Sohu" -> 1.0)),
      (6L, Map("keyword:Carina Lau" -> 1.0, "keyword:Hong Kong and Taiwan Entertainment" -> 1.0, "AppName:Phoenix net" -> 1.0)),
      (7L, Map("keyword:Japan and South Korea Entertainment" -> 1.0, "AppName:Pepper direct seeding" -> 1.0)),
      (9L, Map("keyword:AK47" -> 1.0, "AppName:Jedi survival:Stimulate the battlefield" -> 1.0)),
      (10L, Map("keyword:Funny" -> 1.0, "AppName:YY live broadcast" -> 1.0)),
      (11L, Map("keyword:literature" -> 1.0, "keyword:Current politics" -> 1.0, "AppName:Know" -> 1.0))
    ))

    // 3. Define edges: Edge(srcId: VertexId, dstId: VertexId, attr)
    val edges: RDD[Edge[Int]] = sc.makeRDD(List(
      Edge(1L, 43125L, 0), Edge(2L, 43125L, 0), Edge(3L, 43125L, 0),
      Edge(4L, 43125L, 0), Edge(5L, 43125L, 0),
      Edge(4L, 4567L, 0), Edge(5L, 4567L, 0), Edge(6L, 4567L, 0), Edge(7L, 4567L, 0),
      Edge(9L, 91011L, 0), Edge(10L, 91011L, 0), Edge(11L, 91011L, 0)
    ))

    // 4. Build the graph and compute the connected components
    Graph(vertexes, edges)
      .connectedComponents()
      .vertices
      .sortBy(_._2)
      .collect()
      .foreach(println)

    // 5. Release resources
    sc.stop()
  }
}
```
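If the example is run as written, `connectedComponents` should label every vertex with the smallest vertex id in its component: vertices 1-7 (together with the aggregation vertices 43125 and 4567) all end up labelled 1, while 9, 10, 11 and 91011 are labelled 9. In other words, rows 1-7 belong to one user and rows 9-11 to another.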
7.4 Spark GraphX example 2 (user identification & data merging)
Building on the previous example, we already know how to identify users by rule; the question now is how the program merges the data of the same user.
Definition of data:
remarks:
1. The data format defined here is completely consistent with the data format in our program
2. Each RDD element is a tuple: the first element holds the user's various IDs, and the second element holds the user's tags
Task:
1. How many users do the six pieces of data represent?
2. Merge data of the same user
```scala
val dataRDD: RDD[(List[String], List[(String, Double)])] = sc.makeRDD(List(
  (List("a1", "b1", "c1"), List("keyword$Beijing" -> 1.0, "keyword$Shanghai" -> 1.0, "area$Zhongguancun" -> 1.0)),
  (List("b1", "c2", "d1"), List("keyword$Shanghai" -> 1.0, "keyword$Tianjin" -> 1.0, "area$Hui Longguan" -> 1.0)),
  (List("d1"), List("keyword$Tianjin" -> 1.0, "area$Zhongguancun" -> 1.0)),
  (List("a2", "b2", "c3"), List("keyword$big data" -> 1.0, "keyword$spark" -> 1.0, "area$Xierqi" -> 1.0)),
  (List("b2", "c4", "d4"), List("keyword$spark" -> 1.0, "area$Wudaokou" -> 1.0)),
  (List("c3", "e3"), List("keyword$hive" -> 1.0, "keyword$spark" -> 1.0, "area$Xierqi" -> 1.0))
))
```
Complete processing steps:
- Define vertex
- Define edge
- Generate graph
- Find the connected components
- Find data to merge
- Data aggregation
- Data merging
Handler:
```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object GraphXDemo {
  def main(args: Array[String]): Unit = {
    // Initialization
    val conf: SparkConf = new SparkConf().setAppName("GraphXDemo").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")

    // Define the data
    val dataRDD: RDD[(List[String], List[(String, Double)])] = sc.makeRDD(List(
      (List("a1", "b1", "c1"), List("kw$Beijing" -> 1.0, "kw$Shanghai" -> 1.0, "area$Zhongguancun" -> 1.0)),
      (List("b1", "c2", "d1"), List("kw$Shanghai" -> 1.0, "kw$Tianjin" -> 1.0, "area$Hui Longguan" -> 1.0)),
      (List("d1"), List("kw$Tianjin" -> 1.0, "area$Zhongguancun" -> 1.0)),
      (List("a2", "b2", "c3"), List("kw$big data" -> 1.0, "kw$spark" -> 1.0, "area$Xierqi" -> 1.0)),
      (List("b2", "c4", "d4"), List("kw$spark" -> 1.0, "area$Wudaokou" -> 1.0)),
      (List("c3", "e3"), List("kw$hive" -> 1.0, "kw$spark" -> 1.0, "area$Xierqi" -> 1.0))
    ))

    // Illustration of flattening: each id paired with all ids and tags of its record (not used further)
    val value: RDD[(String, List[String], List[(String, Double)])] = dataRDD.flatMap {
      case (allIds: List[String], tags: List[(String, Double)]) =>
        allIds.map(elem => (elem, allIds, tags))
    }

    // 1 Extract every element of the identification list as an id
    // Note 1. flatMap is used to flatten the elements
    // Note 2. The tag information is dropped here, because this RDD is only used to build vertices and edges
    // Note 3. Vertex and edge ids must be Long, so the strings are hashed to Long
    val dotRDD: RDD[(VertexId, VertexId)] = dataRDD.flatMap { case (allids, tags) =>
      // Method 1: easy to understand, clumsy to write
      // for (id <- allids) yield {
      //   (id.hashCode.toLong, allids.mkString.hashCode.toLong)
      // }
      // Method 2: equivalent and more concise
      allids.map(id => (id.hashCode.toLong, allids.mkString.hashCode.toLong))
    }

    // 2 Define the vertices
    val vertexesRDD: RDD[(VertexId, String)] = dotRDD.map { case (id, ids) => (id, "") }

    // 3 Define the edges (id: a single identifier; ids: the hash of all identifiers of the record)
    val edgesRDD: RDD[Edge[Int]] = dotRDD.map { case (id, ids) => Edge(id, ids, 0) }

    // 4 Build the graph
    val graph = Graph(vertexesRDD, edgesRDD)

    // 5 Find the connected components
    val connectRDD: VertexRDD[VertexId] = graph.connectedComponents().vertices

    // 6 Define the data of the "centre" vertex of each record
    val centerVertexRDD: RDD[(VertexId, (List[String], List[(String, Double)]))] = dataRDD.map {
      case (allids, tags) => (allids.mkString.hashCode.toLong, (allids, tags))
    }

    // 7 Join the results of steps 5 and 6 to obtain the data to be merged
    val allInfoRDD = connectRDD.join(centerVertexRDD)
      .map { case (id1, (id2, (allIds, tags))) => (id2, (allIds, tags)) }

    // 8 Aggregate the data (put the ids and tags of the same user together)
    val mergeInfoRDD: RDD[(VertexId, (List[String], List[(String, Double)]))] = allInfoRDD.reduceByKey {
      case ((bufferList, bufferTags), (allIds, tags)) =>
        val newList = bufferList ++ allIds
        // Concatenate the tag lists; the weights are summed in the next step
        val newTags = bufferTags ++ tags
        (newList, newTags)
    }

    // 9 Merge the data (allIds: deduplicate; tags: sum the weights)
    val resultRDD: RDD[(List[String], Map[String, Double])] = mergeInfoRDD.map { case (key, (allIds, tags)) =>
      val newIds = allIds.distinct
      // Group by tag key, then sum the weights of each group
      val newTags = tags.groupBy(x => x._1).mapValues(lst => lst.map(x => x._2).sum)
      (newIds, newTags)
    }
    resultRDD.foreach(println)

    sc.stop()
  }

  // A small local test of the tag-merging logic:
  // def main(args: Array[String]): Unit = {
  //   val lst = List(
  //     ("kw$big data", 1.0),
  //     ("kw$spark", 1.0),
  //     ("area$Xierqi", 1.0),
  //     ("kw$spark", 1.0),
  //     ("area$Wudaokou", 1.0),
  //     ("kw$hive", 1.0),
  //     ("kw$spark", 1.0),
  //     ("area$Xierqi", 1.0)
  //   )
  //   lst.groupBy(x => x._1).map { case (key, value) => (key, value.map(x => x._2).sum) }.foreach(println)
  //   println("************************************************************")
  //   lst.groupBy(x => x._1).mapValues(lst => lst.map(x => x._2).sum).foreach(println)
  //   println("************************************************************")
  // }
}
```
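For the sample data, the six records should merge into two users: one identified by {a1, b1, c1, c2, d1} (linked through the shared ids b1 and d1) and one identified by {a2, b2, c3, c4, d4, e3} (linked through b2 and c3). Within each merged user the tag weights are summed, so, for example, the second user's "kw$spark" tag should accumulate to 3.0.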