We mainly use Spark to process large volumes of data quickly. So which data sources will we encounter in real development and production? I summarize them as follows:
- text
- csv
- json
- parquet
- jdbc
- hive
- kafka
- elasticsearch
All of the following tests run in Spark local mode, because local mode is easy to test and does not depend on a Spark cluster. One thing to note: when running the code on a Spark cluster, the .master("local[*]") line should be removed, and paths that point at local files need to be adjusted. Taking /tmp/people.txt as an example:
- Local mode: /tmp/people.txt
- Cluster mode: file:///tmp/people.txt, equivalent to /tmp/people.txt in local mode (a file on each node's local filesystem)
- Cluster mode: hdfs://master:8020/tmp/people.txt, a file on the distributed file system (HDFS)
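As a rough sketch of the difference (the paths simply reuse the /tmp/people.txt example above), only the builder line and the path scheme change between the two modes:

```python
from pyspark.sql import SparkSession

# Local mode: keep .master("local[*]") and use a plain local path.
spark = SparkSession.builder.appName("pathDemo").master("local[*]").getOrCreate()
localdf = spark.read.text("/tmp/people.txt")
localdf.show()

# Cluster mode (submitted with spark-submit): drop .master("local[*]") above and
# make the scheme explicit -- file:// for a file present on every node,
# hdfs:// for a file stored on HDFS (host/port as in the example above).
# clusterdf = spark.read.text("hdfs://master:8020/tmp/people.txt")

spark.stop()
```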
Before exploring the various data sources, we should first look at the most basic one: an in-memory dataset, i.e. data we construct ourselves as needed. It is often used to develop and test simple functionality.
Let's write code that builds a dataset, turns it into a DataFrame, and displays it:
```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadDatas")\
        .master("local[*]")\
        .enableHiveSupport()\
        .getOrCreate()
    datas = [('Jack', 27), ('Rose', 24), ('Andy', 32)]
    df = spark.createDataFrame(datas, ['name', 'age'])
    df.show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+
    spark.stop()
```
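If you want to control the column types instead of letting Spark infer them from the Python values, you can pass an explicit StructType. Here is a minimal sketch of that variant (the app name loadDatasSchema is only illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("loadDatasSchema").master("local[*]").getOrCreate()

# Explicit column types instead of letting Spark infer them.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.createDataFrame([('Jack', 27), ('Rose', 24), ('Andy', 32)], schema)
df.printSchema()
df.show()
spark.stop()
```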
text
The content of the data source people.txt is:
Jack 27
Rose 24
Andy 32
Write code to load people.txt and query it with SQL:
```python
from pyspark.sql import SparkSession
from pyspark.sql import Row

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadTextData")\
        .master("local[*]")\
        .getOrCreate()
    lines = spark.sparkContext.textFile("/home/llh/data/people.txt")
    parts = lines.map(lambda line: line.split(" "))
    # Cast age to int so it is stored as a numeric column rather than a string
    people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
    peopledf = spark.createDataFrame(people)
    peopledf.show()
    # +---+----+
    # |age|name|
    # +---+----+
    # | 27|Jack|
    # | 24|Rose|
    # | 32|Andy|
    # +---+----+
    peopledf.createOrReplaceTempView("people")
    namedf = spark.sql("select name from people where age < 30")
    namedf.show()
    # +----+
    # |name|
    # +----+
    # |Jack|
    # |Rose|
    # +----+
    spark.stop()
```
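For comparison, the same file can also be loaded without dropping to the RDD API: spark.read.text puts each line into a single value column, which can then be split with DataFrame functions. A minimal sketch, assuming the same /home/llh/data/people.txt layout:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.appName("loadTextData2").master("local[*]").getOrCreate()

# spark.read.text loads each line into a single string column named "value".
raw = spark.read.text("/home/llh/data/people.txt")

# Split each line on the space separator and cast age to int.
peopledf = raw.select(
    split(col("value"), " ").getItem(0).alias("name"),
    split(col("value"), " ").getItem(1).cast("int").alias("age"),
)
peopledf.show()
spark.stop()
```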
csv
The content of the data source people.csv is:
Jack,27
Rose,24
Andy,32
Write code to load the CSV data and display it:
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pandas as pd

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadCsvData")\
        .master("local[*]")\
        .getOrCreate()
    # Way 1: define the column names and types explicitly, as in the text example
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)
    ])
    peopledf = spark.read.csv("/home/llh/data/people.csv", schema=schema)
    peopledf.show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+
    # Way 2: read with pandas; Spark is not involved here
    data = pd.read_csv("/home/llh/data/people.csv", names=['name', 'age'])
    print(data.head())
    #    name  age
    # 0  Jack   27
    # 1  Rose   24
    # 2  Andy   32
    spark.stop()
```
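Spark's CSV reader also supports options such as inferSchema and header. A minimal sketch follows; people_with_header.csv in the comment is a hypothetical file, since our people.csv has no header row:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("loadCsvOptions").master("local[*]").getOrCreate()

# Without an explicit schema, Spark names the columns _c0, _c1, ...
# inferSchema=True samples the file to guess the types,
# and toDF() renames the columns afterwards.
peopledf = spark.read.csv("/home/llh/data/people.csv", inferSchema=True).toDF("name", "age")
peopledf.printSchema()
peopledf.show()

# If the file had a header row, header=True would use it for column names instead:
# peopledf = spark.read.csv("/home/llh/data/people_with_header.csv", header=True, inferSchema=True)
spark.stop()
```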
json
The data source people.json is:
{"name":"Jack", "age":27}
{"name":"Rose", "age":24}
{"name":"Andy"}
Write code to load the JSON data and explore it through the DataFrame API:
```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadJsonData")\
        .master("local[*]")\
        .getOrCreate()
    peopledf = spark.read.json("/home/llh/data/people.json")
    peopledf.show()
    # +----+----+
    # | age|name|
    # +----+----+
    # |  27|Jack|
    # |  24|Rose|
    # |null|Andy|
    # +----+----+
    peopledf.printSchema()
    # root
    #  |-- age: long (nullable = true)
    #  |-- name: string (nullable = true)
    peopledf.select('name').show()
    # +----+
    # |name|
    # +----+
    # |Jack|
    # |Rose|
    # |Andy|
    # +----+
    peopledf.select(peopledf['name'], peopledf['age'] + 1).show()
    # +----+---------+
    # |name|(age + 1)|
    # +----+---------+
    # |Jack|       28|
    # |Rose|       25|
    # |Andy|     null|
    # +----+---------+
    peopledf.filter(peopledf['age'] > 25).show()
    # +---+----+
    # |age|name|
    # +---+----+
    # | 27|Jack|
    # +---+----+
    peopledf.groupBy("age").count().show()
    # +----+-----+
    # | age|count|
    # +----+-----+
    # |null|    1|
    # |  27|    1|
    # |  24|    1|
    # +----+-----+
    spark.stop()
```
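Going the other way, a DataFrame can be written back out as JSON with df.write.json. A minimal sketch, with the output directory chosen only for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("writeJsonData").master("local[*]").getOrCreate()

peopledf = spark.read.json("/home/llh/data/people.json")

# Writes one JSON file per partition into the target directory;
# mode="overwrite" replaces the directory if it already exists.
# The output path is only an example.
peopledf.write.mode("overwrite").json("/home/llh/data/people_out_json")
spark.stop()
```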
parquet
Data in this format is generally stored on HDFS; opening it with an ordinary editor only shows what looks like garbled binary content.
Write code to load the Parquet data and display it:
```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadParquetData")\
        .master("local[*]")\
        .getOrCreate()
    peopledf = spark.read.parquet("/home/llh/data/people.parquet")
    peopledf.createOrReplaceTempView("people")
    namedf = spark.sql("select name from people where age < 30")
    namedf.show()
    # +----+
    # |name|
    # +----+
    # |Jack|
    # |Rose|
    # +----+
    spark.stop()
```
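If you do not already have a Parquet file to test with, one way to produce it is to write the small dataset from earlier with df.write.parquet. A minimal sketch, with the output path matching the read example above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("writeParquetData").master("local[*]").getOrCreate()

# Build the same small dataset and save it in Parquet format.
df = spark.createDataFrame([('Jack', 27), ('Rose', 24), ('Andy', 32)], ['name', 'age'])

# Parquet is a binary columnar format, which is why an ordinary editor
# cannot display it; mode="overwrite" replaces any previous output.
df.write.mode("overwrite").parquet("/home/llh/data/people.parquet")
spark.stop()
```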
jdbc
JDBC covers MySQL, Oracle, TiDB, and so on. Let's take MySQL as an example: the database is test and the table is people.
Write code to load the MySQL table and display it:
```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadJdbcData")\
        .master("local[*]")\
        .getOrCreate()
    peopledf = spark.read\
        .format("jdbc")\
        .option("url", "jdbc:mysql://localhost:3306/test")\
        .option("driver", "com.mysql.jdbc.Driver")\
        .option("dbtable", "(select * from people) tmp")\
        .option("user", "root")\
        .option("password", "1")\
        .load()
    peopledf.show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+
    spark.stop()
```
If the MySQL driver cannot be found at runtime (java.lang.ClassNotFoundException: com.mysql.jdbc.Driver), the solution is to download the MySQL JDBC driver and put the jar under pyspark's jars directory, which by default is something like /usr/local/lib/python3.7/site-packages/pyspark/jars/.
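An alternative to copying the jar into that directory is to point Spark at the driver jar explicitly via the spark.jars configuration (or --jars with spark-submit). A minimal sketch, where the jar path and version are only an example:

```python
from pyspark.sql import SparkSession

# The connector jar path below is only an example; adjust it to wherever
# the downloaded MySQL JDBC driver actually lives on your machine.
spark = SparkSession\
    .builder\
    .appName("loadJdbcData")\
    .master("local[*]")\
    .config("spark.jars", "/home/llh/jars/mysql-connector-java-5.1.47.jar")\
    .getOrCreate()
```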
hive
Hive stores its data files with a special field separator, the control character "^A" (\001). Spark is usually configured with the Hive metastore information, so it can read Hive tables directly.
Write code to load people.hive into the people table and display it:
```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadHiveData")\
        .master("local[*]")\
        .enableHiveSupport()\
        .getOrCreate()
    spark.sql("create table if not exists people (name string, age int) using hive")
    spark.sql("load data local inpath '/home/llh/data/people.hive' into table people")
    spark.sql("select * from people").show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+
    spark.stop()
```
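Writing back to Hive works through the same session. A minimal sketch using saveAsTable, where the table name people_backup is only an example:

```python
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("writeHiveData")\
    .master("local[*]")\
    .enableHiveSupport()\
    .getOrCreate()

# Read the Hive table created above as a DataFrame.
df = spark.table("people")

# saveAsTable registers the result as a managed table in the Hive metastore;
# "people_backup" is only an example name.
df.write.mode("overwrite").saveAsTable("people_backup")
spark.stop()
```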
kafka
Kafka combined with Spark is commonly used in real-time projects, i.e. Spark Streaming / Structured Streaming, which will be covered separately later.
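As a preview, reading a Kafka topic with Structured Streaming looks roughly like the sketch below; the broker address, topic name, and connector package version are assumptions and must match your own setup:

```python
from pyspark.sql import SparkSession

# Requires the Kafka connector on the classpath, e.g. started with
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 ...
# (pick the artifact matching your Spark/Scala version).
spark = SparkSession.builder.appName("loadKafkaData").master("local[*]").getOrCreate()

# Broker address and topic name are assumptions for illustration.
kafkadf = spark.readStream\
    .format("kafka")\
    .option("kafka.bootstrap.servers", "localhost:9092")\
    .option("subscribe", "people")\
    .load()

# Kafka records arrive as binary key/value columns; cast value to a string
# and print each micro-batch to the console.
query = kafkadf.selectExpr("CAST(value AS STRING)")\
    .writeStream\
    .format("console")\
    .start()
query.awaitTermination()
```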
elasticsearch
Reading from Elasticsearch is similar to reading from MySQL and other databases.
Write code to load an ES index and display it:
```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadEsData")\
        .master("local[*]")\
        .enableHiveSupport()\
        .getOrCreate()
    # Requires the elasticsearch-spark (elasticsearch-hadoop) connector jar on the classpath
    peopledf = spark.read\
        .format("org.elasticsearch.spark.sql")\
        .option("es.nodes", "localhost")\
        .option("es.port", "9200")\
        .option("es.resource", "people/data")\
        .load()
    peopledf.createOrReplaceTempView("people")
    spark.sql("select * from people").show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+
    spark.stop()
```
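Writing to Elasticsearch uses the same connector. A minimal sketch that appends the small dataset to the people/data index from the read example (again, the connector jar must be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("writeEsData").master("local[*]").getOrCreate()

df = spark.createDataFrame([('Jack', 27), ('Rose', 24), ('Andy', 32)], ['name', 'age'])

# save() takes the target index/type, matching the read example above.
df.write\
    .format("org.elasticsearch.spark.sql")\
    .option("es.nodes", "localhost")\
    .option("es.port", "9200")\
    .mode("append")\
    .save("people/data")
spark.stop()
```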
These are the more commonly used data sources. Of course there are others, such as HBase, Phoenix, and so on, but mastering the ones above covers most needs.