We mainly use Spark to process large volumes of data quickly. So which data sources will we encounter in real development and production? I summarize them as follows:
- text
- csv
- json
- parquet
- jdbc
- hive
- kafka
- elasticsearch
All of the following tests run in Spark local mode, because local mode is easy to test and does not depend on a Spark cluster. One thing to note: when running the code on a Spark cluster, the .master("local[*]") line should be removed, and paths that point at local files need to be adjusted. Taking /tmp/people.txt as an example:
- Local mode: /tmp/people.txt
- Cluster mode: file:///tmp/people.txt, equivalent to /tmp/people.txt in local mode (a file on each node's local filesystem)
- Cluster mode: hdfs://master:8020/tmp/people.txt, a file on the distributed file system (HDFS)
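As a rough sketch of the difference (the paths simply reuse the /tmp/people.txt example above), only the builder line and the path scheme change between the two modes:

```python
from pyspark.sql import SparkSession

# Local mode: keep .master("local[*]") and use a plain local path.
spark = SparkSession.builder.appName("pathDemo").master("local[*]").getOrCreate()
localdf = spark.read.text("/tmp/people.txt")
localdf.show()

# Cluster mode (submitted with spark-submit): drop .master("local[*]") above and
# make the scheme explicit -- file:// for a file present on every node,
# hdfs:// for a file stored on HDFS (host/port as in the example above).
# clusterdf = spark.read.text("hdfs://master:8020/tmp/people.txt")

spark.stop()
```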
Before exploring the various data sources, we should first look at the most basic one: an in-memory dataset, i.e. data we construct ourselves as needed. It is often used to develop and test simple functionality.
Let's write code that builds a dataset, turns it into a DataFrame, and displays it:
```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadDatas")\
        .master("local[*]")\
        .enableHiveSupport()\
        .getOrCreate()
    datas = [('Jack', 27), ('Rose', 24), ('Andy', 32)]
    df = spark.createDataFrame(datas, ['name', 'age'])
    df.show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+
    spark.stop()
```
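If you want to control the column types instead of letting Spark infer them from the Python values, you can pass an explicit StructType. Here is a minimal sketch of that variant (the app name loadDatasSchema is only illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("loadDatasSchema").master("local[*]").getOrCreate()

# Explicit column types instead of letting Spark infer them.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.createDataFrame([('Jack', 27), ('Rose', 24), ('Andy', 32)], schema)
df.printSchema()
df.show()
spark.stop()
```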
text
The content of the data source people.txt is:
Jack 27
Rose 24
Andy 32
Write code to load people.txt and query it with SQL:
```python
from pyspark.sql import SparkSession
from pyspark.sql import Row

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadTextData")\
        .master("local[*]")\
        .getOrCreate()
    lines = spark.sparkContext.textFile("/home/llh/data/people.txt")
    parts = lines.map(lambda line: line.split(" "))
    # Cast age to int so it is stored as a numeric column rather than a string
    people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
    peopledf = spark.createDataFrame(people)
    peopledf.show()
    # +---+----+
    # |age|name|
    # +---+----+
    # | 27|Jack|
    # | 24|Rose|
    # | 32|Andy|
    # +---+----+
    peopledf.createOrReplaceTempView("people")
    namedf = spark.sql("select name from people where age < 30")
    namedf.show()
    # +----+
    # |name|
    # +----+
    # |Jack|
    # |Rose|
    # +----+
    spark.stop()
```
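For comparison, the same file can also be loaded without dropping to the RDD API: spark.read.text puts each line into a single value column, which can then be split with DataFrame functions. A minimal sketch, assuming the same /home/llh/data/people.txt layout:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.appName("loadTextData2").master("local[*]").getOrCreate()

# spark.read.text loads each line into a single string column named "value".
raw = spark.read.text("/home/llh/data/people.txt")

# Split each line on the space separator and cast age to int.
peopledf = raw.select(
    split(col("value"), " ").getItem(0).alias("name"),
    split(col("value"), " ").getItem(1).cast("int").alias("age"),
)
peopledf.show()
spark.stop()
```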
csv
The content of the data source people.csv is:
Jack,27
Rose,24
Andy,32
Write code to load the CSV data and display it:
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pandas as pd

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadCsvData")\
        .master("local[*]")\
        .getOrCreate()
    # Way 1: define the column names and types explicitly, as in the text example
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)
    ])
    peopledf = spark.read.csv("/home/llh/data/people.csv", schema=schema)
    peopledf.show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+
    # Way 2: read with pandas; Spark is not involved here
    data = pd.read_csv("/home/llh/data/people.csv", names=['name', 'age'])
    print(data.head())
    #    name  age
    # 0  Jack   27
    # 1  Rose   24
    # 2  Andy   32
    spark.stop()
```
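Spark's CSV reader also supports options such as inferSchema and header. A minimal sketch follows; people_with_header.csv in the comment is a hypothetical file, since our people.csv has no header row:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("loadCsvOptions").master("local[*]").getOrCreate()

# Without an explicit schema, Spark names the columns _c0, _c1, ...
# inferSchema=True samples the file to guess the types,
# and toDF() renames the columns afterwards.
peopledf = spark.read.csv("/home/llh/data/people.csv", inferSchema=True).toDF("name", "age")
peopledf.printSchema()
peopledf.show()

# If the file had a header row, header=True would use it for column names instead:
# peopledf = spark.read.csv("/home/llh/data/people_with_header.csv", header=True, inferSchema=True)
spark.stop()
```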
json
The data source people.json is:
{"name":"Jack", "age":27}
{"name":"Rose", "age":24}
{"name":"Andy"}
Write code to load the JSON data and explore it through the DataFrame API:
```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadJsonData")\
        .master("local[*]")\
        .getOrCreate()
    peopledf = spark.read.json("/home/llh/data/people.json")
    peopledf.show()
    # +----+----+
    # | age|name|
    # +----+----+
    # |  27|Jack|
    # |  24|Rose|
    # |null|Andy|
    # +----+----+
    peopledf.printSchema()
    # root
    #  |-- age: long (nullable = true)
    #  |-- name: string (nullable = true)
    peopledf.select('name').show()
    # +----+
    # |name|
    # +----+
    # |Jack|
    # |Rose|
    # |Andy|
    # +----+
    peopledf.select(peopledf['name'], peopledf['age'] + 1).show()
    # +----+---------+
    # |name|(age + 1)|
    # +----+---------+
    # |Jack|       28|
    # |Rose|       25|
    # |Andy|     null|
    # +----+---------+
    peopledf.filter(peopledf['age'] > 25).show()
    # +---+----+
    # |age|name|
    # +---+----+
    # | 27|Jack|
    # +---+----+
    peopledf.groupBy("age").count().show()
    # +----+-----+
    # | age|count|
    # +----+-----+
    # |null|    1|
    # |  27|    1|
    # |  24|    1|
    # +----+-----+
    spark.stop()
```
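Going the other way, a DataFrame can be written back out as JSON with df.write.json. A minimal sketch, with the output directory chosen only for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("writeJsonData").master("local[*]").getOrCreate()

peopledf = spark.read.json("/home/llh/data/people.json")

# Writes one JSON file per partition into the target directory;
# mode="overwrite" replaces the directory if it already exists.
# The output path is only an example.
peopledf.write.mode("overwrite").json("/home/llh/data/people_out_json")
spark.stop()
```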
parquet
Data in this format is generally stored on HDFS; opening it with an ordinary editor only shows what looks like garbled binary content.
Write code to load the Parquet data and display it:
```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadParquetData")\
        .master("local[*]")\
        .getOrCreate()
    peopledf = spark.read.parquet("/home/llh/data/people.parquet")
    peopledf.createOrReplaceTempView("people")
    namedf = spark.sql("select name from people where age < 30")
    namedf.show()
    # +----+
    # |name|
    # +----+
    # |Jack|
    # |Rose|
    # +----+
    spark.stop()
```
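If you do not already have a Parquet file to test with, one way to produce it is to write the small dataset from earlier with df.write.parquet. A minimal sketch, with the output path matching the read example above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("writeParquetData").master("local[*]").getOrCreate()

# Build the same small dataset and save it in Parquet format.
df = spark.createDataFrame([('Jack', 27), ('Rose', 24), ('Andy', 32)], ['name', 'age'])

# Parquet is a binary columnar format, which is why an ordinary editor
# cannot display it; mode="overwrite" replaces any previous output.
df.write.mode("overwrite").parquet("/home/llh/data/people.parquet")
spark.stop()
```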
jdbc
JDBC covers MySQL, Oracle, TiDB, and so on. Let's take MySQL as an example: the database is test and the table is people.
Write code to load the MySQL table and display it:
```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadJdbcData")\
        .master("local[*]")\
        .getOrCreate()
    peopledf = spark.read\
        .format("jdbc")\
        .option("url", "jdbc:mysql://localhost:3306/test")\
        .option("driver", "com.mysql.jdbc.Driver")\
        .option("dbtable", "(select * from people) tmp")\
        .option("user", "root")\
        .option("password", "1")\
        .load()
    peopledf.show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+
    spark.stop()
```
If the MySQL driver cannot be found at runtime (java.lang.ClassNotFoundException: com.mysql.jdbc.Driver), the solution is to download the MySQL JDBC driver and put the jar under pyspark's jars directory, which by default is something like /usr/local/lib/python3.7/site-packages/pyspark/jars/.
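An alternative to copying the jar into that directory is to point Spark at the driver jar explicitly via the spark.jars configuration (or --jars with spark-submit). A minimal sketch, where the jar path and version are only an example:

```python
from pyspark.sql import SparkSession

# The connector jar path below is only an example; adjust it to wherever
# the downloaded MySQL JDBC driver actually lives on your machine.
spark = SparkSession\
    .builder\
    .appName("loadJdbcData")\
    .master("local[*]")\
    .config("spark.jars", "/home/llh/jars/mysql-connector-java-5.1.47.jar")\
    .getOrCreate()
```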
hive
Hive stores its data files with a special field separator, the control character "^A" (\001). Spark is usually configured with the Hive metastore information, so it can read Hive tables directly.
Write code to load people.hive into the people table and display it:
```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadHiveData")\
        .master("local[*]")\
        .enableHiveSupport()\
        .getOrCreate()
    spark.sql("create table if not exists people (name string, age int) using hive")
    spark.sql("load data local inpath '/home/llh/data/people.hive' into table people")
    spark.sql("select * from people").show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+
    spark.stop()
```
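Writing back to Hive works through the same session. A minimal sketch using saveAsTable, where the table name people_backup is only an example:

```python
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("writeHiveData")\
    .master("local[*]")\
    .enableHiveSupport()\
    .getOrCreate()

# Read the Hive table created above as a DataFrame.
df = spark.table("people")

# saveAsTable registers the result as a managed table in the Hive metastore;
# "people_backup" is only an example name.
df.write.mode("overwrite").saveAsTable("people_backup")
spark.stop()
```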
kafka
Kafka combined with Spark is commonly used in real-time projects, i.e. Spark Streaming / Structured Streaming, which will be covered separately later.
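As a preview, reading a Kafka topic with Structured Streaming looks roughly like the sketch below; the broker address, topic name, and connector package version are assumptions and must match your own setup:

```python
from pyspark.sql import SparkSession

# Requires the Kafka connector on the classpath, e.g. started with
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 ...
# (pick the artifact matching your Spark/Scala version).
spark = SparkSession.builder.appName("loadKafkaData").master("local[*]").getOrCreate()

# Broker address and topic name are assumptions for illustration.
kafkadf = spark.readStream\
    .format("kafka")\
    .option("kafka.bootstrap.servers", "localhost:9092")\
    .option("subscribe", "people")\
    .load()

# Kafka records arrive as binary key/value columns; cast value to a string
# and print each micro-batch to the console.
query = kafkadf.selectExpr("CAST(value AS STRING)")\
    .writeStream\
    .format("console")\
    .start()
query.awaitTermination()
```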
elasticsearch
Reading from Elasticsearch is similar to reading from MySQL and other databases.
Write code to load an ES index and display it:
```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("loadEsData")\
        .master("local[*]")\
        .enableHiveSupport()\
        .getOrCreate()
    # Requires the elasticsearch-spark (elasticsearch-hadoop) connector jar on the classpath
    peopledf = spark.read\
        .format("org.elasticsearch.spark.sql")\
        .option("es.nodes", "localhost")\
        .option("es.port", "9200")\
        .option("es.resource", "people/data")\
        .load()
    peopledf.createOrReplaceTempView("people")
    spark.sql("select * from people").show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+
    spark.stop()
```
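Writing to Elasticsearch uses the same connector. A minimal sketch that appends the small dataset to the people/data index from the read example (again, the connector jar must be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("writeEsData").master("local[*]").getOrCreate()

df = spark.createDataFrame([('Jack', 27), ('Rose', 24), ('Andy', 32)], ['name', 'age'])

# save() takes the target index/type, matching the read example above.
df.write\
    .format("org.elasticsearch.spark.sql")\
    .option("es.nodes", "localhost")\
    .option("es.port", "9200")\
    .mode("append")\
    .save("people/data")
spark.stop()
```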
These are the more commonly used data sources. Of course there are others, such as HBase, Phoenix, and so on, but mastering the ones above covers most needs.