Spark SQL/DataFrame/Dataset operations ----- reading data

1, Read data source

(1) Read JSON with spark.read. Note: a path without a scheme is resolved against the default file system, which is HDFS on a typical Hadoop cluster. To read a local file, prefix the path with file://, as follows:

scala> val people = spark.read.format("json").load("file:///opt/software/data/people.json")
people: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
 
scala> people.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

spark.read.format("json").load("file:///opt/software/data/people.json")

is equivalent to

spark.read.json("file:///opt/software/data/people.json")

To read files in another format, change the argument passed to format, for example format("parquet").
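The equivalence above, and the switch to another format, can be sketched as follows. This is a minimal illustration rather than part of the original session: the temporary directory, the sample records and the local[*] master are assumptions made so that the snippet runs on its own.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder()
  .appName("read-demo")
  .master("local[*]") // assumption: run locally for the demo
  .getOrCreate()

// Hypothetical sample data written to a temporary directory (one JSON object per line).
val dir = Files.createTempDirectory("people-demo")
val jsonFile = dir.resolve("people.json")
val records = Seq(
  """{"name":"Michael"}""",
  """{"name":"Andy","age":30}""",
  """{"name":"Justin","age":19}"""
)
Files.write(jsonFile, records.mkString("\n").getBytes(StandardCharsets.UTF_8))

// Path.toUri yields a file:// URI, so the local file system is used explicitly.
val jsonUri = jsonFile.toUri.toString

// Generic form and shortcut form of the same read.
val viaFormat = spark.read.format("json").load(jsonUri)
val viaShortcut = spark.read.json(jsonUri)

// Another format: write the data as Parquet, then read it back via format("parquet").
val parquetUri = dir.resolve("people.parquet").toUri.toString
viaFormat.write.parquet(parquetUri)
val fromParquet = spark.read.format("parquet").load(parquetUri)

viaFormat.printSchema()
fromParquet.show()
```

Both reads should infer the same schema, since json(...) is a shortcut for format("json").load(...).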

(2) Read a Hive table with spark.sql. Here the Hive database is default (the name of the default database can be omitted from the query) and the table is people.

scala> val peopleDF=spark.sql("select * from default.people")
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1 more field]
 
scala> peopleDF.show
+--------+---+--------+
|    name|age| address|
+--------+---+--------+
|zhangsan| 22| chengdu|
|  wangwu| 33| beijing|
|    lisi| 28|shanghai|
+--------+---+--------+
 
 
scala> peopleDF.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- address: string (nullable = true)
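The point about omitting the database name can be sketched as below. To keep the snippet self-contained, it does not connect to a real Hive metastore: it saves a hypothetical table named people_demo into a temporary warehouse directory using Spark's built-in catalog. Against a real Hive installation, an application outside spark-shell would typically also call enableHiveSupport() on the builder, which requires the spark-hive module on the classpath.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway warehouse directory so the demo leaves nothing behind.
// The setting only takes effect if no session has been created yet.
val warehouse = Files.createTempDirectory("warehouse-demo").toUri.toString

val spark = SparkSession.builder()
  .appName("table-demo")
  .master("local[*]") // assumption: run locally for the demo
  .config("spark.sql.warehouse.dir", warehouse)
  // .enableHiveSupport() // needed for a real Hive metastore
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the Hive table: same columns as default.people above.
Seq(("zhangsan", 22, "chengdu"), ("wangwu", 33, "beijing"), ("lisi", 28, "shanghai"))
  .toDF("name", "age", "address")
  .write.mode("overwrite").saveAsTable("people_demo")

// Three ways to read the same table from the default database.
val qualified = spark.sql("select * from default.people_demo")
val unqualified = spark.sql("select * from people_demo") // database name omitted
val viaTable = spark.table("people_demo")                // no SQL string needed

qualified.show()
viaTable.printSchema()
```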

2, Get data column

There are three equivalent ways to select columns, as follows:

scala> peopleDF.select("name","age").show
+--------+---+
|    name|age|
+--------+---+
|zhangsan| 22|
|  wangwu| 33|
|    lisi| 28|
+--------+---+
 
scala> peopleDF.select($"name",$"age").show
+--------+---+
|    name|age|
+--------+---+
|zhangsan| 22|
|  wangwu| 33|
|    lisi| 28|
+--------+---+

scala> peopleDF.select(peopleDF.col("name"),peopleDF.col("age")).show
+--------+---+
|    name|age|
+--------+---+
|zhangsan| 22|
|  wangwu| 33|
|    lisi| 28|
+--------+---+

Note: if you write the code in IDEA and use $, you must add the statement import spark.implicits._ (where spark is your SparkSession); otherwise the $ expression will not compile. The Spark shell performs this import by default.

$"column name" is syntactic sugar that returns a Column object.
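For a standalone application, the import and the three selection styles can be sketched as follows. The session setup and the in-memory sample rows are assumptions for illustration; the last statement shows why a Column object is useful: it can be combined into expressions, which a plain string name cannot.

```scala
import org.apache.spark.sql.{Column, SparkSession}

val spark = SparkSession.builder()
  .appName("column-demo")
  .master("local[*]") // assumption: run locally for the demo
  .getOrCreate()

// Required outside spark-shell for the $"..." syntax and for toDF on local collections.
import spark.implicits._

// Hypothetical in-memory copy of the people table used above.
val peopleDF = Seq(("zhangsan", 22, "chengdu"), ("wangwu", 33, "beijing"), ("lisi", 28, "shanghai"))
  .toDF("name", "age", "address")

// The three selection styles from the text.
val byName = peopleDF.select("name", "age")
val byDollar = peopleDF.select($"name", $"age")
val byCol = peopleDF.select(peopleDF.col("name"), peopleDF.col("age"))

// $"name" is an ordinary Column value.
val nameCol: Column = $"name"

// Columns compose into expressions: filter on age, then compute a derived column.
val older = peopleDF
  .filter($"age" > 25)
  .select(nameCol, ($"age" + 1).as("age_next_year"))

older.show()
```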

Posted by Pie on Sun, 02 Feb 2020 08:18:33 -0800

Programmer Group
