1, Read a data source
(1) Read JSON with spark.read. Note: by default the path is resolved against HDFS. To read a file from the local file system, prefix the path with file://, as follows:
scala> val people = spark.read.format("json").load("file:///opt/software/data/people.json")
people: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> people.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
spark.read.format("json").load("file:///opt/software/data/people.json")
is equivalent to spark.read.json("file:///opt/software/data/people.json")
To read files in another format, change the argument passed to format("json"), for example format("parquet").
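The two equivalent JSON readers and the format switch can be sketched as a standalone program. This is a minimal sketch, not part of the original walkthrough: it assumes Spark SQL is on the classpath and runs in local mode, and the object name ReadJsonExample, the helper writeSampleJson, and the temporary file locations are all hypothetical. The sample records mirror the people.json output shown above.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

object ReadJsonExample {
  // Hypothetical helper: writes a small JSON Lines file to a temporary
  // local path and returns it as a file: URI.
  def writeSampleJson(): String = {
    val path = Files.createTempFile("people", ".json")
    val lines =
      """{"name":"Michael"}
        |{"name":"Andy","age":30}
        |{"name":"Justin","age":19}""".stripMargin
    Files.write(path, lines.getBytes(StandardCharsets.UTF_8))
    path.toUri.toString
  }

  def main(args: Array[String]): Unit = {
    // Outside spark-shell the session has to be created explicitly.
    val spark = SparkSession.builder()
      .appName("ReadJsonExample")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()

    val uri = writeSampleJson()

    // The generic reader and the json shortcut load the same data.
    val viaFormat = spark.read.format("json").load(uri)
    val viaShortcut = spark.read.json(uri)
    assert(viaFormat.schema == viaShortcut.schema)
    viaFormat.show()

    // Switching formats only changes the format name: write the data
    // as parquet to a fresh directory, then read it back.
    val parquetUri =
      Files.createTempDirectory("people-parquet").resolve("out").toUri.toString
    viaFormat.write.format("parquet").save(parquetUri)
    spark.read.format("parquet").load(parquetUri).show()

    spark.stop()
  }
}
```

The file: URI is what keeps the read on the local file system even when the default file system is HDFS.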
(2) Read a Hive table with spark.sql. The Hive database here is default (the default database name can be omitted from the query), and the table is people.
scala> val peopleDF=spark.sql("select * from default.people")
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1 more field]

scala> peopleDF.show
+--------+---+--------+
|    name|age| address|
+--------+---+--------+
|zhangsan| 22| chengdu|
|  wangwu| 33| beijing|
|    lisi| 28|shanghai|
+--------+---+--------+

scala> peopleDF.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- address: string (nullable = true)
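The claim that the default database name can be omitted can be checked with a small standalone program. This is only a sketch under stated assumptions: it uses Spark's built-in catalog in local mode instead of a real Hive metastore, and it creates the people table itself from the three rows shown above. The object name ReadTableExample and its helpers are hypothetical. Against an actual Hive installation you would instead add .enableHiveSupport() to the builder, which needs the spark-hive module on the classpath.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

object ReadTableExample {
  // Local session with a throwaway warehouse directory.
  // Assumption: no Hive metastore; for real Hive tables add
  // .enableHiveSupport() before getOrCreate().
  def localSession(): SparkSession =
    SparkSession.builder()
      .appName("ReadTableExample")
      .master("local[*]")
      .config("spark.sql.warehouse.dir",
        Files.createTempDirectory("warehouse").toUri.toString)
      .getOrCreate()

  // Creates default.people from the rows shown in the transcript above.
  def createPeopleTable(spark: SparkSession): Unit = {
    import spark.implicits._ // needed for toDF
    Seq(
      ("zhangsan", 22, "chengdu"),
      ("wangwu", 33, "beijing"),
      ("lisi", 28, "shanghai")
    ).toDF("name", "age", "address")
      .write.mode("overwrite").saveAsTable("default.people")
  }

  def main(args: Array[String]): Unit = {
    val spark = localSession()
    createPeopleTable(spark)

    // Qualified and unqualified names resolve to the same table,
    // because default is the current database.
    val qualified = spark.sql("select * from default.people")
    val unqualified = spark.sql("select * from people")
    assert(qualified.collect().toSet == unqualified.collect().toSet)

    // spark.table is a shortcut for a plain "select *" query.
    spark.table("people").printSchema()
    qualified.show()

    spark.stop()
  }
}
```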
2, Select data columns
Columns can be selected in three ways, as follows:
scala> peopleDF.select("name","age").show
+--------+---+
|    name|age|
+--------+---+
|zhangsan| 22|
|  wangwu| 33|
|    lisi| 28|
+--------+---+

scala> peopleDF.select($"name",$"age").show
+--------+---+
|    name|age|
+--------+---+
|zhangsan| 22|
|  wangwu| 33|
|    lisi| 28|
+--------+---+

scala> peopleDF.select(peopleDF.col("name"),peopleDF.col("age")).show
+--------+---+
|    name|age|
+--------+---+
|zhangsan| 22|
|  wangwu| 33|
|    lisi| 28|
+--------+---+
Note: if you write the code in IDEA and use $, you must add the statement import spark.implicits._ (where spark is your SparkSession value); otherwise the $ expression will not compile. The Spark shell performs this import by default.
$"column name" is syntactic sugar that returns a Column object.
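For code written in an IDE rather than in the Spark shell, the import and the three selection styles might look like the following. This is a minimal sketch assuming local mode and a small in-memory DataFrame built from the rows shown above; the object name SelectColumnsExample and the helper samplePeople are hypothetical.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

object SelectColumnsExample {
  // Hypothetical helper: the three rows from the transcript above.
  def samplePeople(spark: SparkSession): DataFrame = {
    import spark.implicits._ // needed for toDF
    Seq(
      ("zhangsan", 22, "chengdu"),
      ("wangwu", 33, "beijing"),
      ("lisi", 28, "shanghai")
    ).toDF("name", "age", "address")
  }

  def main(args: Array[String]): Unit = {
    // spark must be a val: the import below needs a stable identifier.
    val spark = SparkSession.builder()
      .appName("SelectColumnsExample")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()
    import spark.implicits._ // brings the $"..." interpolator into scope

    val peopleDF = samplePeople(spark)

    // The three equivalent ways to select columns.
    val byName = peopleDF.select("name", "age")
    val byDollar = peopleDF.select($"name", $"age")
    val byCol = peopleDF.select(peopleDF.col("name"), peopleDF.col("age"))
    assert(byName.collect().toSet == byDollar.collect().toSet)
    assert(byName.collect().toSet == byCol.collect().toSet)

    // Because $"age" is a Column, it can take part in expressions,
    // which a plain string column name cannot.
    val ageCol: Column = $"age"
    peopleDF.select($"name", ageCol + 1).show()

    spark.stop()
  }
}
```

Removing the import spark.implicits._ line inside main makes every $"..." expression fail to compile, which is the error the note above describes.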