Common operators for Spark Core learning (including classic interview questions)
Contents
Preface
Part I: Transformation operators
Value type
map(): map each element
mapPartitions(): execute map in units of partitions
mapPartitionsWithIndex(): mapPartitions with the partition index
flatMap(): flatten
glom(): convert each partition into an array
groupBy(): group
Extension: complex wordcount
filter(): filter
sample(): sample
distinct(): deduplicate
coalesce(): repartition
rep ...
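As a quick illustration of several of the operators listed above, a minimal Scala sketch (local mode; the data is made up):

import org.apache.spark.{SparkConf, SparkContext}

object ValueOperatorDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("operators"))
    val rdd = sc.parallelize(Seq(1, 2, 2, 3, 4), 2)  // 2 partitions: [1, 2] and [2, 3, 4]

    rdd.map(_ * 2).collect()                          // map each element: Array(2, 4, 4, 6, 8)
    rdd.mapPartitions(it => it.map(_ * 2)).collect()  // same result, but one call per partition
    rdd.mapPartitionsWithIndex((idx, it) => it.map(x => (idx, x))).collect() // tag values with their partition index
    rdd.flatMap(x => Seq(x, x)).collect()             // flatten: each element emitted twice
    rdd.glom().collect()                              // Array(Array(1, 2), Array(2, 3, 4))
    rdd.groupBy(_ % 2).collect()                      // group by parity
    rdd.filter(_ > 2).collect()                       // Array(3, 4)
    rdd.distinct().collect()                          // duplicates removed
    rdd.coalesce(1).getNumPartitions                  // 1

    sc.stop()
  }
}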
Posted by prawn_86 on Tue, 26 Oct 2021 21:12:41 -0700
Scala -- basic syntax
1. Brief description of syntax
/*
object: keyword that declares a singleton object (companion object)
*/
object HelloWorld {
  /*
  main method: the method that is executed; it can be called directly from outside
  def methodName(parameterName: parameterType): returnType = { method body }
  */
  def main(args: Array[String]): Unit = {
    println('H ...
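The excerpt is cut off here; the program presumably prints a greeting, so a complete, runnable version would look roughly like:

object HelloWorld {
  // main method: the program entry point, callable directly from outside
  def main(args: Array[String]): Unit = {
    println("Hello world")  // assumed output; the original line is truncated above
  }
}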
Posted by quimkaos on Sun, 24 Oct 2021 11:44:53 -0700
Building the data warehouse environment
Setting up the Hive environment
Introduction to Hive engines
Hive's execution engines include the default MR, Tez, and Spark. Hive on Spark: Hive not only stores the metadata but is also responsible for SQL parsing and optimization; the syntax is HQL, while the execution engine becomes Spark, which runs the resulting RDDs. Spark on Hive: Hive is only ...
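For the Spark on Hive case, a minimal sketch of reading Hive metadata from Spark (assuming a reachable Hive metastore and a Spark build with Hive support):

import org.apache.spark.sql.SparkSession

object SparkOnHiveDemo {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport(): Spark SQL reads table metadata from the Hive metastore,
    // while parsing, optimization, and execution all happen inside Spark.
    val spark = SparkSession.builder()
      .appName("spark-on-hive")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW TABLES").show()  // the table list comes from the Hive metastore
    spark.stop()
  }
}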
Posted by wha??? on Thu, 21 Oct 2021 19:46:17 -0700
Solutions to Spark data skew
Data skew caused by Shuffle
When data skew occurs during a Shuffle, the usual troubleshooting steps are:
① Check the Web UI to see how tasks execute within each Stage of each Job, and whether some tasks take conspicuously longer than the rest.
② If a task reports an error, check the corresponding log excepti ...
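To complement step ①, one quick way to confirm a skew suspicion is to sample the key distribution; a minimal sketch (the data here is a stand-in):

import org.apache.spark.{SparkConf, SparkContext}

object SkewProbe {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("skew-probe"))
    // Stand-in data: the key "hot" is heavily over-represented
    val pairs = sc.parallelize(Seq.fill(1000)(("hot", 1)) ++ Seq(("a", 1), ("b", 1)))

    // Count records per key and print the heaviest keys; a key whose count
    // dwarfs the rest explains the long-running tasks seen in the Web UI.
    pairs.map { case (k, _) => (k, 1L) }
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .take(10)
      .foreach(println)

    sc.stop()
  }
}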
Posted by dstantdog3 on Sat, 16 Oct 2021 10:15:13 -0700
Spark advanced: using Structured Streaming
Spark 2.0 introduced a new stream processing framework, Structured Streaming: a scalable, fault-tolerant stream processing engine built on the Spark SQL engine. With Structured Streaming, you can run streaming computations over a Dataset/DataFrame just as you would batch computations over static data. As data continuously arrives, the Spark SQL engine w ...
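A minimal sketch of that batch-like API, the canonical streaming word count (assuming a text source on socket localhost:9999):

import org.apache.spark.sql.SparkSession

object StructuredWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")  // local mode for illustration
      .appName("structured-wc")
      .getOrCreate()
    import spark.implicits._

    // The unbounded socket stream is treated as an ever-growing DataFrame
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The same Dataset/DataFrame operations as in batch code
    val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

    // "complete" mode re-emits the full aggregate table on every trigger
    val query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()
  }
}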
Posted by kimandrel on Sat, 09 Oct 2021 22:42:45 -0700
Spark learning notes - installation and configuration of a Spark cluster
1, Spark cluster topology
1.1 Cluster scale
192.168.128.10  master  1.5~2 GB RAM, 20 GB disk, NAT, 1~2 cores
192.168.128.11  node1   1 GB RAM, 20 GB disk, NAT, 1 core
192.168.128.12  node2   1 GB RAM, 20 GB disk, NAT, 1 core
192.168.128.13  node3   1 GB RAM, 20 GB disk, NAT, 1 core
1.2 Spark installation mo ...
Posted by Snewzzer on Thu, 07 Oct 2021 01:32:19 -0700
Spark series tutorial -- Word Count, the "Hello World" of big data
Basic summary
Spark is a fast, general-purpose, scalable big data analytics engine: a parallel computing framework for big data based on in-memory computation. Spark was born in 2009 in the AMP Lab at the University of California, Berkeley; it was open-sourced in 2010 and became a top-level Apache project in February 2014.
This article is the f ...
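A minimal Spark Core word count in Scala, matching the topic of this post (local mode; the input path is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("wordcount"))

    sc.textFile("input.txt")  // hypothetical input file
      .flatMap(_.split(" "))  // split each line into words
      .map((_, 1))            // pair every word with a count of 1
      .reduceByKey(_ + _)     // sum the counts per word
      .collect()
      .foreach(println)

    sc.stop()
  }
}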
Posted by mator on Sat, 25 Sep 2021 11:50:44 -0700
Spark 2.4.8 RDD partitions and custom partitioner cases
1. Description
Intended reader: Spark beginners
Development environment: IDEA + Spark 2.4.8 + JDK 1.8.0_301
Computer configuration: 4 cores, 8 threads
How to check the CPU:
On Windows, type "wmic" at the cmd prompt, then enter "cpu get Name", "cpu get NumberOfCores", "cpu get NumberOfLogicalProcessors" in the ...
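Since the post is about custom partitioning, a minimal sketch of a custom Partitioner (names and data are illustrative):

import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// A toy partitioner: keys starting with "a" go to partition 0, everything else to partition 1
class FirstLetterPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (key.toString.startsWith("a")) 0 else 1
}

object CustomPartitionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("partitions"))
    val rdd = sc.parallelize(Seq(("apple", 1), ("banana", 1), ("avocado", 1)))

    val partitioned = rdd.partitionBy(new FirstLetterPartitioner)
    println(partitioned.getNumPartitions)  // 2
    partitioned.glom().collect().foreach(p => println(p.mkString(", ")))

    sc.stop()
  }
}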
Posted by snoopgreen on Sat, 25 Sep 2021 09:13:54 -0700
Spark source code: tracing task submission in YARN cluster mode
1, Run command
bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
examples/jars/spark-examples_2.11-2.3.1.3.0.1.0-187.jar
2, Task submission flowchart
3, Startup script
Looking at the spark-submit script file, the program entry point is
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.depl ...
Posted by stevel on Tue, 21 Sep 2021 03:20:03 -0700
Spark -- Spark Core programming (RDD)
For high-concurrency, high-throughput data processing in different application scenarios, the Spark computing framework encapsulates three data structures:
RDD: resilient distributed dataset
Accumulator: a distributed, shared write-only variable
Broadcast variable: a distributed, shared read-only variable
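A compact sketch showing the three structures side by side (an illustrative local-mode example):

import org.apache.spark.{SparkConf, SparkContext}

object ThreeStructuresDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("structures"))

    val rdd = sc.parallelize(1 to 5)     // RDD: the distributed dataset itself
    val acc = sc.longAccumulator("sum")  // Accumulator: executors write, the driver reads
    val factor = sc.broadcast(10)        // Broadcast variable: read-only value shared with executors

    rdd.foreach(x => acc.add(x))         // write-only from the executors' point of view
    val scaled = rdd.map(_ * factor.value).collect()

    println(acc.value)                   // 15, read back on the driver
    println(scaled.mkString(","))        // 10,20,30,40,50
    sc.stop()
  }
}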
RDD
1. What is RDD
RDD (Resilient Di ...
Posted by ramez_sever on Sat, 18 Sep 2021 11:30:40 -0700