Common operators for Spark Core learning (including classic interview questions)

Catalogue. Preface. Part I: Transformation operators. Value type: map(); mapPartitions() executes map in partition units; mapPartitionsWithIndex() carries the partition index; flatMap() flattens; glom() converts each partition to an array; groupBy() groups (extension: complex wordcount); filter(); sample(); distinct() deduplicates; coalesce() repartitions; rep ...
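As a quick illustration of several of the operators listed above, here is a minimal sketch (assuming a local SparkContext; the data and variable names are made up):

import org.apache.spark.{SparkConf, SparkContext}

object TransformationDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("operators"))
    val rdd = sc.makeRDD(1 to 8, 2)

    val doubled = rdd.map(_ * 2)                                        // map(): transform element by element
    val sums    = rdd.mapPartitions(it => Iterator(it.sum))             // mapPartitions(): one call per partition
    val tagged  = rdd.mapPartitionsWithIndex((i, it) => it.map(v => (i, v))) // tag each value with its partition index
    val arrays  = rdd.glom()                                            // glom(): each partition becomes an Array
    val groups  = rdd.groupBy(_ % 2)                                    // groupBy(): group by the result of a function

    println(groups.collect().mkString(", "))
    sc.stop()
  }
}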

Posted by prawn_86 on Tue, 26 Oct 2021 21:12:41 -0700

Scala -- basic syntax

1. Brief description of syntax

/* object: keyword that declares a singleton object (companion object) */
object HelloWorld {
  /* main method: the executed method, callable directly from the outside.
     def methodName(parameterName: parameterType): returnType = { method body } */
  def main(args: Array[String]): Unit = {
    println("H ...
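Completed for reference, a runnable version of the snippet above (the full greeting string is an assumption, since the excerpt cuts off mid-call):

object HelloWorld {
  // main: the program entry point, invoked from outside
  def main(args: Array[String]): Unit = {
    println("Hello World") // assumed completion of the truncated println call
  }
}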

Posted by quimkaos on Sun, 24 Oct 2021 11:44:53 -0700

Construction of data warehouse environment

Hive environment construction. Hive engine introduction. Hive's execution engines include MR (the default), Tez, and Spark. Hive on Spark: Hive not only stores the metadata but is also responsible for SQL parsing and optimization; the syntax is HQL, but the execution engine becomes Spark, which is responsible for running the RDDs. Spark on Hive: Hive is only ...
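A minimal sketch of the "Spark on Hive" arrangement, where Spark is the SQL engine and Hive supplies only the metastore (it assumes a hive-site.xml is already on the classpath):

import org.apache.spark.sql.SparkSession

object SparkOnHiveDemo {
  def main(args: Array[String]): Unit = {
    // Spark on Hive: Spark parses and executes the SQL; Hive contributes only metadata
    val spark = SparkSession.builder()
      .appName("spark-on-hive")
      .enableHiveSupport() // read table definitions from the Hive metastore
      .getOrCreate()

    spark.sql("SHOW TABLES").show()
    spark.stop()
  }
}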

Posted by wha??? on Thu, 21 Oct 2021 19:46:17 -0700

Solutions to Spark data skew

Data skew caused by Shuffle. When data skew occurs during a shuffle, we generally follow these troubleshooting steps: ① check the Web UI to see how the tasks in each Job's stages executed and whether any task's execution time is obviously too long; ② if a task reports an error, check the corresponding log excepti ...
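Once a skewed key has been identified, one commonly used remedy is two-stage aggregation with salted keys; a sketch of the idea (the salt range and sample data are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object SaltedAggregation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("skew"))
    val pairs = sc.makeRDD(Seq(("hot", 1), ("hot", 1), ("hot", 1), ("cold", 1)))

    val salted  = pairs.map { case (k, v) => (s"${Random.nextInt(10)}#$k", v) } // stage 1: spread the hot key
    val partial = salted.reduceByKey(_ + _)                                     // local aggregation per salted key
    val result  = partial
      .map { case (k, v) => (k.split("#")(1), v) }                              // stage 2: strip the salt prefix
      .reduceByKey(_ + _)                                                       // final aggregation on real keys

    result.collect().foreach(println)
    sc.stop()
  }
}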

Posted by dstantdog3 on Sat, 16 Oct 2021 10:15:13 -0700

Spark advanced: stream processing with Structured Streaming

Spark 2.0 introduced a new stream processing framework, Structured Streaming, a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. With Structured Streaming, you can express a streaming computation in the same way as a batch computation on static data (Dataset/DataFrame). As data continues to arrive, the Spark SQL engine w ...
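The canonical example is a streaming word count over a socket source; a minimal sketch along the lines of the official Structured Streaming examples (host and port are assumptions):

import org.apache.spark.sql.SparkSession

object StructuredWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("structured-wordcount")
      .getOrCreate()
    import spark.implicits._

    // unbounded DataFrame: one row per line arriving on the socket
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // complete mode: re-emit the full updated counts as data keeps arriving
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}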

Posted by kimandrel on Sat, 09 Oct 2021 22:42:45 -0700

Spark learning notes - installation and configuration of a Spark cluster

1. Spark cluster topology. 1.1 Cluster scale: 192.168.128.10 master, 1.5G~2G memory, 20G disk, NAT, 1~2 cores; 192.168.128.11 node1, 1G memory, 20G disk, NAT, 1 core; 192.168.128.12 node2, 1G memory, 20G disk, NAT, 1 core; 192.168.128.13 node3, 1G memory, 20G disk, NAT, 1 core. 1.2 Spark installation mo ...

Posted by Snewzzer on Thu, 07 Oct 2021 01:32:19 -0700

Spark series tutorial "Hello World" -- the Word Count of big data

Basic summary. Spark is a fast, general-purpose, and scalable big data analysis engine; it is a parallel computing framework for big data based on in-memory computing. Spark was born in 2009 in the AMP Lab at the University of California, Berkeley, was open-sourced in 2010, and became a top-level Apache project in February 2014. This article is the f ...
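For reference, the classic RDD-based Word Count looks roughly like this (the input path is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("wordcount"))

    sc.textFile("input.txt")      // path is an assumption
      .flatMap(_.split(" "))      // split each line into words
      .map((_, 1))                // pair each word with a count of 1
      .reduceByKey(_ + _)         // sum the counts per word
      .collect()
      .foreach(println)

    sc.stop()
  }
}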

Posted by mator on Sat, 25 Sep 2021 11:50:44 -0700

Spark 2.4.8 RDD Partitions and Custom Partitioner Cases

1. Description. Readers: Spark beginners. Development environment: IDEA + Spark 2.4.8 + JDK 1.8.0_301. Computer configuration: 4 cores, 8 threads. How to check the CPU: on Windows, type "wmic" at the cmd prompt, then enter "cpu get Name", "cpu get NumberOfCores", "cpu get NumberOfLogicalProcessors" in the ...
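A small sketch of a custom partitioner of the kind this article builds toward (the class name, keys, and routing rule are made up):

import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// routes one designated key to partition 0 and everything else to partition 1
class TwoWayPartitioner(special: String) extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (key == special) 0 else 1
}

object CustomPartitionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("partitions"))
    val pairs = sc.makeRDD(Seq(("spark", 1), ("hive", 1), ("spark", 1), ("flink", 1)))

    val repartitioned = pairs.partitionBy(new TwoWayPartitioner("spark"))
    // glom() makes the resulting per-partition layout visible
    repartitioned.glom().collect().zipWithIndex.foreach { case (part, idx) =>
      println(s"partition $idx: ${part.mkString(", ")}")
    }
    sc.stop()
  }
}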

Posted by snoopgreen on Sat, 25 Sep 2021 09:13:54 -0700

Tracing task submission in yarn cluster mode through the Spark source code

1. Run command:
bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.11-2.3.1.3.0.1.0-187.jar
2. Task submission flowchart.
3. Startup script. View the spark-submit script file; the program entry is exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.depl ...
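For context, a Pi-estimation application in the spirit of the SparkPi example named in the command above; this is a sketch, not the actual example source shipped with Spark. Note that it sets no master: in yarn cluster mode the master comes from the spark-submit command line.

import org.apache.spark.sql.SparkSession
import scala.util.Random

object PiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pi-sketch").getOrCreate()
    val n = 100000
    // count random points that fall inside the unit circle
    val inside = spark.sparkContext.parallelize(1 to n)
      .filter { _ =>
        val x = Random.nextDouble() * 2 - 1
        val y = Random.nextDouble() * 2 - 1
        x * x + y * y <= 1
      }
      .count()
    println(s"Pi is roughly ${4.0 * inside / n}")
    spark.stop()
  }
}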

Posted by stevel on Tue, 21 Sep 2021 03:20:03 -0700

Spark -- Spark Core programming (RDD)

The Spark computing framework encapsulates three data structures for processing data in different high-concurrency, high-throughput application scenarios: RDD (resilient distributed dataset), accumulators (distributed shared write-only variables), and broadcast variables (distributed shared read-only variables). RDD. 1. What is RDD? RDD (Resilient Di ...
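A short sketch showing the three shared data structures side by side (the data, names, and stop-word set are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("shared"))

    val errorCount = sc.longAccumulator("errors")  // write-only from the executors' point of view
    val stopWords  = sc.broadcast(Set("a", "the")) // read-only copy shipped once per executor

    val words = sc.makeRDD(Seq("a", "spark", "the", "rdd", "error")) // RDD: the distributed dataset
    val kept = words.filter { w =>
      if (w == "error") errorCount.add(1)
      !stopWords.value.contains(w)
    }

    println(kept.collect().mkString(", "))       // the action triggers the accumulator updates
    println(s"errors seen: ${errorCount.value}") // read the accumulator on the driver only
    sc.stop()
  }
}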

Posted by ramez_sever on Sat, 18 Sep 2021 11:30:40 -0700