PySpark startup process demystified

Original author: Li Haiqiang, from the retail big data team of Ping An Bank. Preface: As a data engineer, you may have encountered many ways to start PySpark. You may not understand what they have in common, how they differ, and what impact the different methods have on program development and deployment ...

Posted by phant0m on Sat, 29 Feb 2020 23:24:31 -0800

NVIDIA RAPIDS cuGraph model

The RAPIDS cuGraph library is a collection of graph analytics algorithms that process data held in GPU DataFrames - see cuDF. cuGraph aims to provide a NetworkX-like API that is familiar to data scientists, so they can build GPU-accelerated workflows more easily. Official documentation: rapidsai/cugraph, cuGraph API Re ...
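
A minimal sketch of that NetworkX-style workflow, assuming an edge-list CSV with src/dst columns (the file name, column names, and dtypes here are illustrative, not from the article):

import cudf
import cugraph

# Load a hypothetical edge list into a GPU DataFrame
gdf = cudf.read_csv("edges.csv", names=["src", "dst"], dtype=["int32", "int32"])

# Build the graph and run PageRank, much as one would with NetworkX
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source="src", destination="dst")
scores = cugraph.pagerank(G)
print(scores.head())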

Posted by florida_guy99 on Tue, 25 Feb 2020 06:53:28 -0800

Common RDD operations in PySpark

Preparation:
import pyspark
from pyspark import SparkContext
from pyspark import SparkConf
conf = SparkConf().setAppName("lg").setMaster('local[4]')  # local[4] means to run on 4 local cores
sc = SparkContext.getOrCreate(conf)
1. parallelize and collect
The parallelize function converts a list obj ...
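
A runnable sketch of the parallelize/collect round trip under that configuration (the sample list is an assumption for illustration):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("lg").setMaster('local[4]')
sc = SparkContext.getOrCreate(conf)

# parallelize distributes a local list across the workers as an RDD;
# collect gathers the partitions back into a local list
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.collect())                        # [1, 2, 3, 4, 5]
print(rdd.map(lambda x: x * 2).collect())   # [2, 4, 6, 8, 10]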

Posted by moomsdad on Fri, 21 Feb 2020 02:13:19 -0800

Big data technology: Spark Streaming

Big data technology: Spark Streaming. 1: Overview. 1. Definition: Spark Streaming is used for stream processing. It supports many data input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once data is ingested, it can be processed with Spark's highly abstract primitives such ...
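
A minimal sketch of the TCP-socket input source mentioned above, counting words per batch (the host, port, and batch interval are assumptions):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)  # 1-second batch interval

# Read lines from a TCP socket and count words in each batch
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()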

Posted by croakingtoad on Mon, 10 Feb 2020 07:28:21 -0800

How to quickly build a Spark distributed architecture for big data

Build our Spark platform from scratch. 1. Preparing the CentOS environment. To build a real cluster environment and achieve a highly available architecture, we should prepare at least three virtual machines as cluster nodes, so I bought three Alibaba Cloud servers as our cluster nodes. ...

Posted by knelson on Tue, 04 Feb 2020 23:47:57 -0800

Counting adjacent word pairs in a large amount of data

This topic is similar to some of the search problems on LeetCode. The problem to solve is: count how often each pair of adjacent words occurs. Given the words w1, w2, w3, w4, w5, w6, the final output is (word, neighbor, frequency). We implement it in five ways: MapReduce, Spark, Spark SQL, the Scala method, and Spark SQL for Scala. MapReduce ...
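
A sketch of the core idea in PySpark (the sample word list is illustrative; the article's own five implementations are only excerpted above):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical input: a flat list of words in document order
words = ["w1", "w2", "w3", "w4", "w5", "w2", "w3"]

# Pair each word with its right-hand neighbor, then count pair frequencies
pairs = sc.parallelize(list(zip(words, words[1:])))
counts = pairs.map(lambda p: (p, 1)).reduceByKey(lambda a, b: a + b)

# Flatten to the (word, neighbor, frequency) shape described above
result = counts.map(lambda kv: (kv[0][0], kv[0][1], kv[1]))
print(result.collect())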

Posted by olechka on Sun, 02 Feb 2020 08:18:59 -0800

Spark SQL/DataFrame/Dataset operations ----- reading data

1, Read data source
(1) Read JSON using spark.read. Note: paths are resolved against HDFS by default; to read a local file you need the file:// prefix, as follows:
scala> val people = spark.read.format("json").load("file:///opt/software/data/people.json")
people: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scal ...
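
The same local-file read in PySpark, for comparison (a sketch; the path is taken from the Scala excerpt above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Without the file:// prefix this path would be resolved against HDFS
people = spark.read.format("json").load("file:///opt/software/data/people.json")
people.printSchema()  # age: bigint, name: string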

Posted by Pie on Sun, 02 Feb 2020 08:18:33 -0800

BitTorrent seed structure and encoding analysis

1, Data types. There are four data types: string, integer, list, and dictionary. Strings are encoded as <string length>:<string>. For example, 4:test represents the string "test", and 8:examples represents the string "examples". The string length is in bytes ...
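
A small sketch of that string encoding rule in Python (the helper names are mine, not from the article; the length is counted in bytes as stated above):

def bencode_str(s: str) -> bytes:
    # Encode as <length>:<bytes>, with the length counted in UTF-8 bytes
    data = s.encode("utf-8")
    return str(len(data)).encode() + b":" + data

def bdecode_str(buf: bytes) -> str:
    # Decode a single bencoded string from the front of buf
    length, _, rest = buf.partition(b":")
    return rest[: int(length)].decode("utf-8")

assert bencode_str("test") == b"4:test"
assert bdecode_str(b"8:examples") == "examples"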

Posted by inkdrop on Sat, 01 Feb 2020 05:03:09 -0800

The trap of Broadcast Join in Spark SQL 2.x (hint does not work)

Problem description: a hint is used to specify the broadcast table, but the specified broadcast is not performed. Advance preparation:
hive> select * from test.tmp_demo_small;
OK
tmp_demo_small.pas_phone  tmp_demo_small.age
156  20
157  22
158  15
hive> analyze table test.tmp_demo_small compute statis ...
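
When the SQL hint is ignored, one common workaround is to mark the small side explicitly with the DataFrame API; a sketch (tmp_demo_big is a hypothetical large table, not from the excerpt):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

small = spark.table("test.tmp_demo_small")
big = spark.table("test.tmp_demo_big")  # hypothetical large fact table

# broadcast() marks the small side explicitly, like a /*+ BROADCAST */ hint
joined = big.join(broadcast(small), "pas_phone")
joined.explain()  # check for BroadcastHashJoin in the physical plan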

Posted by cbullock on Fri, 17 Jan 2020 06:02:22 -0800

Big data learning: Flink

Contents
1: Introduction
2: Why Flink
3: Which industries need it
4: Features of Flink
5: Differences from Spark Streaming
6: First steps in development
7: Flink configuration description
8: Environment
9: Running components
1: Introduction
Flink is a framework and distributed com ...

Posted by stodge on Fri, 17 Jan 2020 01:18:24 -0800