PySpark startup process demystified
Original author: Li Haiqiang, from the retail big data team of Ping An Bank
Preface
As a data engineer, you will encounter many different ways to start PySpark, and it may not be obvious what they have in common, how they differ, and what impact each method has on program development and deployment ...
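As a quick illustration (my own minimal sketch, not the article's examples), two common entry points are the interactive pyspark shell, which creates spark and sc for you, and a plain Python script built around SparkSession, which also works when submitted with spark-submit; the app name and master below are arbitrary:

# Sketch: starting PySpark from an ordinary Python script (names are illustrative).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("startup-demo")   # hypothetical application name
         .master("local[2]")        # local mode with 2 cores, for illustration only
         .getOrCreate())
print(spark.version)
spark.stop()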
Posted by phant0m on Sat, 29 Feb 2020 23:24:31 -0800
NVIDIA RAPIDS cuGraph model
The RAPIDS cuGraph library is a collection of graph analytics used to process data in GPU DataFrames (see cuDF). cuGraph is designed to provide a NetworkX-like API familiar to data scientists, so they can more easily build GPU-accelerated workflows.
Official documentation: rapidsai/cugraph, cuGraph API Re ...
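As a rough sketch of that NetworkX-like workflow (my own illustration, assuming a CUDA-capable GPU with RAPIDS installed; the edge list is made up):

import cudf
import cugraph

# A tiny made-up edge list stored in a GPU DataFrame (cuDF).
edges = cudf.DataFrame({"src": [0, 1, 2, 2], "dst": [1, 2, 0, 3]})

G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

# PageRank runs on the GPU and returns a cuDF DataFrame of (vertex, pagerank).
scores = cugraph.pagerank(G)
print(scores.sort_values("pagerank", ascending=False).head())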
Posted by florida_guy99 on Tue, 25 Feb 2020 06:53:28 -0800
Common RDD operations in PySpark
Preparation:
import pyspark
from pyspark import SparkContext
from pyspark import SparkConf
conf = SparkConf().setAppName("lg").setMaster('local[4]')  # local[4] means run locally with 4 cores
sc = SparkContext.getOrCreate(conf)
1. parallelize and collect
The parallelize function converts the list obj ...
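A short sketch of the two calls together (my own example, reusing the sc created above):

rdd = sc.parallelize([1, 2, 3, 4, 5])  # distribute a local list as an RDD
doubled = rdd.map(lambda x: x * 2)     # transformations are lazy
print(doubled.collect())               # collect() brings the results back to the driver: [2, 4, 6, 8, 10]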
Posted by moomsdad on Fri, 21 Feb 2020 02:13:19 -0800
Spark Streaming of big data technology
1: Overview
1. Definition:
Spark Streaming is used for streaming data processing. It supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once data has been ingested, you can work with it using Spark's highly abstract primitives such ...
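For instance, a minimal word count over one of the listed sources, a plain TCP socket (my own sketch; host, port, and batch interval are arbitrary):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "socket-wordcount")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's counts to stdout

ssc.start()
ssc.awaitTermination()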
Posted by croakingtoad on Mon, 10 Feb 2020 07:28:21 -0800
How to quickly build a Spark distributed architecture for big data
Build our Spark platform from scratch
1. Preparing the CentOS environment
To build a real cluster environment with a highly available architecture, we need at least three virtual machines as cluster nodes, so I purchased three Alibaba Cloud servers to serve as our nodes.
...
Posted by knelson on Tue, 04 Feb 2020 23:47:57 -0800
Find the number of adjacent words in a large amount of data
This problem is similar to some of the search problems on LeetCode.
The problem to solve is: for each word, count how often each of its adjacent words appears. If the words are w1, w2, w3, w4, w5, w6, then:
The final output is (word, neighbor, frequency); a minimal sketch of this output appears below.
We implement it in five ways:
MapReduce
Spark
Spark SQL
Scala
Spark SQL in Scala
MapReduce ...
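As a baseline for the implementations above, here is a minimal PySpark sketch of the target (word, neighbor, frequency) output (not the article's code; it assumes "adjacent" means the word immediately before or after, and the input line is made up):

lines = sc.parallelize(["w1 w2 w3 w4 w5 w6 w1 w2"])  # made-up input

def neighbor_pairs(line):
    ws = line.split()
    for i, w in enumerate(ws):
        if i > 0:
            yield ((w, ws[i - 1]), 1)   # left neighbor
        if i < len(ws) - 1:
            yield ((w, ws[i + 1]), 1)   # right neighbor

result = (lines.flatMap(neighbor_pairs)
               .reduceByKey(lambda a, b: a + b)
               .map(lambda kv: (kv[0][0], kv[0][1], kv[1])))
print(result.collect())  # tuples of (word, neighbor, frequency)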
Posted by olechka on Sun, 02 Feb 2020 08:18:59 -0800
Spark SQL/DataFrame/DataSet operation ----- read data
1. Reading data sources
(1) Read JSON using spark.read. Note: paths are resolved against HDFS by default. If you want to read a local file, you need to prefix it with file://, as follows:
scala> val people = spark.read.format("json").load("file:///opt/software/data/people.json")
people: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scal ...
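For comparison, here is the same read in PySpark (a sketch; it assumes the same local file as the Scala example above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-json").getOrCreate()
# Without the file:// prefix, the path would be resolved against HDFS.
people = spark.read.format("json").load("file:///opt/software/data/people.json")
people.printSchema()  # age: bigint, name: string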
Posted by Pie on Sun, 02 Feb 2020 08:18:33 -0800
BT seed (torrent file) structure and encoding analysis
1. Data types
There are four types of data: string, integer, list and dictionary.
Strings (string)
Encoded as: <string length>:<string>
For example, 4:test represents the string "test",
and 8:examples represents the string "examples".
The string length is measured in bytes ...
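A small Python sketch of this rule (my own illustration; note that the length counts bytes, not characters):

def bencode_str(s: str) -> bytes:
    data = s.encode("utf-8")
    # <string length>:<string>, with the length measured in bytes
    return str(len(data)).encode("ascii") + b":" + data

print(bencode_str("test"))      # b'4:test'
print(bencode_str("examples"))  # b'8:examples'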
Posted by inkdrop on Sat, 01 Feb 2020 05:03:09 -0800
The trap of Broadcast Join in Spark SQL 2.x (the hint does not work)
Problem description
A hint is used to specify the table to broadcast, but the specified broadcast is not performed.
Preparation
hive> select * from test.tmp_demo_small;
OK
tmp_demo_small.pas_phone tmp_demo_small.age
156 20
157 22
158 15
hive> analyze table test.tmp_demo_small compute statis ...
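For reference, two common ways to request a broadcast join in Spark 2.x (a sketch using the small table above; test.tmp_demo_big is a hypothetical large table, and whether the hint is honored is exactly what this article investigates):

from pyspark.sql.functions import broadcast

small = spark.table("test.tmp_demo_small")
big = spark.table("test.tmp_demo_big")  # hypothetical large table

# DataFrame API: wrap the small side in broadcast()
joined = big.join(broadcast(small), "pas_phone")

# SQL hint syntax (the form the article reports as not always working):
joined_sql = spark.sql("""
    SELECT /*+ BROADCAST(s) */ *
    FROM test.tmp_demo_big b
    JOIN test.tmp_demo_small s
    ON b.pas_phone = s.pas_phone
""")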
Posted by cbullock on Fri, 17 Jan 2020 06:02:22 -0800
Flink in big data learning
Contents
1: Introduction
2: Why Flink
3: Which industries need it
4: Features of Flink
5: Differences from Spark Streaming
6: Preliminary development
7: Flink configuration description
8: Environment
9: Running components
1: Introduction
Flink is a framework and distributed com ...
Posted by stodge on Fri, 17 Jan 2020 01:18:24 -0800