Big data technology: Spark Streaming
Spark Streaming is used for streaming data processing. It supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once data is ingested, you can process it with Spark's high-level primitives such ...
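The micro-batch idea behind Spark Streaming can be sketched in plain Scala, with no Spark dependency: the input is chopped into small batches and each batch is processed with ordinary transformations, the way a DStream applies per-batch operations. `MicroBatchSketch`, the batch size, and the sample lines are all made up for illustration:

```scala
// Sketch of Spark Streaming's micro-batch model in plain Scala.
object MicroBatchSketch {
  // Chop the incoming lines into fixed-size batches, as the
  // streaming engine does with a batch interval.
  def batches(lines: Seq[String], batchSize: Int): Seq[Seq[String]] =
    lines.grouped(batchSize).toSeq

  // Word count over one batch, analogous to
  // flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).
  def wordCount(batch: Seq[String]): Map[String, Int] =
    batch.flatMap(_.split("\\s+"))
      .groupBy(identity)
      .map { case (w, ws) => (w, ws.size) }
}
```

With Spark, the same per-batch word count would run on each RDD of the DStream instead of a plain `Seq`.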
Posted by croakingtoad on Mon, 10 Feb 2020 07:28:21 -0800
Build our Spark platform from scratch
1. Preparing the CentOS environment
To build a real cluster environment with a highly available architecture, we need at least three virtual machines as cluster nodes, so I bought three Alibaba Cloud servers to serve as our nodes.
Posted by knelson on Tue, 04 Feb 2020 23:47:57 -0800
This topic is similar to some of the search topics in Leetcode.
The problem to solve is: count the frequency of each pair of adjacent words. If the words are w1,w2,w3,w4,w5,w6, then:
The final output is (word,neighbor,frequency).
We implement it in five ways:
Spark SQL method
Spark SQL for Scala
Posted by olechka on Sun, 02 Feb 2020 08:18:59 -0800
1, Read data source
(1) Read JSON with spark.read. Note: paths are resolved against HDFS by default; to read a local file you must add the file:// prefix, as follows
scala> val people = spark.read.format("json").load("file:///opt/software/data/people.json")
people: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
Posted by Pie on Sun, 02 Feb 2020 08:18:33 -0800
1, Data type
There are four types of data: string, integer, list and dictionary.
Strings are encoded as <string length>:<string>
For example, 4:test represents the string "test"
and 8:examples represents the string "examples"
The string length is measured in bytes ...
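A minimal sketch of the string encoding rule in Scala (`BencodeSketch` and `encodeString` are made-up names; the length prefix counts bytes, not characters):

```scala
// Bencode string rule: <length in bytes>:<string>.
object BencodeSketch {
  def encodeString(s: String): String =
    s"${s.getBytes("UTF-8").length}:$s"
}
```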
Posted by inkdrop on Sat, 01 Feb 2020 05:03:09 -0800
A hint is used to specify the broadcast table, but the specified broadcast is not actually performed;
preparation in advance
hive> select * from test.tmp_demo_small;
hive> analyze table test.tmp_demo_small compute statis ...
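For reference, the hint being tested would look something like this in Spark SQL (the large table `tmp_demo_big` and the join column `id` are made-up names; `MAPJOIN` is accepted as an alias of `BROADCAST`):

```sql
-- Hypothetical query asking the optimizer to broadcast the small table
SELECT /*+ BROADCAST(s) */ b.*, s.*
FROM test.tmp_demo_big b
JOIN test.tmp_demo_small s ON b.id = s.id;
```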
Posted by cbullock on Fri, 17 Jan 2020 06:02:22 -0800
2: Why Flink
3: What industries need
4: Features of Flink
5: Differences from Spark Streaming
6: Preliminary development
7: Flink configuration description
9: Running components
Flink is a framework and distributed com ...
Posted by stodge on Fri, 17 Jan 2020 01:18:24 -0800
Project address: https://github.com/KingBobTitan/hadoop.git
MR's Shuffle explanation and Join implementation
1. MapReduce's history monitoring service: JobHistoryServer
Function: used to monitor the information of all MapReduce programs running on YARN
Configure log ...
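A minimal mapred-site.xml sketch for the JobHistoryServer addresses (the host name `node01` is a placeholder; the ports shown are the usual defaults):

```xml
<!-- mapred-site.xml: JobHistoryServer RPC and web UI addresses -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>node01:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>node01:19888</value>
</property>
```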
Posted by nick1 on Tue, 14 Jan 2020 02:21:13 -0800
Using waterdrop to filter and process log files and store the data
Download the installation package of waterdrop using wget
Extract to the directory you need
unzip <package location> -d <decompression location>
If unzip reports an error, install the unzip command first.
Set the dependency env ...
Posted by PhantomCube on Mon, 13 Jan 2020 01:04:18 -0800
There is nothing special to say about regular window functions; they are simple. This article introduces grouped windows, focusing on how ROWS BETWEEN is used after grouping and sorting.
The key is to understand the meaning of the keywords in ROWS BETWEEN:
FOLLOWING refers to the rows after the current row
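The frame semantics can be sketched in plain Scala: a moving sum over a frame like ROWS BETWEEN 2 PRECEDING AND CURRENT ROW, where each row aggregates itself plus up to two rows before it within the sorted partition (`RowsBetweenSketch` is a made-up name):

```scala
// Sketch of ROWS BETWEEN <n> PRECEDING AND CURRENT ROW on a
// sorted partition: each output row sums itself plus the n
// previous rows, mirroring the SQL frame clause.
object RowsBetweenSketch {
  def movingSum(sorted: Seq[Int], preceding: Int): Seq[Int] =
    sorted.indices.map { i =>
      val from = math.max(0, i - preceding)  // frame start, clipped at row 0
      sorted.slice(from, i + 1).sum          // frame end is the current row
    }
}
```

For `Seq(1, 2, 3, 4)` with 2 PRECEDING this yields the running frame sums 1, 3, 6, 9, matching what the SQL window would produce row by row.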
Posted by skyxmen on Thu, 09 Jan 2020 07:26:16 -0800