Introduction to SparkSQL Case 2 (SparkSQL 1.x)

The main steps in the introductory case of Spark SQL are as follows: 1. Create a SparkContext; 2. Create a SQLContext; 3. Create an RDD; 4. Create a class and define its member variables; 5. Collate the data and associate it with the class; 6. Convert the RDD to a DataFrame (import the implicit conversions); 7. Register the DataFrame as a temporary table; 8. Writi ...

Posted by pietbez on Thu, 31 Jan 2019 19:57:15 -0800

Introduction to Spark 4 (RDD Advanced Operators 1)

1. mapPartitionsWithIndex. Create an RDD with the number of partitions specified as 2: scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7), 2) View the partitions: scala> rdd1.partitions — as follows: res0: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@691, org.apache.spark.rdd.ParallelColle ...

Posted by r4ck4 on Wed, 30 Jan 2019 19:06:16 -0800
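What mapPartitionsWithIndex does can be illustrated without a cluster: below is a plain-Java sketch that splits a list into two chunks using the same slicing formula Spark's ParallelCollectionRDD uses, then tags each element with its partition index, mimicking rdd1.mapPartitionsWithIndex((idx, it) => ...). Class and method names here are illustrative, not Spark API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MapPartitionsWithIndexDemo {
    // Split a list into n slices, mirroring Spark's ParallelCollectionRDD
    // slicing formula: slice i covers [i*size/n, (i+1)*size/n).
    static <T> List<List<T>> partition(List<T> data, int n) {
        List<List<T>> parts = new ArrayList<>();
        int size = data.size();
        for (int i = 0; i < n; i++) {
            int start = (int) ((long) i * size / n);
            int end = (int) ((long) (i + 1) * size / n);
            parts.add(data.subList(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        List<List<Integer>> parts = partition(data, 2);
        // Tag every element with the index of the partition it lives in.
        for (int idx = 0; idx < parts.size(); idx++) {
            for (int e : parts.get(idx)) {
                System.out.println("[partID:" + idx + ", val: " + e + "]");
            }
        }
    }
}
```

With 7 elements and 2 partitions, elements 1..3 land in partition 0 and 4..7 in partition 1, matching what the Spark shell shows.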

Spark Streaming integrates flume(Poll and Push)

As a real-time log collection framework, Flume can be connected to the Spark Streaming real-time processing framework: Flume produces data in real time and Spark Streaming processes it in real time. Spark Streaming docks with FlumeNG in two ways: one is that FlumeNG pushes messages to Spark Streaming (Push); the other is that S ...

Posted by kobayashi_one on Wed, 30 Jan 2019 17:18:15 -0800

spark-streaming sample program

Develop a spark-streaming program that receives data from a server port in real time and performs a word count. Environment setup: IDEA + Maven; the pom file is as follows: <?xml version="1.0" encoding="UTF-8"?> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLoc ...

Posted by phpnew on Wed, 30 Jan 2019 12:21:15 -0800

Spark Exercise: Finding the Top N Teachers per Subject

[Note] This article draws on learning videos from Calf School. Spark Exercise: Finding the Top N Teachers per Subject. Data format: http://bigdata.edu360.cn/laozhang 1. Data segmentation: val func = (line: String) => { val index = line.lastIndexOf("/") val teacher = line.substring(index + 1) val httpHost = line.substrin ...

Posted by SureFire on Tue, 29 Jan 2019 23:33:14 -0800
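The segmentation function in that excerpt can be sketched in plain Java: it takes a log line like http://bigdata.edu360.cn/laozhang, reads the teacher from the last path segment, and takes the first label of the host as the subject. The class name and the way the host is split are illustrative assumptions based on the truncated Scala snippet.

```java
public class TeacherParser {
    // Split a log line like "http://bigdata.edu360.cn/laozhang"
    // into { subject, teacher }.
    static String[] parse(String line) {
        int index = line.lastIndexOf("/");
        String teacher = line.substring(index + 1);
        // Everything before the last slash is scheme + host.
        String httpHost = line.substring(0, index);
        // Drop "http://", then take the host's first dot-separated label.
        String subject = httpHost.split("//")[1].split("\\.")[0];
        return new String[] { subject, teacher };
    }

    public static void main(String[] args) {
        String[] r = parse("http://bigdata.edu360.cn/laozhang");
        System.out.println(r[0] + " -> " + r[1]); // bigdata -> laozhang
    }
}
```

From here the exercise maps each line to ((subject, teacher), 1), reduces by key, and takes the top N per subject.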

SparkSQL View Debugging Generated Code

Websites and some books introduce Spark SQL (DataFrame) as generating the final running statements from the corresponding operations. This article starts from a simple, low-level problem and ends by looking at the generated code to find the root cause of the problem, with a brief introduction to how to debug Spark SQL. Sour ...

Posted by NikkiLoveGod on Tue, 29 Jan 2019 20:51:15 -0800

Spark Learning Notes (12) - SparkSQL

1 SparkSQL Introduction. Spark SQL is the module Spark uses to process structured data. It provides a programming abstraction called DataFrame and serves as a distributed SQL query engine. We have already learned Hive: it converts Hive SQL into MapReduce and submits it to the cluster for execution, which greatly simplifies the complexity of pr ...

Posted by Dorin85 on Sat, 26 Jan 2019 00:24:15 -0800

reduceByKey does not aggregate in a Spark program written with the Java API (custom type as Key)

When writing Spark with the Java API, if a PairRDD's key is a custom type, you need to override the hashCode and equals methods; otherwise you will find that identical key values are not aggregated. For example, using a User type as the key: public class User { private String name; private String age; public String getName() { return name; } pu ...

Posted by harrisonad on Thu, 24 Jan 2019 20:18:13 -0800
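A minimal sketch of the fix described above: a User key with equals and hashCode overridden so that equal keys collapse into one group. A plain HashMap demonstrates it here, since it relies on the same hashing contract as Spark's shuffle; the field values in main are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class User {
    private final String name;
    private final String age;

    public User(String name, String age) {
        this.name = name;
        this.age = age;
    }

    // Without these two overrides, two User objects with identical fields
    // hash differently, so reduceByKey (like HashMap) sees distinct keys.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof User)) return false;
        User u = (User) o;
        return Objects.equals(name, u.name) && Objects.equals(age, u.age);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, age);
    }

    public static void main(String[] args) {
        Map<User, Integer> counts = new HashMap<>();
        counts.merge(new User("zhangsan", "20"), 1, Integer::sum);
        counts.merge(new User("zhangsan", "20"), 1, Integer::sum);
        System.out.println(counts.size()); // 1 — both entries aggregated
    }
}
```

If equals and hashCode are removed, the map ends up with two entries, which is exactly the "same key not aggregated" symptom the article describes.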

Writing a WordCount program in IDEA under Windows and submitting it to a Hadoop cluster as a jar package (foolproof version)

Typically, programs are written in an IDE, then packaged as jar packages and submitted to the cluster. The most common method is to create a Maven project and manage the jar dependencies with Maven. 1. Generating the jar package for WordCount: 1) Open IDEA: File, New Project, Maven, Next; fill in GroupId and Artif ...

Posted by kcgame on Thu, 24 Jan 2019 19:45:14 -0800
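The counting logic that jar would carry can be sketched in plain Java: the same split-words, map-to-one, reduce-by-key pipeline the cluster job runs, expressed over a local string. The class and method names are illustrative, not taken from the article.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class WordCount {
    // flatMap(split on whitespace) -> map(word, 1) -> reduceByKey(sum),
    // expressed locally with a TreeMap for stable, sorted output.
    static Map<String, Long> count(String text) {
        Map<String, Long> counts = new TreeMap<>();
        Arrays.stream(text.split("\\s+"))
              .filter(w -> !w.isEmpty())
              .forEach(w -> counts.merge(w, 1L, Long::sum));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("hello world hello spark"));
        // {hello=2, spark=1, world=1}
    }
}
```

Once this logic is wrapped in a main class, the Maven packaging steps from the article produce the jar to submit with spark-submit or hadoop jar.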

Spark-core Comprehensive Exercise-IP Matching

ip.txt data:
220.177.248.0|220.177.255.255|3702650880|3702652927|Asia|China|Jiangxi|Nanchang|Telecom|360100|China|CN|115.892151|28.676493
220.178.0.0|220.178.56.113|3702652928|3702667377|Asia|China|Anhui|Hefei|Telecom|340100|China|CN|117.283042|31.86119
220.178.56.114|220.178.57.33|37026 ...

Posted by penguin_powered on Thu, 24 Jan 2019 16:18:14 -0800
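The core of this exercise is converting a dotted IP to the unsigned long in columns 3 and 4 of ip.txt, then binary-searching the sorted ranges. A minimal sketch, using the two sample ranges from the excerpt as data (the class name and lookup table are illustrative):

```java
public class IpMatch {
    // Convert a dotted-quad IP to the long form used in ip.txt columns 3-4.
    static long ip2Long(String ip) {
        long result = 0L;
        for (String part : ip.split("\\.")) {
            result = (result << 8) | Long.parseLong(part);
        }
        return result;
    }

    // Binary search the sorted, non-overlapping [start, end] ranges;
    // return the matching index, or -1 if no range contains ipNum.
    static int binarySearch(long[][] ranges, long ipNum) {
        int low = 0, high = ranges.length - 1;
        while (low <= high) {
            int mid = (low + high) / 2;
            if (ipNum < ranges[mid][0]) high = mid - 1;
            else if (ipNum > ranges[mid][1]) low = mid + 1;
            else return mid;
        }
        return -1;
    }

    public static void main(String[] args) {
        // The first two ranges from the ip.txt sample above.
        long[][] ranges = { {3702650880L, 3702652927L}, {3702652928L, 3702667377L} };
        String[] provinces = { "Jiangxi", "Anhui" };
        long ip = ip2Long("220.178.0.1");
        System.out.println(provinces[binarySearch(ranges, ip)]); // Anhui
    }
}
```

Note that ip2Long("220.177.248.0") yields 3702650880, exactly the third column of the first sample row, which is what makes the binary search over those columns work.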