HtmlAgilityPack + C# as an IP proxy crawler

1. Search data sources and collect as many IP proxy records as possible, storing them in an IP proxy pool. 2. Filter the data in the proxy pool, add the valid entries to another table, and keep it updated. 3. Update the IP proxy pool regularly, because the IP addresses on the website are updated in real time and the program needs to fi ...
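
The filtering step (step 2) can be sketched as follows. This is a minimal, stdlib-only illustration, not the article's HtmlAgilityPack/C# implementation; the `check_proxy` helper and the test URL are assumptions.

```python
# Minimal sketch of proxy-pool filtering: try one request through each
# proxy and keep only the ones that answer. Helper names are illustrative.
import urllib.request


def check_proxy(proxy, test_url="http://example.com", timeout=5):
    """Return True if one test request through `proxy` succeeds."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    opener = urllib.request.build_opener(handler)
    try:
        opener.open(test_url, timeout=timeout)
        return True
    except Exception:
        return False


def filter_pool(pool, checker=check_proxy):
    """Step 2: keep only the proxies that pass validation."""
    return [p for p in pool if checker(p)]
```

Running `filter_pool` on a schedule (step 3) keeps the "valid" table fresh as proxies die off.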

Posted by paradigmapc on Thu, 21 Nov 2019 12:23:12 -0800

Eclipse integrated Hadoop plug-in development environment

First, set up the Hadoop environment under Windows 10 so that Hadoop can run. Extract the installation package and the source package of Hadoop 2.7.7, create an empty directory after decompression, and copy all the jar packages under share/hadoop in the installation package (except those in the kms directory), plus the other packages from the unpacked source package, to ...

Posted by ThunderVike on Tue, 19 Nov 2019 13:38:13 -0800

2. Principles and use of Spark -- Spark Core

1. Some basic terms in Spark. RDD: resilient distributed dataset, the core concept in Spark. Operators: functions for manipulating RDDs. Application: a user-written Spark program (DriverProgram + ExecutorProgram). Job: an operation triggered by an action-class operator. Stage: a set of tasks; a job is divided into several stages based on dependencie ...
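
The operator/job distinction above can be illustrated with a toy, stdlib-only class: transformations (map/filter) are lazy and only build a plan, while an action (collect) triggers the job. This is not real Spark, just a sketch of the idea.

```python
# Toy RDD: transformations accumulate a plan; collect() executes it.
class ToyRDD:
    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []           # pending transformations

    def map(self, f):                     # transformation: lazy
        return ToyRDD(self._data, self._plan + [("map", f)])

    def filter(self, f):                  # transformation: lazy
        return ToyRDD(self._data, self._plan + [("filter", f)])

    def collect(self):                    # action: triggers the "job"
        out = list(self._data)
        for kind, f in self._plan:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out


rdd = ToyRDD(range(5)).map(lambda x: x * 2).filter(lambda x: x > 4)
# Nothing has been computed yet; collect() runs the whole pipeline.
print(rdd.collect())  # -> [6, 8]
```

In real Spark the same chain would build a DAG that the scheduler cuts into stages at shuffle boundaries.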

Posted by FeeBle on Fri, 15 Nov 2019 22:22:07 -0800

After aggregating buckets, obtain the total number of buckets

Elasticsearch returns bucket size = 10 by default after bucketing. If bucketing produces many buckets, how do you get the total number of buckets? In other words, how do I find out in advance how many buckets the data will be divided into? A colleague says that you can find out how many buckets ...
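
One common approach (a hedged sketch, not necessarily the colleague's answer): alongside the terms aggregation, add a cardinality aggregation on the same field, which reports the approximate number of distinct values, i.e. how many buckets there would be. The field name below is an illustrative assumption.

```python
# Build an Elasticsearch query body that returns the top-N terms buckets
# plus the (approximate) total bucket count via a cardinality aggregation.
def bucket_count_query(field, size=10):
    return {
        "size": 0,  # we only want aggregations, not hits
        "aggs": {
            "by_field": {"terms": {"field": field, "size": size}},
            "total_buckets": {"cardinality": {"field": field}},
        },
    }
```

Note that `cardinality` is approximate (HyperLogLog++ based), which is usually acceptable for "how many buckets in total" questions.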

Posted by mattachoo on Fri, 15 Nov 2019 11:21:32 -0800

Python grabs the list of fantasy novels and stores it in Excel

Use requests to fetch the novel web pages, parse them with BeautifulSoup, extract and store the valid information, and use the xlwt module to create the Excel file, finally obtaining the Excel data. Using requests to get information from the novel web pages: first, import the requests li ...
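
The parse-and-extract step can be sketched with the stdlib `html.parser` (instead of BeautifulSoup, so the sketch is self-contained); the `<li class="novel">` markup and the sample titles are assumptions about the page, not the article's actual HTML.

```python
# Extract novel titles from list items marked class="novel".
from html.parser import HTMLParser


class NovelListParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "novel") in attrs:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())
            self.in_title = False


sample = '<ul><li class="novel">Battle Through the Heavens</li><li class="novel">Coiling Dragon</li></ul>'
p = NovelListParser()
p.feed(sample)
print(p.titles)
```

From here the article writes each extracted row into a worksheet with xlwt's `Workbook.add_sheet` and `sheet.write(row, col, value)`.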

Posted by azunoman on Wed, 13 Nov 2019 11:18:44 -0800

Troubleshooting Spark error -- Error initializing SparkContext

Spark reported an error when the spark job was submitted:

./spark-shell
19/05/14 05:37:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel) ...

Posted by motofzr1000 on Sun, 10 Nov 2019 08:02:46 -0800

Take you through the art of algorithms

What is an algorithm? Equations should be familiar: the unknown value is obtained by solving the equation, and we can loosely regard solving an equation as an algorithm. Of course, there is more to algorithms than that; don't worry, we'll get to it. Let's start with two pieces of code. Both pieces of code can be called algorithms because they solve ...

Posted by koddos on Sat, 09 Nov 2019 10:07:20 -0800

Spark SQL uses beeline to access hive warehouse

I. Add hive-site.xml. Add the hive-site.xml configuration file under $SPARK_HOME/conf so that Spark can access the hive metadata normally: vim hive-site.xml <configuration> <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://192.168.1.201:3306/hiveDB?createDatabaseIfNotExist=true ...

Posted by mrodrigues on Wed, 06 Nov 2019 14:06:19 -0800

I. HBase -- basic principles and use

Hotspot issues with HBase data: the solution is to preprocess the rowkeys of the hot data, adding prefixes so that the hot data is spread across multiple regions. Pre-merge? Dynamic partitioning? When the initial data is first loaded, it should be partitioned and stored in different regions so the load stays balanced. Example: for example, it is easy to divide ...
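
The prefix idea above is often called rowkey salting. A minimal sketch, assuming a hash-derived prefix and a region count of 4 (both illustrative choices, not from the article):

```python
# Salt a rowkey with a short hash prefix so sequential/hot keys are
# spread across `regions` region servers instead of one.
import hashlib


def salted_rowkey(rowkey, regions=4):
    salt = int(hashlib.md5(rowkey.encode()).hexdigest(), 16) % regions
    return f"{salt:02d}_{rowkey}"
```

The trade-off: point reads must recompute the salt, and range scans must fan out over all prefixes, which is why salting is reserved for genuinely hot write patterns.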

Posted by daniel_mintz on Mon, 04 Nov 2019 16:20:41 -0800

GaussDB 200 uses GDS to import data from a remote server

GaussDB 200 supports importing data in TEXT, CSV, and FIXED formats from a remote server into the cluster. This article introduces using the GDS (Gauss Data Service) tool to import data from a remote server into GaussDB 200. The environment is as follows: 1. Prepare the source data. Here, the data comes from a PostgreSQL database; use the copy command ...

Posted by phppssh on Sun, 03 Nov 2019 23:13:09 -0800