HtmlAgilityPack + C# IP proxy crawler
1. Search for and collect as many IP proxies as possible, and store them in an IP proxy pool
2. Filter the data in the proxy pool, add the valid entries to another table, and keep that table updated
3. Update the IP proxy pool regularly
Because the IP addresses on the website are updated in real time, the program needs to fi ...
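The article itself uses C# with HtmlAgilityPack, but the parse-and-validate step of the pool can be sketched language-neutrally in Python. The `host:port` storage format, test URL, and timeout below are illustrative assumptions, not the author's implementation:

```python
import urllib.request

def parse_proxy(entry):
    """Split a 'host:port' proxy string into its parts (assumed pool format)."""
    host, port = entry.rsplit(":", 1)
    return {"host": host, "port": int(port)}

def validate_proxy(entry, test_url="http://example.com", timeout=5):
    """Step 2: a proxy is 'valid' if it answers a test request (hypothetical check)."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": f"http://{entry}"})
    )
    try:
        opener.open(test_url, timeout=timeout)
        return True
    except OSError:
        return False

def filter_pool(pool):
    """Keep only the proxies that still respond, for the 'valid' table."""
    return [p for p in pool if validate_proxy(p)]
```

Running `filter_pool` on a schedule covers step 3, the regular refresh of the pool.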
Posted by paradigmapc on Thu, 21 Nov 2019 12:23:12 -0800
Eclipse integrated Hadoop plug-in development environment
First, set up the Hadoop environment under Win10 so that Hadoop can run. Extract the installation package and source package of Hadoop 2.7.7, create an empty directory after decompression, and copy all the jar packages under share/hadoop in the source package (and the other packages, except the kms directory) under the installation package to ...
Posted by ThunderVike on Tue, 19 Nov 2019 13:38:13 -0800
2. Principles and use of Spark -- Spark Core
1. Some basic terms in spark
RDD: Resilient Distributed Dataset, the core abstraction in Spark. Operators: functions for manipulating RDDs. application: a user-written Spark program (driver program + executor program). job: an operation triggered by an action-class operator. stage: a set of tasks; a job is divided into several stages based on dependencie ...
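The relationship between transformations, action operators, and jobs can be illustrated with a toy class in plain Python (this is an analogy, not Spark itself): transformations only record operators, and nothing runs until an action is called.

```python
# Toy illustration (plain Python, not Spark): transformations are lazy,
# and only an action such as collect() triggers the actual computation.
class ToyRDD:
    def __init__(self, data):
        self._data = data          # the dataset (here just a list)
        self._ops = []             # recorded transformations, not yet run

    def map(self, fn):             # transformation: only records the operator
        new = ToyRDD(self._data)
        new._ops = self._ops + [("map", fn)]
        return new

    def filter(self, pred):        # transformation: also lazy
        new = ToyRDD(self._data)
        new._ops = self._ops + [("filter", pred)]
        return new

    def collect(self):             # action: triggers the "job"
        out = self._data
        for kind, fn in self._ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # [20, 30, 40]
```

In real Spark, `collect()` would additionally split the recorded operator chain into stages at shuffle dependencies; this sketch only shows the lazy-until-action behavior.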
Posted by FeeBle on Fri, 15 Nov 2019 22:22:07 -0800
After aggregating buckets, obtain the total number of buckets
By default, Elasticsearch returns only 10 buckets (size=10) after a bucket aggregation. So if the aggregation produces many buckets, how do you get the total bucket count?
In other words, how do you find out how many buckets the data was actually split into?
A colleague says that you can find out how many buckets ...
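One common approach is to add a `cardinality` aggregation alongside the `terms` aggregation: it returns an approximate count of distinct values of the field, which equals the total number of buckets. Below is a sketch that builds such a request body as a Python dict; the field name is made up for illustration:

```python
# Build an Elasticsearch request body that returns the first `size` buckets
# AND an (approximate) total bucket count via a cardinality aggregation.
def bucket_count_query(field, size=10):
    return {
        "size": 0,                            # no hits, aggregations only
        "aggs": {
            "by_field": {                     # the usual terms bucketing
                "terms": {"field": field, "size": size}
            },
            "total_buckets": {                # distinct-value count == bucket count
                "cardinality": {"field": field}
            },
        },
    }

query = bucket_count_query("city.keyword", size=10)
```

Note that `cardinality` is approximate (HyperLogLog-based) on large data sets, which is usually acceptable for a "how many buckets" question.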
Posted by mattachoo on Fri, 15 Nov 2019 11:21:32 -0800
python grabs the list of fantasy novels and stores it in excel
Using requests to get information from novel web pages
Parsing with BeautifulSoup
Extract and store valid information
Using xlwt module to create Excel
Finally, get Excel data
Using requests to get information from novel web pages
First, import the requests li ...
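The article uses requests + BeautifulSoup + xlwt; as a self-contained sketch of just the "extract valid information" step, the stdlib `html.parser` can do the same job. The `<a class="book-title">` page structure below is a made-up assumption, not the real novel site's markup:

```python
from html.parser import HTMLParser

# Stdlib-only sketch of the extraction step: pull novel titles out of HTML,
# assuming each title sits in an <a class="book-title"> tag (illustrative).
class TitleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and ("class", "book-title") in attrs:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data.strip())
            self._in_title = False

sample = ('<ul><li><a class="book-title">A Record of a Mortal</a></li>'
          '<li><a class="book-title">Battle Through the Heavens</a></li></ul>')
parser = TitleExtractor()
parser.feed(sample)
print(parser.titles)  # ['A Record of a Mortal', 'Battle Through the Heavens']
```

The extracted list would then be written row by row into the Excel sheet with xlwt, as the article goes on to describe.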
Posted by azunoman on Wed, 13 Nov 2019 11:18:44 -0800
Troubleshooting Spark error -- Error initializing SparkContext
Spark reported an error when submitting a Spark job:
./spark-shell
19/05/14 05:37:40 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel) ...
Posted by motofzr1000 on Sun, 10 Nov 2019 08:02:46 -0800
Take you through the art of algorithms
What is an algorithm? Equations should be familiar: solving an equation yields the correct value of the unknown, and we can loosely think of solving an equation as an algorithm. Of course, algorithms are more than that. Don't worry, I'll walk you through it. Let's start with two pieces of code:
Both pieces of code can be called algorithms because they solve ...
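The excerpt cuts off before the two snippets themselves. A classic pair used to make exactly this point (an illustrative guess, not necessarily the author's actual code) is computing 1 + 2 + ... + n in two different ways:

```python
# Two algorithms for the same problem: sum the integers from 1 to n.

def sum_loop(n):
    """O(n): add the numbers one by one."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_formula(n):
    """O(1): Gauss's closed form n*(n+1)/2."""
    return n * (n + 1) // 2

print(sum_loop(100), sum_formula(100))  # 5050 5050
```

Both are correct algorithms; they differ only in how much work they do, which is usually where such articles head next.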
Posted by koddos on Sat, 09 Nov 2019 10:07:20 -0800
Spark SQL uses beeline to access hive warehouse
I. add hive-site.xml
Add the hive-site.xml configuration file under $SPARK_HOME/conf so that Hive metadata can be accessed normally
vim hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.1.201:3306/hiveDB?createDatabaseIfNotExist=true ...
Posted by mrodrigues on Wed, 06 Nov 2019 14:06:19 -0800
I. HBase -- basic principles and use
Hotspot issues with HBase data:
The solution is to preprocess the rowkeys of the hot data, adding a prefix so that the hot data is spread across multiple regions.
Pre-splitting: when the table first receives data, the data should be partitioned and stored in different regions so that the load is balanced.
For example, it is easy to divide ...
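The prefixing idea above is often called rowkey "salting" and can be sketched in a few lines. The bucket count and key format here are illustrative assumptions, not a recommendation for any particular table:

```python
import hashlib

# Rowkey "salting" sketch: prefix each hot rowkey with a bucket number derived
# from its hash, so otherwise-consecutive keys spread over several regions.
NUM_BUCKETS = 8  # illustrative; would match the table's pre-split regions

def salted_rowkey(rowkey):
    digest = hashlib.md5(rowkey.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return f"{bucket:02d}_{rowkey}"         # e.g. "03_user123_1001"

keys = [salted_rowkey(f"user123_{ts}") for ts in (1001, 1002, 1003)]
```

The trade-off is that range scans on the original key now require one scan per salt bucket, which is why the prefix space is kept small and fixed.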
Posted by daniel_mintz on Mon, 04 Nov 2019 16:20:41 -0800
GaussDB 200 uses GDS to import data from a remote server
GaussDB 200 supports importing data in TEXT, CSV, and FIXED formats from a remote server into the cluster. This article introduces using the GDS (Gauss Data Service) tool to import data from a remote server into GaussDB 200. The environment is as follows:
1. Prepare source data
Here, export the source data from the PostgreSQL database using the copy command ...
Posted by phppssh on Sun, 03 Nov 2019 23:13:09 -0800