Getting Started with Python Web Crawlers


Python Web Crawler (1) Getting Started

Libraries used: requests, BeautifulSoup4, tqdm, html5lib (all installed via pip)

python version: 3.8

Development environment: Jupyter Notebook (Anaconda3)

Browser: Chrome

Knowledge involved: HTML (Hypertext Markup Language)

1. What Is a Web Crawler

1. Web Crawler (Web Spider)

A web crawler is a program that automatically browses web pages and collects the required information. It downloads web page data from the Internet for search engines and is an important component of them.

  • The crawler starts from the URL of an initial page and extracts the URLs found on that page;
  • while crawling, it continually pulls new URLs from the current page and adds them to the queue;
  • until the stopping condition given by the system is satisfied.
The Internet can be viewed as a graph:

  • Each node is a web page
  • Each edge is a hyperlink
  • A web crawler traverses this graph and grabs the content of interest from it

2. Web Page Grabbing Strategy

Generally speaking, web crawling strategies can be divided into three categories:

  • Breadth-first:

    During crawling, the next level of pages is searched only after the current level has been completely processed.

    Features: the algorithm is relatively simple to design and implement. Underlying idea: a web page within a certain link distance of the initial URL has a high probability of being topic-related.

  • Best-first:

    The best-first search strategy uses a web page analysis algorithm to predict how similar or relevant candidate URLs are to the target page, and selects the one or more best-rated URLs to crawl next. It only visits pages that the page analysis algorithm predicts to be "useful".

    Features: best-first is a locally optimal search, so many relevant pages along the crawl path may be missed.

  • Depth-first:

    Starting from the start page, the crawler picks one URL and follows it, then picks a URL on that page and follows it again, grabbing one link after another along a single path before backtracking to process the next route.

    Features: the algorithm is simple to design, but page value (e.g. PageRank) tends to decrease with every level of depth, so this strategy is rarely used.

Depth-first crawling often leads the crawler into traps, so the most commonly used strategies are breadth-first and best-first (see the sketch below).
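The three strategies differ only in how the URL frontier is ordered. Here is a minimal sketch of that idea: get_links is a placeholder for "download the page and extract its links", and score stands for whatever page-analysis heuristic a best-first crawler would use; both are assumptions, not part of the original article.

from collections import deque
import heapq

def crawl(seed_url, get_links, strategy='breadth', score=None, max_pages=100):
    """Toy frontier management for the three crawling strategies."""
    if strategy == 'best':
        frontier = [(0, seed_url)]            # priority queue of (-score, url)
    else:
        frontier = deque([seed_url])          # plain queue / stack
    visited = set()
    while frontier and len(visited) < max_pages:
        if strategy == 'breadth':
            url = frontier.popleft()          # FIFO: finish one level before the next
        elif strategy == 'depth':
            url = frontier.pop()              # LIFO: follow one path all the way down
        else:
            _, url = heapq.heappop(frontier)  # best-first: most promising URL first
        if url in visited:
            continue
        visited.add(url)
        for link in get_links(url):
            if link in visited:
                continue
            if strategy == 'best':
                heapq.heappush(frontier, (-score(link), link))  # higher score = earlier
            else:
                frontier.append(link)
    return visited

Swapping only the way URLs are taken out of the frontier is enough to change the crawl order.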

3. Classification of Web Crawlers

Generally speaking, web crawlers can be divided into the following four categories:

  • General-purpose web crawler:

    Also known as a Scalable Web Crawler, it expands its crawl from a set of seed URLs to the entire Web, collecting data for portal-site search engines and large Web service providers.

    A general-purpose crawler starts from one or more preset seed URLs, obtains the list of URLs on the initial pages, and keeps taking URLs from the URL queue to visit and download as it crawls.

  • Incremental crawler:

    An Incremental Web Crawler incrementally updates the pages it has already downloaded and crawls only newly generated or changed pages. It ensures, to some extent, that the crawled pages are as fresh as possible.

    Incremental crawlers have two goals:

    1. Keep the locally stored pages up to date
    2. Improve the quality of the locally stored pages

    Mainstream commercial search engines such as Google and Baidu are essentially incremental crawlers (a minimal change-detection sketch follows this list).

  • Vertical crawler:

    Also known as a Focused Crawler or Topical Crawler, it selectively crawls pages related to predefined topics, such as e-mail addresses, e-books, or commodity prices.

    The key to the crawling strategy is how the importance of page content and links is evaluated: different evaluation methods produce different importance values and therefore a different visiting order of the links.

  • Deep Web crawler:

    Crawls web pages that are hidden behind search forms and can only be reached by submitting keywords.

    The most important part of Deep Web crawling is form filling, which comes in two types:

    1. Form filling based on domain knowledge
    2. Form filling based on Web page structure analysis
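A common trick behind the incremental case mentioned above is to remember a fingerprint of each downloaded page and re-process it only when the fingerprint changes. A minimal sketch, where the names fingerprints and page_changed are purely illustrative and the fingerprints live in an in-memory dict (a real crawler would persist them in a database):

import hashlib

fingerprints = {}   # url -> MD5 digest of the page content seen on the last crawl

def page_changed(url, html):
    """Return True if the page is new or its content changed since the last crawl."""
    digest = hashlib.md5(html.encode('utf-8')).hexdigest()
    if fingerprints.get(url) == digest:
        return False              # unchanged: an incremental crawler can skip it
    fingerprints[url] = digest    # remember the new fingerprint
    return True                   # new or updated page: download/parse/store it again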

4. Scope of use

  • As the web collector of a search engine, crawling the entire Internet, e.g. Google and Baidu;
  • as a vertical search engine, crawling information on specific topics, e.g. the video site Bilibili or the image site Pixiv;
  • as a testing tool for a website's front end, to assess the robustness of the front-end code.

4.1 Web Crawler Legality (the Robots Protocol)

Robots protocol:

Also known as the Robot Protocol or Crawler Protocol, it specifies the scope of web content that search engines may grab: whether a site wants to be crawled at all and which content must not be crawled, so that web crawlers "consciously" crawl or skip the corresponding content. Since its introduction, the Robots protocol has become an international convention by which websites protect their sensitive data and their users' privacy.

  • The Robots protocol is implemented through a robots.txt file
  • The robots.txt file should be placed in the site's root directory
  • When a crawler visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the crawler follows the rules in the file to determine what it may access; if the file does not exist, crawlers may access all pages on the site that are not protected by a password.

Robots.txt file syntax:

  • Syntax:

    User-agent: *|agent_name — the wildcard * stands for all search engines; a specific agent name limits the rule to that crawler

    Disallow: /dir_name/ — forbids crawling the files under the dir_name directory

    Allow: /dir_name/ — allows crawling the entire dir_name directory

  • Example:

    User-agent: Googlebot

    Allow: /folder1/myfile.html

    Disallow: /folder1/

    For Googlebot, only the myfile.html page inside the folder1 directory may be crawled; the rest of folder1 is off limits.
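Python's standard library can evaluate such rules directly. A small sketch, assuming the example rules above were served at example.com/robots.txt:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://example.com/robots.txt')   # assumed location of the rules above
rp.read()                                     # download and parse robots.txt

# With the example rules, Googlebot may fetch myfile.html but nothing else in folder1
print(rp.can_fetch('Googlebot', 'http://example.com/folder1/myfile.html'))
print(rp.can_fetch('Googlebot', 'http://example.com/folder1/other.html'))

In practice a polite crawler runs such a check before every request.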

5. Basic architecture of Crawlers

Web crawlers typically consist of four modules:

  • URL Management Module
  • Download module
  • Parsing Module
  • Storage module
Basic crawler architecture:
  1. Seed URLs: the crawler's starting point, usually a set of URLs

  2. URL queue: the URL management module, responsible for managing and scheduling all URLs

  3. Download module: fetches web pages; for efficiency, several downloaders usually run in parallel

  4. Parsing module: parses valuable information out of the downloaded pages and adds newly discovered URLs to the URL queue

  5. Storage module: stores the data in a storage medium, usually a file or database

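The division of labour between these modules can be sketched in a few lines of Python. This is only a rough sketch: the "valuable information" here is just the page title, and the results.txt file name is made up.

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def download(url):
    """Download module: fetch the raw HTML of one page."""
    r = requests.get(url, timeout=5)
    r.raise_for_status()
    return r.text

def parse(url, html):
    """Parsing module: extract the data of interest and the newly discovered URLs."""
    soup = BeautifulSoup(html, 'html5lib')
    data = soup.title.string if soup.title else ''
    new_urls = [urljoin(url, a['href']) for a in soup.find_all('a', href=True)]
    return data, new_urls

def store(data):
    """Storage module: append the data to a file (a database would also work)."""
    with open('results.txt', 'a', encoding='utf-8') as f:
        f.write(str(data) + '\n')

def crawl(seed_urls, max_pages=20):
    """URL management module: schedule URLs and drive the other three modules."""
    url_queue = deque(seed_urls)
    seen = set(seed_urls)
    fetched = 0
    while url_queue and fetched < max_pages:
        url = url_queue.popleft()
        try:
            html = download(url)
        except requests.RequestException:
            continue                      # skip pages that fail to download
        fetched += 1
        data, new_urls = parse(url, html)
        store(data)
        for u in new_urls:                # feed newly found URLs back into the queue
            if u not in seen:
                seen.add(u)
                url_queue.append(u)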

2. Code Example 1: Crawling Problem Data from an ACM Problem Website

1. Install necessary packages

  • pip install requests
  • pip install BeautifulSoup4
  • pip install tqdm
  • pip install html5lib

2. Analyzing the website

Open the ACM problem website of Nanyang Institute of Technology, http://www.51mxd.cn/ , right-click a problem and choose Inspect; the browser jumps directly to the corresponding place in the page source.

Most of the problem information sits in <td> tags, so we crawl the content of all <td> tags.

The problem list is also split across many pages, so we need the URL pattern of each page. Switching back and forth between page 1 and page 2 shows that the addresses have the following format:

http://www.51mxd.cn/problemset.php-page=1.htm

problemset.php-page=X.htm stands for page X, so the request is built like this:

r = requests.get(f'http://www.51mxd.cn/problemset.php-page={pages}.htm', headers=Headers)

3. Code

import requests
from bs4 import BeautifulSoup
import csv
from tqdm import tqdm

# Simulate browser access (this User-Agent string even includes QQBrowser)
Headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400'}

# Table header
csvHeaders = ['Problem ID', 'Difficulty', 'Title', 'Pass rate', 'Passed/Total submissions']

# Topic Data
subjects = []

# Crawl Topics
print('Title information crawling:\n')
for pages in tqdm(range(1, 1 + 1)):  # only one page is crawled here; raise the upper bound to fetch more pages

    # Request the page, passing the fake User-Agent via the headers parameter
    r = requests.get(f'http://www.51mxd.cn/problemset.php-page={pages}.htm', headers=Headers)
    # Raise an exception if the request failed
    r.raise_for_status()
    # Set the text encoding
    r.encoding = 'utf-8'
    # Create BeautifulSoup object
    soup = BeautifulSoup(r.text, 'html5lib')
    # Find all td Tags
    td = soup.find_all('td')

    subject = []

    for t in td:
        # Collect non-empty cell values; every 5 cells form one problem row
        if t.string is not None:
            subject.append(t.string)
            if len(subject) == 5:
                subjects.append(subject)
                subject = []

# Store the problems in a CSV file (utf-8 so that Chinese titles are saved correctly)
with open('NYOJ_Subjects.csv', 'w', newline='', encoding='utf-8') as file:
    fileWriter = csv.writer(file)
    fileWriter.writerow(csvHeaders)
    fileWriter.writerows(subjects)

print('\n Title information crawl completed!!!')

4. Operation effect:

The contents retrieved from the <td> tags are stored in the CSV file.
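To sanity-check the output, the CSV can be read back with the same csv module. A quick sketch, using the file name from the script above:

import csv

with open('NYOJ_Subjects.csv', newline='', encoding='utf-8') as file:
    for i, row in enumerate(csv.reader(file)):
        print(row)        # the header row first, then one problem per row
        if i >= 3:        # only peek at the first few rows
            break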

3. Code Example 2: Crawling Information Notices from a News Website

1. Analyzing the website

Open the news/notice page of Chongqing Jiaotong University: http://news.cqjtu.edu.cn/xxtz.htm

Right-click any news headline and choose Inspect to jump directly to its source code in the page; that is what we need to crawl.

Both the headline and the release time sit in <div> tags, which are nested inside a <li> tag. So we can find all the <li> tags and then pick out the appropriate <div> tags inside each of them.

Also note that pages have page numbers: http://news.cqjtu.edu.cn/xxtz/{pages}.htm
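For reference, the <li>/<div> structure described above can also be located with CSS selectors instead of nested find_all calls. A sketch for a single page: the class names time and right-title and the target="_blank" link come from the page structure described above, and the shortened User-Agent is just a placeholder.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}   # placeholder User-Agent
r = requests.get('http://news.cqjtu.edu.cn/xxtz.htm', headers=headers, timeout=5)
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'html5lib')

for li in soup.select('li'):
    time_div = li.select_one('div.time')                               # release time
    title_link = li.select_one('div.right-title a[target="_blank"]')   # headline link
    if time_div and title_link:
        print(time_div.get_text(strip=True), title_link.get_text(strip=True))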

2. Code

import requests
from bs4 import BeautifulSoup
import csv
from tqdm import tqdm
import urllib.request, urllib.error  # Build the request URL and fetch the page data

# All news
subjects = []

# Simulate browser access
Headers = {  # Simulate browser header information
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.53"
}

# Table Head
csvHeaders = ['Time', 'Title']


print('Information crawling:\n')
for pages in tqdm(range(1, 65 + 1)):
    # Make a request
    request = urllib.request.Request(f'http://news.cqjtu.edu.cn/xxtz/{pages}.htm', headers=Headers)
    html = ""
    # Get page content if request succeeds
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    # Parse Web Page
    soup = BeautifulSoup(html, 'html5lib')

    # Store a news item
    subject = []
    # Find all li Tags
    li = soup.find_all('li')
    for l in li:
        # find_all returns a (possibly empty) list, so test it for truthiness
        if l.find_all('div', class_="time") and l.find_all('div', class_="right-title"):
            # time
            for time in l.find_all('div',class_="time"):
                subject.append(time.string)
            # Title
            for title in l.find_all('div',class_="right-title"):
                for t in title.find_all('a',target="_blank"):
                    subject.append(t.string)
            if subject:
                print(subject)
                subjects.append(subject)
        subject = []

# Save the data (utf-8 so that Chinese titles are written correctly)
with open('CQJTUnews.csv', 'w', newline='', encoding='utf-8') as file:
    fileWriter = csv.writer(file)
    fileWriter.writerow(csvHeaders)
    fileWriter.writerows(subjects)

print('\n Information crawl complete!!!')

3. Operation effect

The crawled content is saved in a CSV file.

4. Summary

With a crawler we can easily obtain up-to-date information without having to watch a page manually. If the program keeps running on a cloud server and forwards newly received notices to a mobile device in real time, it effectively becomes a simple news app.

Learning crawlers has two prerequisites: mastering basic Python syntax, and mastering basic front-end knowledge, i.e. HTML (Hypertext Markup Language).

5. Reference Articles

Dreaming: Python simple crawler

Immersed in Thousands of Dreams: Crawling the content of Chongqing Jiaotong University News Network based on python
