Study notes, day 004 --- reposted from "Building a search engine with a Python distributed crawler (Scrapy)"

Keywords: Python Mobile encoding network

Section 327: Introduction to web crawlers, part 2 - crawling with the urllib library

Write a simple crawler using urllib, which ships with Python's standard library

urlopen() opens a URL and returns a response object
read() reads the HTML source from the response as bytes
decode("utf-8") converts the bytes to a string

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html').read().decode("utf-8")   # Fetch the page and decode it into a string
print(html)

The first lines of the printed source:

<!DOCTYPE html>
<html lang="zh-CN">
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta name="csrf-param" content="_csrf">
    <meta name="csrf-token" content="X1pZZnpKWnQAIGkLFisPFT4jLlJNIWMHHWM6HBBnbiwPbz4/LH1pWQ==">
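
To see what each step in the chain returns without touching the network, the sketch below feeds urlopen() a data: URL, which urllib.request can open offline. The sample markup is made up for illustration and is not taken from the course page.

```python
import urllib.parse
import urllib.request

# Hypothetical sample page, embedded in a data: URL so no network is needed.
sample = "<html><title>手机</title></html>"
url = "data:text/html;charset=utf-8," + urllib.parse.quote(sample)

response = urllib.request.urlopen(url)   # returns a response object
raw = response.read()                    # read() returns bytes
text = raw.decode("utf-8")               # decode() turns the bytes into str

# The byte count is larger than the character count because each
# Chinese character takes three bytes in UTF-8.
print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
```

The same three calls apply unchanged to an http:// URL; only the source of the bytes differs.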

Use a regular expression to extract specific content from the page

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import re
html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html').read().decode("utf-8")  # Get the HTML source
pat = r"51CTO college Python Actual combat group\((\d*?)\)"      # Regex rule: capture the QQ group number inside the parentheses
rst = re.compile(pat).findall(html)
print(rst)
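
As an offline illustration of how the capture group works, the sketch below runs the same kind of pattern against a short, made-up HTML fragment. The group label and the numbers are hypothetical.

```python
import re

# Hypothetical fragment standing in for the downloaded page source.
html = '<p>Python study group(123456)</p><p>Other group(654321)</p>'

# \( and \) match literal parentheses; (\d+) captures the digits between them.
pat = re.compile(r"Python study group\((\d+)\)")
rst = pat.findall(html)   # findall() returns only the captured group

print(rst)
```

Because the pattern contains one capture group, findall() returns the digits alone rather than the whole matched text.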


urlretrieve() downloads a network file and saves it locally. Parameter 1 is the file's URL, and parameter 2 is the save path

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from urllib import request
import re
import os

file_path = os.path.join(os.getcwd(), '222.html')    # Build the save path for the downloaded file
# print(file_path)
request.urlretrieve('http://edu.51cto.com/course/8360.html', file_path)  # Download the file and save it to that path
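
The two parameters can be tried offline by pointing urlretrieve() at a local file:// URL. This is a minimal sketch; the temporary directory and the file names are assumptions made for the demonstration.

```python
import os
import pathlib
import tempfile
from urllib import request

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical source file that plays the role of the remote page.
    source = pathlib.Path(tmp, "source.html")
    source.write_text("<html>hello</html>", encoding="utf-8")

    # os.path.join() takes the path parts as separate arguments.
    file_path = os.path.join(tmp, "222.html")

    # Parameter 1 is the URL, parameter 2 is the save path.
    saved_path, headers = request.urlretrieve(source.as_uri(), file_path)

    with open(saved_path, encoding="utf-8") as f:
        content = f.read()

print(os.path.basename(saved_path), content)
```

urlretrieve() returns the save path together with the response headers, so the caller can confirm where the file ended up.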

urlcleanup() clears the temporary files left behind by earlier urlretrieve() calls

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from urllib import request
import re
import os

file_path = os.path.join(os.getcwd(), '222.html')    # Build the save path for the downloaded file
# print(file_path)
request.urlretrieve('http://edu.51cto.com/course/8360.html', file_path)  # Download the file and save it to that path
request.urlcleanup()                                  # Remove temporary files left by urlretrieve()
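
urlcleanup() matters mostly when urlretrieve() is called without a save path: in that case the download goes to a temporary file, and urlcleanup() deletes it. A minimal offline sketch, using a data: URL as a stand-in for a real page:

```python
import os
from urllib import request

# Hypothetical stand-in for a remote page; a data: URL needs no network.
url = "data:text/plain;charset=utf-8,hello"

# With no save path, urlretrieve() writes to a temporary file.
temp_path, headers = request.urlretrieve(url)
existed_before = os.path.exists(temp_path)

request.urlcleanup()                       # delete temporary files left by urlretrieve()
exists_after = os.path.exists(temp_path)

print(existed_before, exists_after)
```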

info() returns the response headers of the fetched page

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import re
html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html')   # Open the URL; returns a response object
a = html.info()   # Response headers
print(a)

# C:\Users\admin\AppData\Local\Programs\Python\Python35\python.exe H:/py/15/chshi.py
# Date: Tue, 25 Jul 2017 16:08:17 GMT
# Content-Type: text/html; charset=UTF-8
# Transfer-Encoding: chunked
# Connection: close
# Set-Cookie: aliyungf_tc=AQAAALB8CzAikwwA9aReq63oa31pNIez; Path=/; HttpOnly
# Server: Tengine
# Vary: Accept-Encoding
# Vary: Accept-Encoding
# Vary: Accept-Encoding
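
The object returned by info() behaves like an email-style message, so individual headers can be looked up by name. The sketch below parses a few of the header lines shown above with email.message_from_string() as an offline stand-in for a live response.

```python
import email

# A few header lines copied from the output above, parsed offline.
raw_headers = (
    "Content-Type: text/html; charset=UTF-8\n"
    "Connection: close\n"
    "Server: Tengine\n"
    "\n"
)
headers = email.message_from_string(raw_headers)

server = headers["Server"]                  # look up one header by name
content_type = headers.get_content_type()   # media type without parameters
charset = headers.get_content_charset()     # charset parameter, lowercased

print(server, content_type, charset)
```

With a live response, the value of html.info() supports the same lookups, which lets decode() use the charset the server declares instead of a hard-coded one.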

getcode() returns the HTTP status code of the response

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import re
html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html')   # Open the URL; returns a response object
a = html.getcode()  # Get the status code, e.g. 200
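
Note that urlopen() raises urllib.error.HTTPError for error responses such as 404, so getcode() is only reached when the request succeeds. A minimal sketch of reading the status in both cases follows; get_status and its opener parameter are hypothetical names, and the opener is injectable only so the example can run offline.

```python
import urllib.error
import urllib.request


def get_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, including error codes."""
    try:
        response = opener(url)
        return response.getcode()
    except urllib.error.HTTPError as e:
        return e.code          # e.g. 404 or 500


# Offline demonstration with a fake opener that always fails with 404.
def fake_opener(url):
    raise urllib.error.HTTPError(url, 404, "Not Found", None, None)


status = get_status("http://example.invalid/missing", opener=fake_opener)
print(status)
```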


geturl() gets the URL of the currently fetched page

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import re
html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html')   # Open the URL; returns a response object
a = html.geturl()  # Get the URL of the page that was actually fetched


timeout sets the fetch timeout, in seconds

When a page is crawled, the remote server may respond too slowly or not respond at all. Set a timeout period; if it is exceeded, urlopen() raises an error and the page is not crawled

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import re
html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html',timeout=30)   # Open the URL, waiting at most 30 seconds
a = html.geturl()  #Get the URL of the currently fetched page
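
When the timeout is exceeded, urlopen() raises an exception rather than returning, so a crawler usually wraps the call. A minimal sketch follows; fetch_or_none and its opener parameter are hypothetical names, and the fake opener exists only so the example runs offline.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=30, opener=urllib.request.urlopen):
    """Return the page bytes, or None if the request times out or fails."""
    try:
        return opener(url, timeout=timeout).read()
    except (urllib.error.URLError, socket.timeout) as e:
        print("skipped", url, "-", e)
        return None


# Offline demonstration: a fake opener that simulates a slow server.
def slow_opener(url, timeout):
    raise socket.timeout("timed out")


result = fetch_or_none("http://example.invalid/", timeout=1, opener=slow_opener)
```

Returning None lets the crawl loop move on to the next URL instead of stopping at the first slow server.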


Automatically simulating HTTP requests

The most common HTTP requests are GET requests and POST requests

GET request

For example, 360 Search uses a GET request to pass the user's search keywords to the server

So we can simulate that HTTP request and construct the keyword query automatically

quote() percent-encodes the keyword so that it is safe to put in a URL. By default, a URL cannot contain Chinese characters

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import urllib.parse
import re
gjc = "Mobile phone"     # Set the keyword
gjc = urllib.parse.quote(gjc)           # Percent-encode the keyword; a URL cannot contain Chinese characters or spaces as-is
url = "https://www.so.com/s?q="+gjc     # Construct the URL
# print(url)
html = urllib.request.urlopen(url).read().decode("utf-8")  # Get the HTML source
pat = r"(\w*<em>\w*</em>\w*)"           # Regex that matches titles containing the highlighted keyword
rst = re.compile(pat).findall(html)
# print(rst)
for i in rst:
    print(i)                            # Print each matched title

    # Official website <em>mobile phone</em>
    # Official website <em>mobile phone</em>
    # Official website <em>mobile phone</em> such a low price
    # Big brand <em>mobile phone</em> low price competition
    # <em>mobile</em>
    # Taobao recommends <em>mobile phone</em>
    # <em>mobile</em>
    # <em>mobile</em>
    # <em>mobile</em>
    # <em>mobile</em>
    # Su Ningyi purchases <em>mobile phone</em>
    # Buy <em>mobile phone</em>
    # Buy <em>mobile phone</em>
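
To see what quote() actually produces, the sketch below encodes a Chinese keyword and an English one offline. The keyword 手机 ("mobile phone") is an assumed example; urllib.parse is where quote() is documented.

```python
import urllib.parse

encoded_cn = urllib.parse.quote("手机")            # non-ASCII text becomes %XX bytes
encoded_en = urllib.parse.quote("Mobile phone")    # a space becomes %20
query = urllib.parse.urlencode({"q": "手机"})       # builds a whole key=value pair

url = "https://www.so.com/s?" + query
decoded = urllib.parse.unquote(encoded_cn)         # reverse the encoding

print(encoded_cn)   # %E6%89%8B%E6%9C%BA
print(encoded_en)   # Mobile%20phone
print(url)          # https://www.so.com/s?q=%E6%89%8B%E6%9C%BA
```

urlencode() is convenient once a query has more than one parameter, since it quotes every value and joins the pairs with & in one call.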

POST request

urlencode() encodes the form data submitted by the POST request. Its parameter is a dictionary of form fields as key-value pairs
Request() builds the POST request. Parameter 1 is the URL, and parameter 2 is the encoded form data, as bytes

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import urllib.parse

posturl = "http://www.iqianyue.com/mypost/"
shuju = urllib.parse.urlencode({                # urlencode() encodes the form data; its parameter is a dictionary of key-value pairs
    'name': '123',
    'pass': '456'
}).encode('utf-8')                              # POST data must be bytes, not str
req = urllib.request.Request(posturl,shuju)     # Request() builds the POST request. Parameter 1 is the URL, and parameter 2 is the encoded form data
html = urllib.request.urlopen(req).read().decode("utf-8")  # Get the page returned by the POST request
print(html)
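
The request can be inspected before it is sent, which makes the two steps easier to follow. This offline sketch only builds the request object and never contacts the server.

```python
import urllib.parse
import urllib.request

posturl = "http://www.iqianyue.com/mypost/"
form = {"name": "123", "pass": "456"}

encoded = urllib.parse.urlencode(form)      # str: key=value pairs joined by &
data = encoded.encode("utf-8")              # Request() needs bytes, not str

req = urllib.request.Request(posturl, data)
method = req.get_method()                   # POST, because data was supplied

print(encoded, method)
```

Supplying data is what switches the request from GET to POST; a Request built without it would report GET.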

Posted by _DarkLink_ on Mon, 17 Feb 2020 00:38:08 -0800