Web crawler explanation: urllib library crawler basics, timeout settings, and automatic simulation of HTTP requests

Keywords: Python, mobile, encoding, network

Writing a simple crawler with Python's built-in urllib library

urlopen() opens a URL and returns the response object
read() reads out the HTML source as bytes
decode("utf-8") converts the bytes into a string

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html').read().decode("utf-8")
print(html)
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta name="csrf-param" content="_csrf">
    <meta name="csrf-token" content="X1pZZnpKWnQAIGkLFisPFT4jLlJNIWMHHWM6HBBnbiwPbz4/LH1pWQ==">

Using a regular expression to extract specified content from the page

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import re
html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html').read().decode("utf-8")  # Get the HTML source code
pat = r"51CTO college Python Actual battle group\((\d*?)\)"      # Regex rule: capture the QQ group number (the literal text must match the group name as it appears on the page)
rst = re.compile(pat).findall(html)
print(rst)

#['325935753']

urlretrieve() downloads a network file and saves it locally; parameter 1 is the file's URL, parameter 2 is the save path


#!/usr/bin/env python
# -*- coding:utf-8 -*-
from urllib import request
import re
import os

file_path = os.path.join(os.getcwd(), '222.html')    # Build the file save path
# print(file_path)
request.urlretrieve('http://edu.51cto.com/course/8360.html', file_path)  # Download the file and save it to the specified path

urlcleanup() cleans up the temporary files left behind by earlier urlretrieve() calls

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from urllib import request
import re
import os

file_path = os.path.join(os.getcwd(), '222.html')    # Build the file save path
# print(file_path)
request.urlretrieve('http://edu.51cto.com/course/8360.html', file_path)  # Download the file and save it to the specified path
request.urlcleanup()

info() returns the header information of the crawled page

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import re
html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html')  # Get the response object
a = html.info()
print(a)

# Date: Tue, 25 Jul 2017 16:08:17 GMT
# Content-Type: text/html; charset=UTF-8
# Transfer-Encoding: chunked
# Connection: close
# Set-Cookie: aliyungf_tc=AQAAALB8CzAikwwA9aReq63oa31pNIez; Path=/; HttpOnly
# Server: Tengine
# Vary: Accept-Encoding
# Vary: Accept-Encoding
# Vary: Accept-Encoding
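
info() returns an http.client.HTTPMessage, so individual headers can also be read by name. A minimal sketch (the header names are just the common ones shown in the output above):

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html')
headers = html.info()                  # http.client.HTTPMessage
print(headers.get('Content-Type'))     # e.g. text/html; charset=UTF-8
print(headers.get_content_charset())   # e.g. utf-8, handy for decode()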

getcode() gets the HTTP status code of the response

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import re
html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html')  # Get the response object
a = html.getcode()  #Get the status code
print(a)

#200
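
Since getcode() returns the numeric status, it can guard the parsing step. A minimal sketch (note that urlopen() itself already raises urllib.error.HTTPError for most non-2xx responses):

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html')
if html.getcode() == 200:   # Only parse when the request succeeded
    print(len(html.read().decode("utf-8")))
else:
    print("Unexpected status:", html.getcode())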

geturl() gets the URL of the page that was actually retrieved

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import re
html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html')  # Get the response object
a = html.geturl()  #Get the URL of the current crawled page
print(a)

#http://edu.51cto.com/course/8360.html
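
geturl() is mainly useful when the server redirects: the URL actually retrieved can differ from the one requested. A minimal sketch (the redirecting URL here is hypothetical):

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

requested = 'http://51cto.com/'        # Hypothetical: may redirect to another address
html = urllib.request.urlopen(requested)
if html.geturl() != requested:
    print("Redirected to:", html.geturl())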

timeout sets the fetch timeout, in seconds

When crawling a page, the remote server may respond slowly or not respond at all. Setting a timeout makes the request give up once that many seconds pass without a response.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import re
html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html', timeout=30)  # Give up if the server does not respond within 30 seconds
a = html.geturl()  #Get the URL of the current crawled page
print(a)

#http://edu.51cto.com/course/8360.html
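
If the server does not answer within the timeout, urlopen() raises an exception rather than returning. A minimal sketch of catching it (the 2-second timeout is only for illustration):

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import socket
import urllib.request
import urllib.error

try:
    html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html', timeout=2)
    print(html.getcode())
except urllib.error.URLError as e:   # Connection timeouts surface as URLError
    print("Request failed:", e.reason)
except socket.timeout:               # Timeouts during read() can raise socket.timeout directly
    print("Request timed out")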

Automatic simulation of http requests

The two most common HTTP request types are GET and POST requests.

GET request

For example, 360 Search fetches data through a GET request, passing the user's search keyword to the server.

So we can simulate 360 Search's HTTP request and construct the keyword request automatically.

quote() URL-encodes the keyword into characters that can be transmitted in a URL; by default a URL cannot contain Chinese characters.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import re
gjc = "Mobile phone"     # Set the search keyword
gjc = urllib.request.quote(gjc)         # URL-encode the keyword; a URL cannot contain raw Chinese characters
url = "https://www.so.com/s?q=" + gjc   # Construct the URL
# print(url)
html = urllib.request.urlopen(url).read().decode("utf-8")  # Get the HTML source code
pat = r"(\w*<em>\w*</em>\w*)"            # Regex to capture the matching result titles
rst = re.compile(pat).findall(html)
# print(rst)
for i in rst:
    print(i)                            #Loop out the captured title

# Official website <em>mobile phone</em>
# Official website <em>mobile phone</em>
# Official website <em>mobile phone</em> such a low price
# Big brand <em>mobile phone</em> low price race
# <em>Mobile phone</em>
# Taobao recommendation <em>mobile phone</em>
# <em>Mobile phone</em>
# <em>Mobile phone</em>
# <em>Mobile phone</em>
# <em>Mobile phone</em>
# Suning Easy to Buy <em>mobile phone</em>
# Buy <em>mobile phone</em>
# Buy <em>mobile phone</em>
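
Some sites reject urllib's default User-Agent. A minimal sketch that sends a browser-like User-Agent by wrapping the URL in a Request object (the header value is illustrative):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request

url = "https://www.so.com/s?q=" + urllib.request.quote("Mobile phone")
req = urllib.request.Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'   # Illustrative browser-like UA
})
html = urllib.request.urlopen(req).read().decode("utf-8")
print(len(html))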

POST request

urlencode() encodes the form data for a POST request from a dictionary of key-value pairs
Request() builds a POST request; parameter 1 is the URL address, parameter 2 is the encoded form data

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import urllib.parse

posturl = "http://www.iqianyue.com/mypost/"
shuju = urllib.parse.urlencode({        # urlencode() encodes the POST form data from a dictionary of key-value pairs
    'name': '123',
    'pass': '456'
    }).encode('utf-8')
req = urllib.request.Request(posturl, shuju)  # Request() builds a POST request: parameter 1 is the URL, parameter 2 is the encoded form data
html = urllib.request.urlopen(req).read().decode("utf-8")  # Get the page returned by the POST request
print(html)
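
The same POST can also be sent without building a Request object, by passing the encoded form data as urlopen()'s data parameter. A minimal sketch:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request
import urllib.parse

data = urllib.parse.urlencode({'name': '123', 'pass': '456'}).encode('utf-8')
html = urllib.request.urlopen("http://www.iqianyue.com/mypost/", data).read().decode("utf-8")
print(html)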


Posted by Hypnos on Mon, 12 Aug 2019 06:06:47 -0700