Section 327, web crawler introduction 2 - urllib library crawler
Write a simple crawler using Python's built-in urllib library
urlopen() opens a URL and returns a response object
read() reads the response body (the raw HTML bytes)
decode("utf-8") converts those bytes to a string
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html').read().decode("utf-8")
print(html)
<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta name="csrf-param" content="_csrf">
    <meta name="csrf-token" content="X1pZZnpKWnQAIGkLFisPFT4jLlJNIWMHHWM6HBBnbiwPbz4/LH1pWQ==">
Using a regular expression to extract specific content from a page
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import re

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html').read().decode("utf-8")  # Get the html source code
pat = r"51CTO college Python Actual combat group\((\d*?)\)"  # Regular rule: capture the QQ group number
rst = re.compile(pat).findall(html)
print(rst)  # ['325935753']
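The same approach works for any fragment the source exposes. A minimal sketch that pulls the page title instead (the pattern below is a hypothetical example, not from the original):

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import re

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html').read().decode("utf-8")
pat = r"<title>(.*?)</title>"  # hypothetical pattern: capture whatever sits inside the title tag
rst = re.compile(pat, re.S).findall(html)  # re.S lets . match newlines inside the tag
print(rst)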
urlretrieve() downloads a remote file and saves it locally. Parameter 1 is the file's URL; parameter 2 is the local save path
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from urllib import request
import os

file_path = os.path.join(os.getcwd(), '222.html')  # Build the save path for the file
# print(file_path)
request.urlretrieve('http://edu.51cto.com/course/8360.html', file_path)  # Download the page and save it to the specified path
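urlretrieve() also accepts an optional third argument, a progress callback invoked after each block is downloaded. A minimal sketch (the percentage math assumes the server reports a Content-Length):

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import os
from urllib import request

def progress(block_num, block_size, total_size):
    # Called by urlretrieve() after each block: blocks fetched so far, block size, total file size
    if total_size > 0:
        done = min(block_num * block_size, total_size)
        print("%.1f%%" % (done * 100 / total_size))

file_path = os.path.join(os.getcwd(), '222.html')
request.urlretrieve('http://edu.51cto.com/course/8360.html', file_path, progress)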
urlcleanup() clears the cache left behind by urlretrieve()
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from urllib import request
import os

file_path = os.path.join(os.getcwd(), '222.html')  # Build the save path for the file
# print(file_path)
request.urlretrieve('http://edu.51cto.com/course/8360.html', file_path)  # Download the page and save it to the specified path
request.urlcleanup()  # Clear the cache left behind by urlretrieve()
info() returns the response headers of the fetched page
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html')  # Open the page
a = html.info()  # Get the response headers
print(a)
# Date: Tue, 25 Jul 2017 16:08:17 GMT
# Content-Type: text/html; charset=UTF-8
# Transfer-Encoding: chunked
# Connection: close
# Set-Cookie: aliyungf_tc=AQAAALB8CzAikwwA9aReq63oa31pNIez; Path=/; HttpOnly
# Server: Tengine
# Vary: Accept-Encoding
# Vary: Accept-Encoding
# Vary: Accept-Encoding
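info() returns an http.client.HTTPMessage, so an individual header can be read with get(). A minimal sketch:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html')
headers = html.info()  # an http.client.HTTPMessage holding the response headers
print(headers.get('Content-Type'))  # e.g. text/html; charset=UTF-8
print(headers.get('Server'))        # e.g. Tengine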
getcode() gets the HTTP status code
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html')  # Open the page
a = html.getcode()  # Get the status code
print(a)  # 200
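Note that urlopen() only returns a response for a success status; for an error status it raises urllib.error.HTTPError, which carries the code itself. A minimal sketch (the missing-page URL is a hypothetical example):

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.error
import urllib.request

try:
    html = urllib.request.urlopen('http://edu.51cto.com/no-such-page')  # hypothetical URL that should 404
    print(html.getcode())
except urllib.error.HTTPError as e:
    print(e.code)  # the error status code, e.g. 404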
geturl() gets the URL of the currently fetched page
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html')  # Open the page
a = html.geturl()  # Get the URL of the currently fetched page
print(a)  # http://edu.51cto.com/course/8360.html
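geturl() matters mainly when the server redirects, because urlopen() follows redirects silently. A minimal sketch, assuming the shorter address below answers with a redirect (an assumption, not verified):

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com')  # assumption: this address redirects
print(html.geturl())  # the final URL after any redirects, which may differ from the one requested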
timeout sets the fetch timeout, in seconds
When crawling a page, the remote server may respond slowly or not at all. Setting a timeout means that if no response arrives within that many seconds, the page is skipped instead of blocking the crawler.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html', timeout=30)  # Open the page, give up after 30 seconds
a = html.geturl()  # Get the URL of the currently fetched page
print(a)  # http://edu.51cto.com/course/8360.html
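When the timeout is exceeded, urlopen() raises an exception rather than returning a page. A minimal sketch of catching it (the deliberately tiny timeout is just to force the error):

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import socket
import urllib.error
import urllib.request

try:
    html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html', timeout=0.01)  # tiny timeout to force a failure
    print(html.geturl())
except (urllib.error.URLError, socket.timeout) as e:
    print("fetch timed out or failed:", e)  # skip this page and move on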
Automatically simulating HTTP requests
The most common HTTP requests are GET requests and POST requests
GET request
For example, 360 Search fetches its results through a GET request, passing the user's search keywords to the server.
So we can simulate that HTTP request and construct keyword queries automatically.
quote() transcodes the keywords into characters a URL can carry; by default a URL cannot contain Chinese
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request
import re

gjc = "Mobile phone"  # Set the keyword
gjc = urllib.request.quote(gjc)  # Transcode the keyword; a URL cannot contain Chinese by default
url = "https://www.so.com/s?q=" + gjc  # Construct the url address
# print(url)
html = urllib.request.urlopen(url).read().decode("utf-8")  # Get the html source code
pat = r"(\w*<em>\w*</em>\w*)"  # Regular rule: capture related titles
rst = re.compile(pat).findall(html)
# print(rst)
for i in rst:
    print(i)  # Loop over the captured titles

# Official website <em>mobile phone</em>
# Official website <em>mobile phone</em>
# Official website <em>mobile phone</em> such a low price
# Big brand <em>mobile phone</em> low price competition
# <em>mobile</em>
# Taobao recommends <em>mobile phone</em>
# <em>mobile</em>
# <em>mobile</em>
# <em>mobile</em>
# <em>mobile</em>
# Suning purchases <em>mobile phone</em>
# Buy <em>mobile phone</em>
# Buy <em>mobile phone</em>
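To see the transcoding on its own: quote() percent-encodes the UTF-8 bytes of the keyword, and urllib.parse.unquote() reverses it. A minimal sketch, using the Chinese keyword 手机 ("mobile phone") as an example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib import parse

kw = parse.quote("手机")  # percent-encode the Chinese keyword
print(kw)                 # %E6%89%8B%E6%9C%BA
print(parse.unquote(kw))  # 手机  (unquote() restores the original text)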
POST request
urlencode() encodes the form data submitted with a POST request. Its parameter is a dictionary of key-value form fields
Request() builds the POST request: parameter 1 is the url address, parameter 2 is the encoded form data. urlopen() then submits it
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request
import urllib.parse

posturl = "http://www.iqianyue.com/mypost/"
shuju = urllib.parse.urlencode({  # urlencode() encodes the form data; the parameter is a dict of key-value fields
    'name': '123',
    'pass': '456'
}).encode('utf-8')
req = urllib.request.Request(posturl, shuju)  # Request() builds the POST request: parameter 1 is the url, parameter 2 is the encoded form data
html = urllib.request.urlopen(req).read().decode("utf-8")  # Get the page returned by the POST request
print(html)
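A POST can fail just like a GET. A minimal sketch of the same request with the usual urllib.error handling wrapped around it:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.error
import urllib.parse
import urllib.request

posturl = "http://www.iqianyue.com/mypost/"
shuju = urllib.parse.urlencode({'name': '123', 'pass': '456'}).encode('utf-8')
req = urllib.request.Request(posturl, shuju)
try:
    html = urllib.request.urlopen(req).read().decode("utf-8")
    print(html)
except urllib.error.HTTPError as e:
    print("server returned an error status:", e.code)
except urllib.error.URLError as e:
    print("request failed:", e.reason)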