Writing a simple crawler with Python's built-in urllib library
urlopen() opens a URL and gets its HTML source code
read() reads out the HTML source content
decode("utf-8") converts the bytes into a string
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html').read().decode("utf-8")
print(html)
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="csrf-param" content="_csrf">
<meta name="csrf-token" content="X1pZZnpKWnQAIGkLFisPFT4jLlJNIWMHHWM6HBBnbiwPbz4/LH1pWQ==">
Using regular expressions to extract specified content from a page
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import re

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html').read().decode("utf-8")  # Get the HTML source code
pat = r"51CTO college Python Actual battle group\((\d*?)\)"  # Regular expression to capture the QQ group number
rst = re.compile(pat).findall(html)
print(rst)  # ['325935753']
urlretrieve() downloads a network file and saves it locally; parameter 1 is the network file's URL, parameter 2 is the save path
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from urllib import request
import os

file_path = os.path.join(os.getcwd(), '222.html')  # Build the file save path
# print(file_path)
request.urlretrieve('http://edu.51cto.com/course/8360.html', file_path)  # Download the page and save it to the specified path
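urlretrieve() also accepts an optional reporthook callback, which is called after each block of data arrives and can be used to display download progress. A minimal sketch (the show_progress function name and its printout are our own, not part of the original example):

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from urllib import request
import os

def show_progress(block_num, block_size, total_size):
    # Called by urlretrieve() after each block is received
    downloaded = block_num * block_size
    if total_size > 0:  # total_size is -1 when the server does not report a length
        percent = min(100, downloaded * 100 / total_size)
        print("Downloaded %.1f%%" % percent)

file_path = os.path.join(os.getcwd(), '222.html')
request.urlretrieve('http://edu.51cto.com/course/8360.html', file_path, show_progress)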
urlcleanup() clears the cache left behind by the crawler (e.g. by urlretrieve())
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from urllib import request
import os

file_path = os.path.join(os.getcwd(), '222.html')  # Build the file save path
# print(file_path)
request.urlretrieve('http://edu.51cto.com/course/8360.html', file_path)  # Download the page and save it to the specified path
request.urlcleanup()  # Clear the cache left by urlretrieve()
info() views the header information of the crawled page
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html')  # Open the page
a = html.info()  # Get the response headers
print(a)
# Date: Tue, 25 Jul 2017 16:08:17 GMT
# Content-Type: text/html; charset=UTF-8
# Transfer-Encoding: chunked
# Connection: close
# Set-Cookie: aliyungf_tc=AQAAALB8CzAikwwA9aReq63oa31pNIez; Path=/; HttpOnly
# Server: Tengine
# Vary: Accept-Encoding
# Vary: Accept-Encoding
# Vary: Accept-Encoding
getcode() gets the status code
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html')  # Open the page
a = html.getcode()  # Get the status code
print(a)  # 200
geturl() gets the URL of the page actually fetched
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html')  # Open the page
a = html.geturl()  # Get the URL of the page actually fetched
print(a)  # http://edu.51cto.com/course/8360.html
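geturl() is mainly useful when the server redirects: it returns the address of the final page, which may differ from the one requested. A minimal sketch (whether this particular URL redirects is only an assumption for illustration):

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com')  # Suppose the server redirects this request
print(html.geturl())  # Prints the final URL after any redirects, not necessarily the one we asked for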
timeout sets the fetch timeout, in seconds
When crawling a page, the remote server may respond slowly or not respond at all. Set a timeout; once it is exceeded, the fetch is abandoned instead of waiting indefinitely.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html', timeout=30)  # Give up if there is no response within 30 seconds
a = html.geturl()
print(a)  # http://edu.51cto.com/course/8360.html
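Note that when the timeout is exceeded, urlopen() raises an exception rather than returning a response, so a robust crawler wraps the call in try/except. A minimal sketch, assuming we simply want to report and skip pages that time out:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import socket
import urllib.error
import urllib.request

try:
    html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html', timeout=5)
    print(html.getcode())
except socket.timeout:
    print("Read timed out, skipping this page")  # The read stalled past the timeout
except urllib.error.URLError as e:
    print("Fetch failed:", e.reason)  # Covers connection timeouts, DNS failures, etc.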
Automatically simulating HTTP requests
The two most commonly used HTTP request types are GET and POST.
GET request
For example, 360 Search fetches data through GET requests, passing the user's search keywords to the server. So we can simulate this HTTP request and construct keyword queries automatically.
quote() encodes keywords into characters that a URL can carry; by default, a URL cannot contain Chinese characters.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request
import re

gjc = "Mobile phone"  # Set the keyword
gjc = urllib.request.quote(gjc)  # URL-encode the keyword; spaces and Chinese characters are not allowed in a URL
url = "https://www.so.com/s?q=" + gjc  # Construct the URL
# print(url)
html = urllib.request.urlopen(url).read().decode("utf-8")  # Get the HTML source code
pat = r"(\w*<em>\w*</em>\w*)"  # Regular expression to capture the matching titles
rst = re.compile(pat).findall(html)
# print(rst)
for i in rst:
    print(i)  # Print each captured title
# Official website <em>mobile phone</em>
# Official website <em>mobile phone</em>
# Official website <em>mobile phone</em> at such a low price
# Big brand <em>mobile phone</em> low price sale
# <em>Mobile phone</em>
# Taobao recommends <em>mobile phone</em>
# <em>Mobile phone</em>
# <em>Mobile phone</em>
# <em>Mobile phone</em>
# <em>Mobile phone</em>
# Suning <em>mobile phone</em>
# Buy <em>mobile phone</em>
# Buy <em>mobile phone</em>
POST request
urlencode() encodes the form data submitted by a POST request from a dictionary of key-value pairs
Request() builds a POST request; parameter 1 is the URL, parameter 2 is the encoded form data
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request
import urllib.parse

posturl = "http://www.iqianyue.com/mypost/"
shuju = urllib.parse.urlencode({  # urlencode() encodes the dictionary of key-value pairs into form data
    'name': '123',
    'pass': '456'
}).encode('utf-8')
req = urllib.request.Request(posturl, shuju)  # Request() builds the POST request: parameter 1 is the URL, parameter 2 the encoded form data
html = urllib.request.urlopen(req).read().decode("utf-8")  # Get the page returned by the POST request
print(html)
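Request() also accepts a headers dictionary, so a crawler can send browser-like request headers; some sites reject the default Python client. A minimal sketch, assuming a site that checks the User-Agent (the header value below is only an example):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request

url = "http://edu.51cto.com/course/8360.html"
headers = {"User-Agent": "Mozilla/5.0"}  # Example value; pretend to be a browser
req = urllib.request.Request(url, headers=headers)  # A GET request with custom headers
html = urllib.request.urlopen(req).read().decode("utf-8")
print(html[:200])  # Print the first 200 characters as a quick check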