Catalog
day04
1. xpath tool (parsing)
2. lxml library and xpath usage
3. Dynamic Web Site Data Grabbing
1. requests module methods
- get() parameters
- Query parameters: params - dictionary
- Proxy: proxies - dictionary
- Regular proxy: {'protocol':'protocol://IP:port'}
- Private proxy: {'protocol':'protocol://username:password@IP:port'}
- Web client authentication: auth - tuple
auth = ('tarenacode','code_2014')
- SSL certificate: verify -> defaults to True
- timeout
- post() method
- data - dictionary holding the form data
- Response object properties (a combined usage sketch follows this list)
- text - response body as a string
- encoding - res.encoding = 'utf-8'
- content - response body as bytes
- status_code - server response code
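A minimal sketch pulling these parameters together; the proxy addresses are placeholders, and httpbin.org is used only as a neutral test endpoint:
import requests

# Query parameters are appended to the URL as ?key=value&...
params = {"kw": "python", "pn": "0"}

# Placeholder proxies - substitute a real proxy before enabling
proxies = {
    "http": "http://10.0.0.1:8000",             # regular proxy
    # "http": "http://user:pwd@10.0.0.1:8000",  # private proxy
}

res = requests.get(
    "http://httpbin.org/get",
    params=params,
    # proxies=proxies,                  # enable when a proxy is available
    auth=("tarenacode", "code_2014"),   # web client authentication tuple
    verify=True,                        # SSL certificate check (the default)
    timeout=5,                          # give up after 5 seconds
)
res.encoding = "utf-8"
print(res.status_code)                  # server response code
print(res.text)                         # response body as a string
print(len(res.content))                 # response body as bytes

# post(): form data goes in the data dictionary
res = requests.post("http://httpbin.org/post", data={"key": "value"}, timeout=5)
print(res.status_code)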
2. Data persistent storage
- MySQL workflow (see the sketch after this list)
- db = pymysql.connect('localhost', ... , 'database name', charset='utf8')
- cursor = db.cursor()
- cursor.execute("sql command", [])
- db.commit()
- cursor.close()
- db.close()
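A minimal sketch of that flow, assuming a local MySQL server where the database spiderdb and a table t1(name, price) already exist; all names and credentials here are placeholders:
import pymysql

# Connect: host, user, password, database name, charset
db = pymysql.connect(host="localhost", user="root", password="123456",
                     database="spiderdb", charset="utf8")
cursor = db.cursor()
# The list supplies the %s placeholders in the SQL command
cursor.execute("insert into t1 values(%s,%s)", ["Harry Potter", "29.99"])
db.commit()      # make the pending write permanent
cursor.close()   # close the cursor first
db.close()       # then close the connection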
- MongoDB workflow (see the sketch after the terminal commands)
- conn = pymongo.MongoClient('localhost', 27017)
- db = conn.database_name
- myset = db.collection_name
- myset.insert_one(dictionary)  # insert() is deprecated in current pymongo
- Terminal operations: mongo
- show dbs
- use database_name
- show tables
- db.collection_name.find().pretty()
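The same flow in code, assuming a local mongod on the default port; the database and collection names are illustrative:
import pymongo

conn = pymongo.MongoClient("localhost", 27017)
db = conn.spiderdb            # database (created lazily on first write)
myset = db.t1                 # collection
# insert_one() is the current pymongo call; older code used insert()
myset.insert_one({"name": "Harry Potter", "price": 29.99})
Afterwards, use spiderdb followed by db.t1.find().pretty() in the mongo shell shows the stored document.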
3. Handler processor (in the urllib library)
- Usage flow
- Create a Handler: ProxyHandler(regular proxy IP)
- Create an opener: build_opener(handler)
- Make the request: opener.open(request)
- ProxyHandler
- ph = urllib.request.ProxyHandler({'http':'IP:port'})
- opener = urllib.request.build_opener(ph)
- req = urllib.request.Request(url,headers=headers...)
- res = opener.open(req)
- html = res.read().decode('utf-8')
- Handler processor classification
- ProxyHandler({regular proxy IP})
- ProxyBasicAuthHandler(password manager object)
- HTTPBasicAuthHandler(password manager object)
- Workflow (see the sketch after this list)
- Create the password manager object
  pwdmg = urllib.request.HTTPPasswordMgrWithDefaultRealm()
- Add the authentication information
  pwdmg.add_password(None, 'protocol://IP:port', 'user', 'pwd')
- Create the Handler from the password manager object
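A sketch of the password-manager flow with HTTPBasicAuthHandler; the URL and credentials are placeholders:
import urllib.request

# 1. Create the password manager and add the authentication information
pwdmg = urllib.request.HTTPPasswordMgrWithDefaultRealm()
pwdmg.add_password(None, "http://example.com/", "user", "pwd")
# 2. Create the Handler from the password manager
handler = urllib.request.HTTPBasicAuthHandler(pwdmg)
# 3. Build the opener and make the request, as in the proxy flow above
opener = urllib.request.build_opener(handler)
req = urllib.request.Request("http://example.com/",
                             headers={"User-Agent": "Mozilla/5.0"})
res = opener.open(req)
html = res.read().decode("utf-8")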
day04
1. xpath tool (parsing)
- xpath
The language for finding information in XML documents; it also works for HTML document retrieval.
- xpath helper tools
- Chrome plug-in: XPath Helper
- Open: Ctrl + Shift + capital X
- Close: Ctrl + Shift + capital X
- Firefox plug-in: XPath Checker
- XPath expression editing tool: XMLQuire
- xpath matching rule
- Matching demonstration (a runnable sketch follows the sample document)
- Find all nodes under bookstore: /bookstore
- Find all book nodes: //book
- Find all title nodes under book whose lang attribute is 'en': //book/title[@lang='en']
- Find the title text of the second book node under bookstore: /bookstore/book[2]/title/text()
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book>
    <title lang="chs">Python</title>
    <author>Joe</author>
    <year>2018</year>
    <price>49.99</price>
  </book>
</bookstore>
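These expressions can be checked directly with lxml (introduced in section 2 below); a minimal sketch against the sample document:
from lxml import etree

# The bookstore document above, as bytes so the XML encoding
# declaration is honored
xml = b'''<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book>
    <title lang="chs">Python</title>
    <author>Joe</author>
    <year>2018</year>
    <price>49.99</price>
  </book>
</bookstore>'''

root = etree.fromstring(xml)
print(root.xpath('//book'))                           # both book nodes
print(root.xpath('//book/title[@lang="en"]'))         # the English title node
print(root.xpath('/bookstore/book[2]/title/text()'))  # ['Python']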
- Select nodes
/ : select from the root node
// : find matching nodes anywhere in the document
@ : select a node's attribute
//title[@lang="en"]
- Use of @
- Select a node: //title[@lang="en"]
- Select all nodes that have the attribute: //title[@lang]
- Select the attribute value of the node: //title/@lang
- Matching multiple paths
- Symbol: | (union)
- Get the title node and price node under all book nodes
//book/title | //book/price
- Functions (see the sketch after this list)
- contains(): matches nodes whose attribute value contains the given substring
  //title[contains(@lang,'e')]
- text(): gets a node's text content
  //title[contains(@lang,'e')]/text()
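The union and contains() forms, as a self-contained sketch with a trimmed copy of the same bookstore document:
from lxml import etree

xml = b'''<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="chs">Python</title><price>49.99</price></book>
</bookstore>'''
root = etree.fromstring(xml)

# Union: title and price nodes under every book, in document order
print(root.xpath('//book/title | //book/price'))

# contains(): lang="en" contains "e", lang="chs" does not
print(root.xpath('//title[contains(@lang,"e")]/text()'))   # ['Harry Potter']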
2. lxml library and xpath usage
- lxml library: HTML/XML parsing library
- install
python -m pip install lxml
conda install lxml
- Usage flow
- Import the module
from lxml import etree
- Create a parsing object via the etree module of the lxml library
parseHtml = etree.HTML(html)
- The parsing object calls the xpath tool to locate node information
r_list = parseHtml.xpath('xpath expression')
# Calling xpath() always returns a list
- Example:
from lxml import etree

html = """<div class="wrapper">
    <i class="iconfont icon-back" id="back"></i>
    <a href="/" id="channel">Sina Society</a>
    <ul id="nav">
        <li><a href="http://Domestic.firefox.sina.com/" title="domestic">domestic</a></li>
        <li><a href="http://World.firefox.sina.com/" title="international">international</a></li>
        <li><a href="http://Mil.firefox.sina.com/" title="military">military</a></li>
        <li><a href="http://Photo.firefox.sina.com/" title="picture">picture</a></li>
        <li><a href="http://Society.firefox.sina.com/" title="society">society</a></li>
        <li><a href="http://Ent.firefox.sina.com/" title="entertainment">entertainment</a></li>
        <li><a href="http://Tech.firefox.sina.com/" title="technology">technology</a></li>
        <li><a href="http://Sports.firefox.sina.com/" title="Sports">Sports</a></li>
        <li><a href="http://Finance.firefox.sina.com/" title="finance and economics">finance and economics</a></li>
        <li><a href="http://Auto.firefox.sina.com/" title="automobile">automobile</a></li>
    </ul>
    <i class="iconfont icon-liebiao" id="menu"></i>
</div>"""

# Build the parsing object
parseHtml = etree.HTML(html)

# Call xpath on the parsed object
# All href attribute values
r1 = parseHtml.xpath('//a/@href')
#print(r1)
# Get only "/" (the channel link)
r2 = parseHtml.xpath('//a[@id="channel"]/@href')
#print(r2)
# Get everything except "/" (the nav links)
r3 = parseHtml.xpath('//ul[@id="nav"]//a/@href')
#print(r3)
# Get the text content of all a nodes
r4 = parseHtml.xpath('//a/text()')
#print(r4)
# Get the node objects, then print their text (domestic, international, ...)
r5 = parseHtml.xpath('//ul[@id="nav"]//a')
for i in r5:
    print(i.text)
- How to get the content of a node object
node_object.text
- Case 1: Grab all the pictures in a Baidu Tieba forum
- Target: grab all pictures from a specified Tieba forum
- Approach
- Get the forum's home page URL and next-page URLs: find the URL pattern
- Get the URL of every post on the page
- Send a request to each post URL and extract the picture URLs in the post
- Request each picture URL and write the bytes to a local file in wb mode
- Steps
- Get the forum home page URL
http://tieba.baidu.com/f? + query parameters
- Find the URLs of all posts on the page
src: full link
href: must be joined to the base URL
/p/5926064184
http://tieba.baidu.com/p/5926064184
xpath matching:
Option 1: //div[@class="col2_right j_threadlist_li_right"]/div/div/a/@href
Option 2 (recommended): //div[@class="t_con cleafix"]/div/div/div/a/@href
- Find the picture URLs in each post
xpath matching:
//img[@class="BDE_Image"]/@src
- Code implementation
'''02_Baidu Tieba picture grabbing case.py'''
import requests
from lxml import etree

class BaiduImageSpider:
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0"}
        self.baseurl = "http://tieba.baidu.com"
        self.pageurl = "http://tieba.baidu.com/f?"

    # Get the list of all post URLs on a page
    def getPageUrl(self, params):
        res = requests.get(self.pageurl, params=params, headers=self.headers)
        res.encoding = "utf-8"
        html = res.text
        # Build the parsing object
        parseHtml = etree.HTML(html)
        # List of post links
        t_list = parseHtml.xpath('//div[@class="t_con cleafix"]/div/div/div/a/@href')
        # t_list : ['/p/233432','/p/2039820',..]
        #print(t_list)
        for t_link in t_list:
            # Join into the full post link
            t_link = self.baseurl + t_link
            self.getImageUrl(t_link)

    # Get the list of image URLs in a post
    def getImageUrl(self, t_link):
        res = requests.get(t_link, headers=self.headers)
        res.encoding = "utf-8"
        html = res.text
        # Build the parsing object
        parseHtml = etree.HTML(html)
        img_list = parseHtml.xpath('//img[@class="BDE_Image"]/@src')
        #print(img_list)
        for img_link in img_list:
            self.writeImage(img_link)

    # Save an image locally
    def writeImage(self, img_link):
        # Get the image bytes
        res = requests.get(img_link, headers=self.headers)
        html = res.content
        # Filename: last 12 characters of the URL
        filename = img_link[-12:]
        with open(filename, "wb") as f:
            f.write(html)
        print("%s downloaded successfully" % filename)

    # Main entry
    def workOn(self):
        name = input("Please enter the forum name:")
        begin = int(input("Please enter the start page:"))
        end = int(input("Please enter the end page:"))
        for n in range(begin, end + 1):
            pn = (n - 1) * 50
            params = {
                "kw": name,
                "pn": str(pn)
            }
            self.getPageUrl(params)

if __name__ == "__main__":
    spider = BaiduImageSpider()
    spider.workOn()
- Case 2: Qiushibaike (xpath)
- Target: user nickname, paragraph content, number of laughs, number of comments
- Steps
- Find the URL
https://www.qiushibaike.com/8hr/page/1/
- xpath matching
- Benchmark xpath: //div[contains(@id,"qiushi_tag_")]
User nickname: ./div/a/h2
Paragraph content: .//div[@class="content"]/span
Number of laughs: .//i (first match)
Number of comments: .//i (second match)
- Write code:
'''03_Qiushibaike.py'''
import requests
from lxml import etree
import pymongo

class QiuShiSpider:
    def __init__(self):
        self.url = "https://www.qiushibaike.com/8hr/page/1/"
        self.headers = {"User-Agent": "Mozilla/5.0"}
        self.conn = pymongo.MongoClient("localhost", 27017)
        self.db = self.conn.Baikedb
        self.myset = self.db.baikeset

    def getPage(self):
        res = requests.get(self.url, headers=self.headers)
        res.encoding = "utf-8"
        html = res.text
        self.parsePage(html)

    def parsePage(self, html):
        parseHtml = etree.HTML(html)
        # Benchmark xpath: the node list, one node per paragraph
        base_list = parseHtml.xpath('//div[contains(@id,"qiushi_tag_")]')
        # Iterate over each paragraph's node object (base)
        for base in base_list:
            # Nickname
            username = base.xpath('./div/a/h2')
            if len(username) == 0:
                username = "anonymous"
            else:
                username = username[0].text
            # Paragraph content
            content = base.xpath('.//div[@class="content"]/span')[0].text
            # Number of laughs
            # [<element laughs>, <element comments>, <element ...>]
            laughNum = base.xpath('.//i')[0].text
            # Number of comments
            pingNum = base.xpath('.//i')[1].text
            d = {
                "username": username.strip(),
                "content": content.strip(),
                "laughNum": laughNum.strip(),
                "pingNum": pingNum.strip()
            }
            # insert_one() in current pymongo; older versions used insert()
            self.myset.insert_one(d)
            print("Stored in database successfully")

if __name__ == "__main__":
    spider = QiuShiSpider()
    spider.getPage()
3. Dynamic Web Site Data Grabbing
- Ajax dynamic loading
- Characteristic: more content is loaded as the mouse wheel scrolls
- Packet-capture tool: read the query parameters from WebForms - QueryString (a request sketch follows)
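Once the captured request reveals the endpoint and its query string, the Ajax data can be requested directly. A sketch where the URL and parameter names are hypothetical stand-ins for whatever the packet-capture tool shows:
import requests

# Hypothetical Ajax endpoint and query parameters - substitute the
# values read from the packet-capture tool's QueryString view
url = "http://example.com/api/list"
params = {"start": "0", "limit": "20"}
headers = {"User-Agent": "Mozilla/5.0"}

res = requests.get(url, params=params, headers=headers, timeout=5)
data = res.json()     # Ajax endpoints usually return JSON, not HTML
print(data)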