Catalog
day04
1. xpath tool (parsing)
2. lxml library and xpath usage
3. Dynamic Web Site Data Grabbing
1. requests module methods
- get() parameters
- Query parameters: params - dictionary
- Proxy: proxies - dictionary
- Regular proxy: {'protocol':'protocol://IP:port'}
- Private proxy: {'protocol':'protocol://username:password@IP:port'}
- Web client authentication: auth - tuple
auth = ('tarenacode','code_2014')
- SSL certificate: verify -> defaults to True
- timeout
- post() method
- data - dictionary holding the form data
- Response object properties (a combined usage sketch follows this list)
- text - response body as a string
- encoding - res.encoding = 'utf-8'
- content - response body as bytes
- status_code - server response code
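A minimal sketch pulling these parameters together; the proxy addresses are placeholders, and httpbin.org is used only as a neutral test endpoint:
import requests

# Query parameters are appended to the URL as ?key=value&...
params = {"kw": "python", "pn": "0"}

# Placeholder proxies - substitute a real proxy before enabling
proxies = {
    "http": "http://10.0.0.1:8000",             # regular proxy
    # "http": "http://user:pwd@10.0.0.1:8000",  # private proxy
}

res = requests.get(
    "http://httpbin.org/get",
    params=params,
    # proxies=proxies,                  # enable when a proxy is available
    auth=("tarenacode", "code_2014"),   # web client authentication tuple
    verify=True,                        # SSL certificate check (the default)
    timeout=5,                          # give up after 5 seconds
)
res.encoding = "utf-8"
print(res.status_code)                  # server response code
print(res.text)                         # response body as a string
print(len(res.content))                 # response body as bytes

# post(): form data goes in the data dictionary
res = requests.post("http://httpbin.org/post", data={"key": "value"}, timeout=5)
print(res.status_code)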
2. Data persistent storage
- MySQL workflow (see the sketch after this list)
- db = pymysql.connect('localhost', ... , 'database name', charset='utf8')
- cursor = db.cursor()
- cursor.execute("sql command", [])
- db.commit()
- cursor.close()
- db.close()
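A minimal sketch of that flow, assuming a local MySQL server where the database spiderdb and a table t1(name, price) already exist; all names and credentials here are placeholders:
import pymysql

# Connect: host, user, password, database name, charset
db = pymysql.connect(host="localhost", user="root", password="123456",
                     database="spiderdb", charset="utf8")
cursor = db.cursor()
# The list supplies the %s placeholders in the SQL command
cursor.execute("insert into t1 values(%s,%s)", ["Harry Potter", "29.99"])
db.commit()      # make the pending write permanent
cursor.close()   # close the cursor first
db.close()       # then close the connection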
- MongoDB workflow (see the sketch after the terminal commands)
- conn = pymongo.MongoClient('localhost', 27017)
- db = conn.database_name
- myset = db.collection_name
- myset.insert_one(dictionary)  # insert() is deprecated in current pymongo
- Terminal operations: mongo
- show dbs
- use database_name
- show tables
- db.collection_name.find().pretty()
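The same flow in code, assuming a local mongod on the default port; the database and collection names are illustrative:
import pymongo

conn = pymongo.MongoClient("localhost", 27017)
db = conn.spiderdb            # database (created lazily on first write)
myset = db.t1                 # collection
# insert_one() is the current pymongo call; older code used insert()
myset.insert_one({"name": "Harry Potter", "price": 29.99})
Afterwards, use spiderdb followed by db.t1.find().pretty() in the mongo shell shows the stored document.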
3. Handler processor (in the urllib library)
- Usage flow
- Create a Handler: ProxyHandler(regular proxy IP)
- Create an opener: build_opener(handler)
- Make the request: opener.open(request)
- ProxyHandler
- ph = urllib.request.ProxyHandler({'http':'IP:port'})
- opener = urllib.request.build_opener(ph)
- req = urllib.request.Request(url,headers=headers...)
- res = opener.open(req)
- html = res.read().decode('utf-8')
- Handler processor classification
- ProxyHandler({regular proxy IP})
- ProxyBasicAuthHandler(password manager object)
- HTTPBasicAuthHandler(password manager object)
- Workflow (see the sketch after this list)
- Create the password manager object
  pwdmg = urllib.request.HTTPPasswordMgrWithDefaultRealm()
- Add the authentication information
  pwdmg.add_password(None, 'protocol://IP:port', 'user', 'pwd')
- Create the Handler from the password manager object
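A sketch of the password-manager flow with HTTPBasicAuthHandler; the URL and credentials are placeholders:
import urllib.request

# 1. Create the password manager and add the authentication information
pwdmg = urllib.request.HTTPPasswordMgrWithDefaultRealm()
pwdmg.add_password(None, "http://example.com/", "user", "pwd")
# 2. Create the Handler from the password manager
handler = urllib.request.HTTPBasicAuthHandler(pwdmg)
# 3. Build the opener and make the request, as in the proxy flow above
opener = urllib.request.build_opener(handler)
req = urllib.request.Request("http://example.com/",
                             headers={"User-Agent": "Mozilla/5.0"})
res = opener.open(req)
html = res.read().decode("utf-8")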
day04
1. xpath tool (parsing)
- xpath
The language for finding information in XML documents; it also works for HTML document retrieval.
- xpath helper tools
- Chrome plug-in: XPath Helper
- Open: Ctrl + Shift + capital X
- Close: Ctrl + Shift + capital X
- Firefox plug-in: XPath Checker
- XPath expression editing tool: XMLQuire
- xpath matching rule
- Matching demonstration (a runnable sketch follows the sample document)
- Find all nodes under bookstore: /bookstore
- Find all book nodes: //book
- Find all title nodes under book whose lang attribute is 'en': //book/title[@lang='en']
- Find the title text of the second book node under bookstore: /bookstore/book[2]/title/text()
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book>
    <title lang="chs">Python</title>
    <author>Joe</author>
    <year>2018</year>
    <price>49.99</price>
  </book>
</bookstore>
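These expressions can be checked directly with lxml (introduced in section 2 below); a minimal sketch against the sample document:
from lxml import etree

# The bookstore document above, as bytes so the XML encoding
# declaration is honored
xml = b'''<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book>
    <title lang="chs">Python</title>
    <author>Joe</author>
    <year>2018</year>
    <price>49.99</price>
  </book>
</bookstore>'''

root = etree.fromstring(xml)
print(root.xpath('//book'))                           # both book nodes
print(root.xpath('//book/title[@lang="en"]'))         # the English title node
print(root.xpath('/bookstore/book[2]/title/text()'))  # ['Python']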
- Select nodes
/ : select from the root node
// : find matching nodes anywhere in the document
@ : select a node's attribute
//title[@lang="en"]
- Use of @
- Select a node: //title[@lang="en"]
- Select all nodes that have the attribute: //title[@lang]
- Select the attribute value of the node: //title/@lang
- Matching multiple paths
- Symbol: | (union)
- Get the title node and price node under all book nodes
//book/title | //book/price
- Functions (see the sketch after this list)
- contains(): matches nodes whose attribute value contains the given substring
  //title[contains(@lang,'e')]
- text(): gets a node's text content
  //title[contains(@lang,'e')]/text()
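The union and contains() forms, as a self-contained sketch with a trimmed copy of the same bookstore document:
from lxml import etree

xml = b'''<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="chs">Python</title><price>49.99</price></book>
</bookstore>'''
root = etree.fromstring(xml)

# Union: title and price nodes under every book, in document order
print(root.xpath('//book/title | //book/price'))

# contains(): lang="en" contains "e", lang="chs" does not
print(root.xpath('//title[contains(@lang,"e")]/text()'))   # ['Harry Potter']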
2. lxml library and xpath usage
- lxml library: HTML/XML parsing library
- install
python -m pip install lxml
conda install lxml
- Usage flow
- Import the module
from lxml import etree
- Create a parsing object via the etree module of the lxml library
parseHtml = etree.HTML(html)
- The parsing object calls the xpath tool to locate node information
r_list = parseHtml.xpath('xpath expression')
# Calling xpath() always returns a list
- Example:
from lxml import etree

html = """<div class="wrapper">
    <i class="iconfont icon-back" id="back"></i>
    <a href="/" id="channel">Sina Society</a>
    <ul id="nav">
        <li><a href="http://Domestic.firefox.sina.com/" title="domestic">domestic</a></li>
        <li><a href="http://World.firefox.sina.com/" title="international">international</a></li>
        <li><a href="http://Mil.firefox.sina.com/" title="military">military</a></li>
        <li><a href="http://Photo.firefox.sina.com/" title="picture">picture</a></li>
        <li><a href="http://Society.firefox.sina.com/" title="society">society</a></li>
        <li><a href="http://Ent.firefox.sina.com/" title="entertainment">entertainment</a></li>
        <li><a href="http://Tech.firefox.sina.com/" title="technology">technology</a></li>
        <li><a href="http://Sports.firefox.sina.com/" title="Sports">Sports</a></li>
        <li><a href="http://Finance.firefox.sina.com/" title="finance and economics">finance and economics</a></li>
        <li><a href="http://Auto.firefox.sina.com/" title="automobile">automobile</a></li>
    </ul>
    <i class="iconfont icon-liebiao" id="menu"></i>
</div>"""

# Build the parsing object
parseHtml = etree.HTML(html)

# Call xpath on the parsed object
# All href attribute values
r1 = parseHtml.xpath('//a/@href')
#print(r1)
# Get only "/" (the channel link)
r2 = parseHtml.xpath('//a[@id="channel"]/@href')
#print(r2)
# Get everything except "/" (the nav links)
r3 = parseHtml.xpath('//ul[@id="nav"]//a/@href')
#print(r3)
# Get the text content of all a nodes
r4 = parseHtml.xpath('//a/text()')
#print(r4)
# Get the node objects, then print their text (domestic, international, ...)
r5 = parseHtml.xpath('//ul[@id="nav"]//a')
for i in r5:
    print(i.text)
- How to get the content of a node object
node_object.text
- Case 1: Grab all the pictures in a Baidu Tieba forum
- Target: grab all pictures from a specified Tieba forum
- Approach
- Get the forum's home page URL and next-page URLs: find the URL pattern
- Get the URL of every post on the page
- Send a request to each post URL and extract the picture URLs in the post
- Request each picture URL and write the bytes to a local file in wb mode
- Steps
- Get the forum home page URL
http://tieba.baidu.com/f? + query parameters
- Find the URLs of all posts on the page
src: full link
href: must be joined to the base URL
/p/5926064184
http://tieba.baidu.com/p/5926064184
xpath matching:
Option 1: //div[@class="col2_right j_threadlist_li_right"]/div/div/a/@href
Option 2 (recommended): //div[@class="t_con cleafix"]/div/div/div/a/@href
- Find the picture URLs in each post
xpath matching:
//img[@class="BDE_Image"]/@src
- Code implementation
'''02_Baidu Tieba picture grabbing case.py'''
import requests
from lxml import etree

class BaiduImageSpider:
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0"}
        self.baseurl = "http://tieba.baidu.com"
        self.pageurl = "http://tieba.baidu.com/f?"

    # Get the list of all post URLs on a page
    def getPageUrl(self, params):
        res = requests.get(self.pageurl, params=params, headers=self.headers)
        res.encoding = "utf-8"
        html = res.text
        # Build the parsing object
        parseHtml = etree.HTML(html)
        # List of post links
        t_list = parseHtml.xpath('//div[@class="t_con cleafix"]/div/div/div/a/@href')
        # t_list : ['/p/233432','/p/2039820',..]
        #print(t_list)
        for t_link in t_list:
            # Join into the full post link
            t_link = self.baseurl + t_link
            self.getImageUrl(t_link)

    # Get the list of image URLs in a post
    def getImageUrl(self, t_link):
        res = requests.get(t_link, headers=self.headers)
        res.encoding = "utf-8"
        html = res.text
        # Build the parsing object
        parseHtml = etree.HTML(html)
        img_list = parseHtml.xpath('//img[@class="BDE_Image"]/@src')
        #print(img_list)
        for img_link in img_list:
            self.writeImage(img_link)

    # Save an image locally
    def writeImage(self, img_link):
        # Get the image bytes
        res = requests.get(img_link, headers=self.headers)
        html = res.content
        # Filename: last 12 characters of the URL
        filename = img_link[-12:]
        with open(filename, "wb") as f:
            f.write(html)
        print("%s downloaded successfully" % filename)

    # Main entry
    def workOn(self):
        name = input("Please enter the forum name:")
        begin = int(input("Please enter the start page:"))
        end = int(input("Please enter the end page:"))
        for n in range(begin, end + 1):
            pn = (n - 1) * 50
            params = {
                "kw": name,
                "pn": str(pn)
            }
            self.getPageUrl(params)

if __name__ == "__main__":
    spider = BaiduImageSpider()
    spider.workOn()
- Case 2: Qiushibaike (xpath)
- Target: user nickname, paragraph content, number of laughs, number of comments
- Steps
- Find the URL
https://www.qiushibaike.com/8hr/page/1/
- xpath matching
- Benchmark xpath: //div[contains(@id,"qiushi_tag_")]
User nickname: ./div/a/h2
Paragraph content: .//div[@class="content"]/span
Number of laughs: .//i (first match)
Number of comments: .//i (second match)
- Write code:
'''03_Qiushibaike.py'''
import requests
from lxml import etree
import pymongo

class QiuShiSpider:
    def __init__(self):
        self.url = "https://www.qiushibaike.com/8hr/page/1/"
        self.headers = {"User-Agent": "Mozilla/5.0"}
        self.conn = pymongo.MongoClient("localhost", 27017)
        self.db = self.conn.Baikedb
        self.myset = self.db.baikeset

    def getPage(self):
        res = requests.get(self.url, headers=self.headers)
        res.encoding = "utf-8"
        html = res.text
        self.parsePage(html)

    def parsePage(self, html):
        parseHtml = etree.HTML(html)
        # Benchmark xpath: the node list, one node per paragraph
        base_list = parseHtml.xpath('//div[contains(@id,"qiushi_tag_")]')
        # Iterate over each paragraph's node object (base)
        for base in base_list:
            # Nickname
            username = base.xpath('./div/a/h2')
            if len(username) == 0:
                username = "anonymous"
            else:
                username = username[0].text
            # Paragraph content
            content = base.xpath('.//div[@class="content"]/span')[0].text
            # Number of laughs
            # [<element laughs>, <element comments>, <element ...>]
            laughNum = base.xpath('.//i')[0].text
            # Number of comments
            pingNum = base.xpath('.//i')[1].text
            d = {
                "username": username.strip(),
                "content": content.strip(),
                "laughNum": laughNum.strip(),
                "pingNum": pingNum.strip()
            }
            # insert_one() in current pymongo; older versions used insert()
            self.myset.insert_one(d)
            print("Stored in database successfully")

if __name__ == "__main__":
    spider = QiuShiSpider()
    spider.getPage()
3. Dynamic Web Site Data Grabbing
- Ajax dynamic loading
- Characteristic: more content is loaded as the mouse wheel scrolls
- Packet-capture tool: read the query parameters from WebForms - QueryString (a request sketch follows)
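Once the captured request reveals the endpoint and its query string, the Ajax data can be requested directly. A sketch where the URL and parameter names are hypothetical stand-ins for whatever the packet-capture tool shows:
import requests

# Hypothetical Ajax endpoint and query parameters - substitute the
# values read from the packet-capture tool's QueryString view
url = "http://example.com/api/list"
params = {"start": "0", "limit": "20"}
headers = {"User-Agent": "Mozilla/5.0"}

res = requests.get(url, params=params, headers=headers, timeout=5)
data = res.json()     # Ajax endpoints usually return JSON, not HTML
print(data)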