Python basic network crawler - day04


Contents

1. xpath tool (parsing)

2. lxml library and xpath usage

day04

1. requests module methods (a short sketch follows this list)

  1. get() parameters
    1. Query parameters: params - dictionary
    2. Proxies: proxies - dictionary
      1. Regular proxy: {'protocol': 'protocol://IP address:port'}
      2. Private proxy: {'protocol': 'protocol://username:password@IP address:port'}
    3. Web client authentication: auth - tuple
      auth = ('tarenacode','code_2014')
    4. SSL certificate: verify -> defaults to True
    5. timeout
  2. post() method
    1. data - dictionary, form data
  3. Response object properties
    1. text - string
    2. encoding - res.encoding = 'utf-8'
    3. content - bytes
    4. status_code - server response code
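
  A minimal sketch pulling these get() parameters and response properties together; the proxy address, credentials, and target URL below are placeholders, not real endpoints:

    import requests

    # hypothetical private proxy, for illustration only
    proxies = {'http': 'http://user:password@1.2.3.4:8080'}

    res = requests.get(
        'http://httpbin.org/get',
        params={'kw': 'python'},            # query parameters
        proxies=proxies,                    # route the request through a proxy
        auth=('tarenacode', 'code_2014'),   # web client authentication
        verify=True,                        # verify the SSL certificate (default)
        timeout=5                           # seconds to wait before giving up
    )
    res.encoding = 'utf-8'
    print(res.status_code)   # server response code
    print(res.text[:200])    # body as a string; res.content gives the raw bytes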

2. Data persistent storage

  1. MySQL workflow (a combined sketch of both flows follows this list)
    1. db = pymysql.connect('localhost',...'database name', charset='utf8')
    2. cursor = db.cursor()
    3. cursor.execute("sql command", [])
    4. db.commit()
    5. cursor.close()
    6. db.close()
  2. MongoDB workflow
    1. conn = pymongo.MongoClient('localhost',27017)
    2. db = conn.database_name
    3. myset = db.collection_name
    4. myset.insert_one(dictionary)
      Terminal operations
      1. mongo
      2. show dbs
      3. use database_name
      4. show tables
      5. db.collection_name.find().pretty()
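
  A minimal sketch of both storage flows, assuming a local MySQL server and a local MongoDB instance; the user, password, database, table, and collection names are placeholders:

    import pymysql
    import pymongo

    # --- MySQL flow: spiderdb / urls are hypothetical names ---
    db = pymysql.connect(host='localhost', user='root', password='123456',
                         database='spiderdb', charset='utf8')
    cursor = db.cursor()
    cursor.execute("insert into urls(url) values(%s)", ['http://example.com'])
    db.commit()      # commit the pending transaction
    cursor.close()
    db.close()

    # --- MongoDB flow ---
    conn = pymongo.MongoClient('localhost', 27017)
    mdb = conn.spiderdb          # database
    myset = mdb.urlset           # collection
    myset.insert_one({'url': 'http://example.com'})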

3. Handler processors (in the urllib library)

  1. Usage flow
    1. Create a Handler: ProxyHandler(regular proxy IP)
    2. Create an opener: build_opener(Handler)
    3. Make the request: opener.open(request)
  2. ProxyHandler
    1. ph = urllib.request.ProxyHandler({'http':'IP:..'})
    2. opener = urllib.request.build_opener(ph)
    3. req = urllib.request.Request(url,headers=headers...)
    4. res = opener.open(req)
    5. html = res.read().decode('utf-8')
  3. Handler processor classification
    1. ProxyHandler({regular proxy IP})
    2. ProxyBasicAuthHandler(password manager object)
    3. HTTPBasicAuthHandler(password manager object)
    4. Workflow (see the sketch after this list)
      1. Create the password manager
        pwdmg = urllib.request.HTTPPasswordMgrWithDefaultRealm()
      2. Add the authentication information
        pwdmg.add_password(None, "IP:..", "user", "pwd")
      3. Create the Handler from the password manager, then build the opener
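
  A minimal sketch of the password-manager workflow, assuming a hypothetical private proxy (the address, user name, and password are placeholders):

    import urllib.request

    proxy_addr = '1.2.3.4:8888'   # hypothetical private proxy
    pwdmg = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    pwdmg.add_password(None, proxy_addr, 'user', 'pwd')

    # ProxyBasicAuthHandler answers the proxy's authentication challenge
    ph = urllib.request.ProxyHandler({'http': 'http://' + proxy_addr})
    auth = urllib.request.ProxyBasicAuthHandler(pwdmg)
    opener = urllib.request.build_opener(ph, auth)

    req = urllib.request.Request('http://httpbin.org/ip',
                                 headers={'User-Agent': 'Mozilla/5.0'})
    res = opener.open(req)
    html = res.read().decode('utf-8')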

day04

1. xpath tool (parsing)

  1. xpath
    A language for finding information in XML documents; it is also suitable for searching HTML documents.
  2. xpath helper tools
    1. Chrome plug-in: XPath Helper
      1. Open: Ctrl + Shift + X (capital X)
      2. Close: Ctrl + Shift + X (capital X)
    2. Firefox plug-in: XPath Checker
    3. XPath expression editing tool: XMLQuire
  3. xpath matching rules
    1. Matching demonstration
      1. Find all nodes under the bookstore: /bookstore
      2. Find all book nodes: //book
      3. Find all title nodes under book whose lang attribute is 'en': //book/title[@lang='en']
      4. Find the text of the title node under the second book node under the bookstore: /bookstore/book[2]/title/text()
        <?xml version="1.0" encoding="ISO-8859-1"?>
        <bookstore>
        <book>
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author> 
        <year>2005</year>
        <price>29.99</price>
        </book>
        <book>
        <title lang="chs">Python</title>
        <author>Joe</author> 
        <year>2018</year>
        <price>49.99</price>
        </book>
        </bookstore>
        

    2. Selecting nodes
      / : select starting from the root node
      // : find nodes anywhere in the document
      @ : select a node's attributes
           //title[@lang="en"]
    3. Uses of @
      1. Select a specific node: //title[@lang="en"]
      2. Select multiple nodes (all title nodes that have a lang attribute): //title[@lang]
      3. Select a node's attribute value: //title/@lang
    4. Matching multiple paths
      1. Symbol: | (union)
      2. Get the title node and price node under all book nodes:
        //book/title | //book/price
    5. Functions (a sketch covering these rules follows this list)
      1. contains(): matches nodes whose attribute value contains a given string
        //title[contains(@lang,'e')]
      2. text()
        //title[contains(@lang,'e')]/text()
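
    A quick sketch verifying these rules with lxml against the bookstore XML above (lxml itself is covered in the next section); the XML declaration is dropped because etree.XML() rejects an encoding declaration inside a Python string:

      from lxml import etree

      xml = """<bookstore>
      <book>
      <title lang="en">Harry Potter</title>
      <author>J K. Rowling</author>
      <year>2005</year>
      <price>29.99</price>
      </book>
      <book>
      <title lang="chs">Python</title>
      <author>Joe</author>
      <year>2018</year>
      <price>49.99</price>
      </book>
      </bookstore>"""

      root = etree.XML(xml)
      print(root.xpath('//book/title[@lang="en"]/text()'))  # ['Harry Potter']
      print(root.xpath('/bookstore/book[2]/title/text()'))  # ['Python']
      print(root.xpath('//title/@lang'))                    # ['en', 'chs']
      print(root.xpath('//book/title | //book/price'))      # title and price nodes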

2. lxml library and xpath usage

  1. lxml library: an HTML/XML parsing library
    1. Install
      python -m pip install lxml
      conda install lxml
    2. Usage flow
      1. Import the module
        from lxml import etree
      2. Create a parsing object with the etree module of the lxml library
        parseHtml = etree.HTML(html)
      3. Call xpath on the parsing object to locate node information
        r_list = parseHtml.xpath('xpath expression')
        # The result of calling xpath is always a list
    3. Example:
      from lxml import etree
      
      html = """<div class="wrapper">
      	<i class="iconfont icon-back" id="back"></i>
      	<a href="/" id="channel">Sina Society</a>
      	<ul id="nav">
      		<li><a href="http://Domestic.firefox.sina.com/"title=" domestic">domestic </a> </li>
      		<li><a href="http://World.firefox.sina.com/"title=" international">international </a> </li>
      		<li><a href="http://Mil.firefox.sina.com/"title="military">military </a> </li>
      		<li><a href="http://Photo.firefox.sina.com/"title=" picture "> picture </a> </li>"
      		<li><a href="http://Society.firefox.sina.com/"title="society">society </a> </li>
      		<li><a href="http://Ent.firefox.sina.com/"title="entertainment">entertainment </a> </li>
      		<li><a href="http://Tech.firefox.sina.com/"title="technology">technology</a></li>
      		<li><a href="http://Sports.firefox.sina.com/"title="Sports">Sports</a></li>
      		<li><a href="http://Finance.firefox.sina.com/"title="finance and economics">finance and economics </a></li>
      		<li><a href="http://Auto.firefox.sina.com/"title="automobile">automobile </a> </li>
      	</ul>
      	<i class="iconfont icon-liebiao" id="menu"></i>
      </div>"""
      # Constructing Analytical Objects
      parseHtml = etree.HTML(html)
      # Invoking xpath matching with parsed objects
      r1 = parseHtml.xpath('//a/@href')
      #print(r1)
      
      # Get /
      r2 = parseHtml.xpath('//a[@id="channel"]/@href')
      #print(r2)
      
      # Get non /
      r3 = parseHtml.xpath('//ul[@id="nav"]//a/@href')
      #print(r3)
      # Get the text content of all a nodes
      r4 = parseHtml.xpath('//a/text()')
      #print(r4)
      # Get pictures, military... 
      r5 = parseHtml.xpath('//ul[@id="nav"]//a')
      for i in r5:
          print(i.text)

    4. How to get the content of a node object
      node_object.text
    5. Case 1: Get all the pictures in a Baidu Tieba forum
      1. Target: grab all pictures from a specified Tieba forum
      2. Approach
        1. Get the forum's home page URL and the next pages: find the URL pattern
        2. Get the URL of each post on page 1
        3. Send a request to each post URL to get the picture URLs in the post
        4. Request each picture URL and write it to a local file in wb mode
      3. Steps
        1. Get the forum's home page URL
          http://tieba.baidu.com/f? + query parameters
        2. Find the URLs of all posts on the page
          src: full link
          href: must be joined with the base URL
            /p/5926064184
            http://tieba.baidu.com/p/5926064184
          xpath matching
          Version 1: //div[@class="col2_right j_threadlist_li_right"]/div/div/a/@href
          Version 2 (recommended): //div[@class="t_con cleafix"]/div/div/div/a/@href
        3. Find the picture URL in each post
          xpath matching:
            //img[@class="BDE_Image"]/@src
        4. Code implementation
          '''02_Baidu Tieba image grabbing case.py'''
          import requests
          from lxml import etree
          
          class BaiduImageSpider:
              def __init__(self):
                  self.headers = {"User-Agent":"Mozilla/5.0"}
                  self.baseurl = "http://tieba.baidu.com"
                  self.pageurl = "http://tieba.baidu.com/f?"
                  
              # Get the list of all post URLs
              def getPageUrl(self,params):
                  res = requests.get(self.pageurl,params=params,headers=self.headers) 
                  res.encoding = "utf-8"
                  html = res.text
                  # Construct the parsing object
                  parseHtml = etree.HTML(html)
                  # Post Link List
                  t_list = parseHtml.xpath('//div[@class="t_con cleafix"]/div/div/div/a/@href')
                  # t_list : ['/p/233432','/p/2039820',..]
                  #print(t_list)
                  for t_link in t_list:
                      # Join with the base URL to form the full post link
                      t_link = self.baseurl + t_link
                      self.getImageUrl(t_link)
              
              # Get the list of image URLs in a post
              def getImageUrl(self,t_link):
                  res = requests.get(t_link,headers=self.headers)
                  res.encoding = "utf-8"
                  html = res.text
                  # Construct the parsing object
                  parseHtml = etree.HTML(html)
                  img_list = parseHtml.xpath('//img[@class="BDE_Image"]/@src')
                #  print(img_list)
                  for img_link in img_list:
                      self.writeImage(img_link)
              
              # Save the picture locally
              def writeImage(self,img_link):
                  # res.content holds the raw bytes of the picture
                  res = requests.get(img_link,headers=self.headers)
                  html = res.content
                  # use the last 12 characters of the URL as the filename
                  filename = img_link[-12:]
                  with open(filename,"wb") as f:
                      f.write(html)
                      print("%s downloaded successfully" % filename)
              
              # Main function
              def workOn(self):
                  name = input("Please enter the Tieba forum name:")
                  begin = int(input("Please enter the start page:"))
                  end = int(input("Please enter the end page:"))
                  
                  for n in range(begin,end+1):
                      pn = (n-1)*50   # each page lists 50 posts
                      params = {
                              "kw":name,
                              "pn":str(pn)
                          }
                      self.getPageUrl(params)
                      
          if __name__ == "__main__":
              spider = BaiduImageSpider()
              spider.workOn()

    6. Case 2: Qiushibaike (qiushibaike.com) - xpath
      1. Target: user nickname, joke content, number of laughs, number of comments
      2. Steps
        1. Find the URL
          https://www.qiushibaike.com/8hr/page/1/
        2. xpath matching
          1. Base xpath: //div[contains(@id,"qiushi_tag_")]
            User nickname: ./div/a/h2
            Joke content: .//div[@class="content"]/span
            Number of laughs: .//i (first match)
            Number of comments: .//i (second match)
      3. Code:
        '''03_Qiushibaike case.py'''
        import requests
        from lxml import etree
        import pymongo
        
        class QiuShiSpider:
            def __init__(self):
                self.url = "https://www.qiushibaike.com/8hr/page/1/"
                self.headers = {"User-Agent":"Mozilla/5.0"}
                self.conn = pymongo.MongoClient("localhost",27017)
                self.db = self.conn.Baikedb
                self.myset = self.db.baikeset
                
            def getPage(self):
                res = requests.get(self.url,headers=self.headers)
                res.encoding = "utf-8"
                html = res.text
                self.parsePage(html)
            
            def parsePage(self,html):
                parseHtml = etree.HTML(html)
                # Base xpath: a list with one node object per joke
                base_list = parseHtml.xpath('//div[contains(@id,"qiushi_tag_")]')
                # Iterate over the node object (base) of each joke
                for base in base_list:
                    # nickname
                    username = base.xpath('./div/a/h2')
                    if len(username) == 0:
                        username = "anonymous"
                    else:
                        username = username[0].text
                    # Joke content
                    content = base.xpath('.//div[@class="content"]/span')[0].text
                    # Number of laughs
                    # .//i matches [<element: laugh count>, <element: comment count>, ...]
                    laughNum = base.xpath('.//i')[0].text
                    # Number of comments
                    pingNum = base.xpath('.//i')[1].text
                    
                    d = {
                          "username":username.strip(),
                          "content":content.strip(),
                          "laughNum":laughNum.strip(),
                          "pingNum":pingNum.strip()
                      }
                    # insert() was removed in pymongo 4; insert_one() is the current method
                    self.myset.insert_one(d)
                    print("Stored in the database successfully")
        
        if __name__ == "__main__":
            spider = QiuShiSpider()
            spider.getPage()

3. Dynamic Web Site Data Grabbing

  1. Ajax dynamic loading
    1. Characteristic: more content loads as you scroll the mouse wheel
    2. Packet capture tool: the query parameters appear under WebForms - QueryString (a sketch follows)
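
  A minimal sketch of the usual pattern, assuming a hypothetical JSON endpoint and query parameters discovered through packet capture (the URL and parameter names are placeholders):

    import requests

    # hypothetical Ajax endpoint found in the captured traffic
    url = "http://example.com/api/list"
    params = {"page": "1", "size": "20"}
    headers = {"User-Agent": "Mozilla/5.0"}

    res = requests.get(url, params=params, headers=headers)
    data = res.json()   # Ajax responses are usually JSON rather than HTML
    print(data)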
