Python crawler (parsing web pages with XPath and downloading photos)

Keywords: xml Python pip C

XPath (XML Path Language) is a language for finding information in XML documents, which can be used to traverse elements and attributes in XML documents.

lxml is an HTML/XML parser whose main job is parsing HTML/XML and extracting data from it. Like Python's regular expression module, lxml is implemented in C, which makes it a high-performance parser. With it we can use XPath syntax to quickly locate specific elements and node information.
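
As a minimal sketch of how lxml and XPath fit together, the snippet below parses a small HTML string and pulls out image addresses. The HTML fragment and its paths are made up for illustration; only etree.HTML and the xpath method come from lxml itself.

```python
from lxml import etree

# Hypothetical HTML fragment, used only for illustration.
html = """
<html><body>
  <img class="img" src="/photos/a.png">
  <img class="banner" src="/static/top.png">
  <img class="img" src="/photos/b.png">
</body></html>
"""

# Parse the text into an element tree.
tree = etree.HTML(html)

# //img           : every <img> element anywhere in the document
# [@class="img"]  : keep only those whose class attribute equals "img"
# /@src           : return the value of the src attribute
links = tree.xpath('//img[@class="img"]/@src')
print(links)  # ['/photos/a.png', '/photos/b.png']
```

The banner image is skipped because its class attribute does not match the predicate.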

Official lxml documentation: [http://lxml.de/index.html](http://lxml.de/index.html)

lxml is built on C libraries, so it has to be installed separately. You can install it with pip: pip install lxml (or from a wheel file).

The following code downloads the pictures from a marriage website and saves them locally: it uses XPath to extract the address of each picture and then writes the image to disk. The code fetches only one page of pictures. Looking at the URL [http://www.qyw520.com/user/list-1-0--0-0-0-0-0-0-0-0-0-0-0-0-0---0-0-0-2.html], the page number is the number that follows list. To crawl multiple pages, add an offset that controls the page number.
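
The multi-page idea can be sketched as follows. This assumes, as described above, that only the first number after list- changes from page to page; URL_TEMPLATE and build_page_urls are hypothetical names, and the pattern has not been checked against the live site.

```python
# URL used in the script, with the page number replaced by a placeholder.
# Assumption: only the number right after "list-" selects the page.
URL_TEMPLATE = (
    "http://www.qyw520.com/user/"
    "list-{page}-0--0-0-0-0-0-0-0-0-0-0-0-0-0---0-0-0-2.html"
)


def build_page_urls(first_page, last_page):
    """Return the list-page URLs for first_page..last_page (inclusive)."""
    return [URL_TEMPLATE.format(page=n) for n in range(first_page, last_page + 1)]


urls = build_page_urls(1, 3)
for u in urls:
    print(u)
```

Each generated URL could then be passed to the spider's main method in a loop. The single-page script itself is listed next.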

import urllib.request
import random
from lxml import etree


class MySpider:

    userName = 1
    
    def headers(self):
        """
        Randomly choose a User-Agent string.
        :return: the chosen User-Agent string
        """
        headers_list = [
            "Mozilla/5.0(compatible;MSIE9.0;WindowsNT6.1;Trident/5.0)",
            "Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.0;Trident/4.0)",
            "Mozilla/4.0(compatible;MSIE7.0;WindowsNT6.0)",
            "Mozilla/5.0(WindowsNT6.1;rv:2.0.1)Gecko/20100101Firefox/4.0.1",
            "Opera/9.80(WindowsNT6.1;U;en)Presto/2.8.131Version/11.11",
            "Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1;TencentTraveler4.0)",
            "Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1;Maxthon2.0)",
            "Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1;360SE)",
            "Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1)",
        ]
        ua_agent = random.choice(headers_list)
        return ua_agent

    def load_page(self, url, header):

        headers = {"User-Agent": header}
        request = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(request)

        return response.read()

    def parse(self, html):
        # Decode the response bytes, then parse the text into an element tree
        content = html.decode("utf-8")

        selector = etree.HTML(content)
        img_links = selector.xpath('//img[@class="img"]/@src')

        for link in img_links:
            self.write_img(link)

    def write_img(self, imgurl):
        print("Saving file %d ..." % self.userName)
        # 1. Open the file for binary writing (the images/ directory must already exist)
        with open('images/' + str(self.userName) + '.png', 'wb') as f:
            # 2. Download the picture; this assumes the src value is a site-relative path
            images = urllib.request.urlopen('http://www.qyw520.com' + imgurl)
            # 3. Write the picture's bytes to the file
            f.write(images.read())
        print("File %d saved successfully!" % self.userName)
        self.userName += 1

    def main(self, url):
        header = self.headers()
        html = self.load_page(url, header)
        self.parse(html)

if __name__ == "__main__":

    url = "http://www.qyw520.com/user/list-1-0--0-0-0-0-0-0-0-0-0-0-0-0-0---0-0-0-2.html"

    myspider = MySpider()
    myspider.main(url)
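
The script assumes that the images directory already exists and that every src value is a path relative to the site root. A possible way to relax both assumptions is sketched below; resolve_image_url and image_path are hypothetical helpers that are not part of the original script, and the sample paths are invented.

```python
import os
from urllib.parse import urljoin

BASE_URL = "http://www.qyw520.com"  # site root used in the script


def resolve_image_url(src, base=BASE_URL):
    """Turn a src value into an absolute URL; absolute values pass through."""
    return urljoin(base, src)


def image_path(index, directory="images"):
    """Create the output directory if needed and return the file path."""
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, "%d.png" % index)


# Hypothetical src values, for illustration only.
print(resolve_image_url("/upload/1.jpg"))
print(resolve_image_url("http://cdn.example.com/2.jpg"))
```

In write_img, the string concatenation could be replaced by resolve_image_url(imgurl) and the hard-coded file name by image_path(self.userName).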



Posted by HERATHEIM on Fri, 24 Apr 2020 08:22:55 -0700