XPath (XML Path Language) is a language for locating information in XML documents; it can be used to traverse the elements and attributes of an XML document.
lxml is an HTML/XML parser whose main job is parsing and extracting data from HTML/XML documents. Like the regular-expression module, lxml is implemented in C, which makes it a high-performance HTML/XML parser for Python. With it we can use the XPath syntax covered earlier to quickly locate specific elements and node information.
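As a quick illustration (separate from the crawler shown further down), the following minimal sketch parses a small, made-up HTML fragment with `etree.HTML()` and pulls out attribute values with an XPath expression; the HTML snippet and variable names are invented for the example:

```python
from lxml import etree

# A small, made-up HTML fragment used only for demonstration
html = """
<div>
    <img class="img" src="/upload/a.png"/>
    <img class="img" src="/upload/b.png"/>
    <img class="logo" src="/static/logo.png"/>
</div>
"""

# etree.HTML() builds an element tree from the (possibly incomplete) HTML text
selector = etree.HTML(html)

# XPath: select the src attribute of every <img> whose class is "img"
img_links = selector.xpath('//img[@class="img"]/@src')
print(img_links)  # ['/upload/a.png', '/upload/b.png']
```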
Official lxml documentation: [http://lxml.de/index.html](http://lxml.de/index.html)
lxml depends on a C library, so it has to be installed first. You can install it with pip: `pip install lxml` (or install it from a wheel).
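After installation you can check, for example, that the C extension imports correctly; the version tuples printed are simply whatever happens to be installed on your machine:

```python
from lxml import etree

# Print the versions of lxml and of the underlying libxml2 library
print(etree.LXML_VERSION)
print(etree.LIBXML_VERSION)
```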
The following code grabs the profile pictures from a matchmaking website and saves them locally: it uses XPath to parse out the link address of each picture and then writes the picture to disk. The code only fetches a single page of pictures. Looking at the URL `http://www.qyw520.com/user/list-1-0--0-0-0-0-0-0-0-0-0-0-0-0-0---0-0-0-2.html`, the page number is the digit right after `list-`, so if you need to crawl several pages you just set an offset to control the page number (a sketch follows the code below).

```python
import urllib.request
import random
from lxml import etree


class MySpider:
    userName = 1  # counter used to name the saved image files

    def headers(self):
        """Randomly pick a User-Agent value (the value only, without the "User-Agent:" prefix)."""
        headers_list = [
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
            "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
            "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        ]
        ua_agent = random.choice(headers_list)
        return ua_agent

    def load_page(self, url, header):
        """Download the page and return its raw bytes."""
        headers = {"User-Agent": header}
        request = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(request)
        return response.read()

    def parse(self, html):
        """Parse the HTML document and extract the image links with XPath."""
        content = html.decode("utf-8")
        selector = etree.HTML(content)
        img_links = selector.xpath('//img[@class="img"]/@src')
        for link in img_links:
            self.write_img(link)

    def write_img(self, imgurl):
        print("Storing file %d ..." % self.userName)
        # 1. Open the file and get a file object (the images/ directory must already exist)
        with open('images/' + str(self.userName) + '.png', 'wb') as f:
            # 2. Fetch the picture content
            images = urllib.request.urlopen('http://www.qyw520.com' + imgurl)
            # 3. Write the picture content to the file with write()
            f.write(images.read())
            print("File %d saved successfully!" % self.userName)
            self.userName += 1

    def main(self, url):
        header = self.headers()
        html = self.load_page(url, header)
        self.parse(html)


if __name__ == "__main__":
    url = "http://www.qyw520.com/user/list-1-0--0-0-0-0-0-0-0-0-0-0-0-0-0---0-0-0-2.html"
    myspider = MySpider()
    myspider.main(url)
```
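To crawl more than one page, one possible approach (a sketch under the assumption that the digit after `list-` in the URL really does select the page; the `max_page` value is an arbitrary example) is to replace the `if __name__ == "__main__":` block above with a loop that builds the URL for each page and reuses `MySpider` unchanged:

```python
if __name__ == "__main__":
    myspider = MySpider()
    max_page = 5  # arbitrary example value: how many pages to crawl
    for page in range(1, max_page + 1):
        # Assumption: the number right after "list-" selects the page
        url = ("http://www.qyw520.com/user/"
               "list-%d-0--0-0-0-0-0-0-0-0-0-0-0-0-0---0-0-0-2.html" % page)
        myspider.main(url)
```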