Crawler [2]: Using the Beautiful Soup Library

Keywords: Python, Attribute, Linux, Android

Preface:

In the previous article, we used Python's urllib and re libraries to complete an introductory crawler example.

However, because regular expressions are difficult to master, this time we will use a third-party library, Beautiful Soup, to extract content from the web page.

Main text:

One, Download and install Beautiful Soup

If your Python 3.x installation has pip3, you can install beautifulsoup4 from the pip3 command line:

pip3 install beautifulsoup4

If pip3 is not available, Beautiful Soup can also be installed from source.

Download the source package, enter the unpacked directory, and install from the command line:

python setup.py install

After installation, start the Python interpreter and test whether the installation succeeded:

from bs4 import BeautifulSoup

If there is no error, the installation is successful.
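Beyond the bare import, a quick throwaway parse also confirms that parsing itself works; the HTML string below is just a made-up example, not taken from any page in this article:

from bs4 import BeautifulSoup

# Parse a tiny in-memory HTML string with the built-in html.parser
soup = BeautifulSoup('<html><head><title>test</title></head></html>', 'html.parser')
print(soup.title.string)  # Should print: test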

For the Chinese documentation of bs4, see: http://beautifulsoup.readthedocs.io/zh_CN/latest/

Two, Usage

Let's first look at the basic usage of bs4:

import urllib.request
from bs4 import BeautifulSoup
 
def getHtmlCode(url):  # Takes a url and returns the HTML source code of the corresponding page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36'
    }
    url1 = urllib.request.Request(url, headers=headers) # Wrap the url in a Request with headers to simulate browser access
    page = urllib.request.urlopen(url1).read()  # Read the page source as bytes
    page = page.decode('UTF-8')  # Decode the bytes into a UTF-8 string
    return page
 
 
if __name__ == '__main__':
    html = getHtmlCode('http://www.zhangzishi.cc/20160413hx.html')
    soup = BeautifulSoup(html,"html.parser") # The BeautifulSoup class parses the HTML source and returns a soup object
    print(type(soup))
    print(soup.prettify())   # Output according to the structure of standard indentation format
    print(soup.p)    # Output the first p tag in the source code
    print(soup.p.name)  # Output the tag name: p
    print(soup.title.string)  # Output the text content of the title tag
 
    # The first way to get the content of a tag attribute
    print(soup.a['href'])
 
    # The second way to get the content of a tag attribute
    img = soup.img
    print(img.get('src'))
 
    # find_all finds all img tags and returns a list
    print(soup.find_all('img'))
 
    # Combine the two steps above to get the src attribute of every img tag
    imgList = soup.find_all('img')
    for i in imgList:
        print(i.get('src'))
        print(type(i.get('src')))  # get returns a string
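As a side note, bs4 also supports CSS selectors through select() and select_one(); the short sketch below, which reuses the soup object from the example above, shows equivalent lookups as an optional alternative rather than part of the original example:

# Optional alternative: CSS selectors instead of find_all / direct tag access
imgTags = soup.select('img')       # Same tags as soup.find_all('img')
for tag in imgTags:
    print(tag.get('src'))          # get() works on selected tags too
firstLink = soup.select_one('a')   # First a tag, like soup.a
if firstLink is not None:
    print(firstLink.get('href'))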

 

Three, Combine urllib and bs4 to complete the crawler program

import urllib.request
from bs4 import BeautifulSoup
 
def getHtmlCode(url):  # Takes a url and returns the HTML source code of the corresponding page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36'
    }
    url1 = urllib.request.Request(url, headers=headers) # Wrap the url in a Request with headers to simulate browser access
    page = urllib.request.urlopen(url1).read()  # Read the page source as bytes
    page = page.decode('UTF-8')  # Decode the bytes into a UTF-8 string
    return page
 
def getImg(page,localPath):  # Takes the HTML source, extracts the img tags, and saves the images to the local machine
 
    soup = BeautifulSoup(page,'html.parser') # Parsing pages in html format
    imgList = soup.find_all('img')  # Returns a list of all img Tags
    x = 0
    for imgUrl in imgList:  # List loop
        print('Downloading:%s'%imgUrl.get('src'))
        # urlretrieve(url,local) method saves the image to the local machine according to the url of the image
        urllib.request.urlretrieve(imgUrl.get('src'),localPath+'%d.jpg'%x)
        x+=1
 
 
if __name__ == '__main__':
    url = 'http://www.zhangzishi.cc/20160928gx.html'
    localPath = 'e:/pythonSpiderFile/img3/'
    page = getHtmlCode(url)
    getImg(page,localPath)
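One caveat: the program above assumes that the localPath directory already exists and that every src attribute is an absolute URL. If either assumption fails on a given page, a variant of getImg along the following lines may help; getImgSafe is a hypothetical name, and urljoin/makedirs are standard-library calls used here only as a sketch:

import os
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

def getImgSafe(page, pageUrl, localPath):  # Sketch: tolerate relative src values and a missing directory
    os.makedirs(localPath, exist_ok=True)  # Create the target directory if it does not exist
    soup = BeautifulSoup(page, 'html.parser')
    x = 0
    for img in soup.find_all('img'):
        src = img.get('src')
        if not src:  # Skip img tags without a src attribute
            continue
        fullUrl = urllib.parse.urljoin(pageUrl, src)  # Resolve relative paths against the page url
        print('Downloading:%s' % fullUrl)
        urllib.request.urlretrieve(fullUrl, localPath + '%d.jpg' % x)
        x += 1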

 

Posted by truck7758 on Fri, 29 Nov 2019 11:40:28 -0800