Preface:
In the previous article, we used Python's urllib and re libraries to complete an introductory crawler example.
However, because regular expressions are difficult to master, this time we use a third-party library, Beautiful Soup, to extract content from the web page.
One, download and install Beautiful Soup
If you have Python 3.x with pip3 installed, you can install beautifulsoup4 from the command line:
pip3 install beautifulsoup4
If pip3 is not installed, you can also install from source: download the source package, enter the package directory, and run from the command line
python setup.py install
After installation, start the Python interpreter and test whether the installation succeeded:
from bs4 import BeautifulSoup
If there is no error, the installation is successful.
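Beyond the bare import, a quick smoke test confirms that bs4 can actually parse markup. The tiny HTML snippet below is made up purely for illustration:

```python
import bs4
from bs4 import BeautifulSoup

# Parse a hand-written snippet to verify the install works end to end
soup = BeautifulSoup('<html><body><p id="demo">hello bs4</p></body></html>',
                     'html.parser')
print(bs4.__version__)      # installed version string
print(soup.find('p').text)  # hello bs4
```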
For the Chinese documentation of bs4, please refer to: http://beautifulsoup.readthedocs.io/zh_CN/latest/
Two, usage
Let's first look at the basic usage of bs4:
import urllib.request
from bs4 import BeautifulSoup

def getHtmlCode(url):
    # This method takes a url and returns the html source code of that page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36'
    }
    url1 = urllib.request.Request(url, headers=headers)  # Request adds headers to the url to simulate browser access
    page = urllib.request.urlopen(url1).read()  # read the source code of the page as bytes
    page = page.decode('UTF-8')  # decode the bytes into a string
    return page

if __name__ == '__main__':
    html = getHtmlCode('http://www.zhangzishi.cc/20160413hx.html')
    soup = BeautifulSoup(html, 'html.parser')  # parse the source code and return a BeautifulSoup object
    print(type(soup))
    print(soup.prettify())      # output with standard indentation
    print(soup.p)               # the first p tag in the source code
    print(soup.p.name)          # the tag name: p
    print(soup.title.string)    # the tag's content
    # The first way to get the content of a tag attribute
    print(soup.a['href'])
    # The second way to get the content of a tag attribute
    img = soup.img
    print(img.get('src'))
    # find_all finds all img tags and returns a list
    print(soup.find_all('img'))
    # Combine the two steps above to get the src attribute of every img tag
    imgList = soup.find_all('img')
    for i in imgList:
        print(i.get('src'))
        print(type(i.get('src')))  # get returns a string
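find_all can filter by more than the tag name: it also accepts attribute values and test functions. A small sketch (the HTML snippet here is invented for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<div>
  <img src="a.jpg" class="thumb">
  <img src="b.png" class="thumb">
  <img src="c.gif" class="banner">
  <a href="http://example.com">link</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Filter by attribute value: only img tags whose class is "thumb"
# (class_ has a trailing underscore because class is a Python keyword)
thumbs = soup.find_all('img', class_='thumb')
print([img.get('src') for img in thumbs])  # ['a.jpg', 'b.png']

# Filter by a test function: only srcs ending in .jpg
jpgs = soup.find_all('img', src=lambda s: s and s.endswith('.jpg'))
print([img.get('src') for img in jpgs])    # ['a.jpg']
```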
Three, combine urllib and bs4 to complete the crawler program
import urllib.request
from bs4 import BeautifulSoup

def getHtmlCode(url):
    # This method takes a url and returns the html source code of that page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36'
    }
    url1 = urllib.request.Request(url, headers=headers)  # Request adds headers to the url to simulate browser access
    page = urllib.request.urlopen(url1).read()  # read the source code of the page as bytes
    page = page.decode('UTF-8')  # decode the bytes into a string
    return page

def getImg(page, localPath):
    # This method takes the html source, extracts the img tags, and saves the pictures locally
    soup = BeautifulSoup(page, 'html.parser')  # parse the page as html
    imgList = soup.find_all('img')  # returns a list of all img tags
    x = 0
    for imgUrl in imgList:
        print('Downloading: %s' % imgUrl.get('src'))
        # urlretrieve(url, local) saves the image locally according to its url
        urllib.request.urlretrieve(imgUrl.get('src'), localPath + '%d.jpg' % x)
        x += 1

if __name__ == '__main__':
    url = 'http://www.zhangzishi.cc/20160928gx.html'
    localPath = 'e:/pythonSpiderFile/img3/'
    page = getHtmlCode(url)
    getImg(page, localPath)
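Note that the download loop above assumes every src attribute is an absolute URL and that localPath already exists on disk. A hedged variant of the preparation step (the helper name prepareDownload is mine, not from the original program) resolves relative links with urljoin and creates the target directory first:

```python
import os
from urllib.parse import urljoin

def prepareDownload(pageUrl, src, localPath, index):
    # Make sure the target directory exists before urlretrieve writes into it
    os.makedirs(localPath, exist_ok=True)
    # urljoin returns absolute srcs unchanged and resolves relative
    # srcs against the page's own url
    fullUrl = urljoin(pageUrl, src)
    target = os.path.join(localPath, '%d.jpg' % index)
    return fullUrl, target

fullUrl, target = prepareDownload('http://www.zhangzishi.cc/20160928gx.html',
                                  '/uploads/a.jpg', 'img3', 0)
print(fullUrl)  # http://www.zhangzishi.cc/uploads/a.jpg
print(target)
```

Each (fullUrl, target) pair can then be passed to urllib.request.urlretrieve exactly as in the program above.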