python crawls top100 data for cat-eye movies

Keywords: Python Excel Pycharm

Recently there are crawler-related requirements, so I looked for a video (link in the end) on Station B and made a small program, which generally did not modify, but in the final storage, changed from txt to excel.

  • Brief Requirements: Crawl Crawl Cat Eye Movie TOP100 List Data
  • Use language: python
  • Tools: PyCharm
  • Involving libraries: requests, re, openpyxl (higher version excel operation library)

Implementation Code

Cat Eye Movie Robots

 1 # -*- coding: utf-8 -*-
 2 # @Author  : yocichen
 3 # @Email   : yocichen@126.com
 4 # @File    : MaoyanTop100.py
 5 # @Software: PyCharm
 6 # @Time    : 2019/11/6 9:52
 7 
 8 import requests
 9 from requests import RequestException
10 import re
11 import openpyxl
12 
13 # Get page's html by requests module
14 def get_one_page(url):
15     try:
16         headers = {
17             'user-agent':'Mozilla/5.0'
18         }
19         # use headers to avoid 403 Forbidden Error(reject spider)
20         response = requests.get(url, headers=headers)
21         if response.status_code == 200 :
22             return response.text
23         return None
24     except RequestException:
25         return None
26 
27 # Get useful info from html of a page by re module
28 def parse_one_page(html):
29     pattern = re.compile('<dd>.*?board-index.*?>(\d+)<.*?<a.*?title="(.*?)"'
30                          +'.*?data-src="(.*?)".*?</a>.*?star">[\\s]*(.*?)[\\n][\\s]*</p>.*?'
31                          +'releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?'
32                          +'fraction">(.*?)</i>.*?</dd>', re.S)
33     items = re.findall(pattern, html)
34     return items
35 
36 # Main call function
37 def main(url):
38     page_html = get_one_page(url)
39     parse_res = parse_one_page(page_html)
40     return parse_res
41 
42 # Write the useful info in excel(*.xlsx file)
43 def write_excel_xlsx(items):
44     wb = openpyxl.Workbook()
45     ws = wb.active
46     rows = len(items)
47     cols = len(items[0])
48     # First, write col's title.
49     ws.cell(1, 1).value = 'number'
50     ws.cell(1, 2).value = 'title'
51     ws.cell(1, 3).value = 'Publicity Picture'
52     ws.cell(1, 4).value = 'To star'
53     ws.cell(1, 5).value = 'Show time'
54     ws.cell(1, 6).value = 'score'
55     # Write film's info
56     for i in range(0, rows):
57         for j in range(0, cols):
58             # print(items[i-1][j-1])
59             if j != 5:
60                 ws.cell(i+2, j+1).value = items[i][j]
61             else:
62                 ws.cell(i+2, j+1).value = items[i][j]+items[i][j+1]
63                 break
64     # Save the work book as *.xlsx
65     wb.save('maoyan_top100.xlsx')
66 
67 if __name__ == '__main__':
68     res = []
69     url = 'https://maoyan.com/board/4?'
70     for i in range(0, 10):
71         if i == 0:
72             res = main(url)
73         else:
74             newUrl = url+'offset='+str(i*10)
75             res.extend(main(newUrl))
76     # test spider
77     # for item in res:
78     #     print(item)
79     # test wirte to excel
80     # res = [
81     #     [1, 2, 3, 4, 9],
82     #     [2, 3, 4, 5, 9],
83     #     [4, 5, 6, 7, 9]
84     # ]
85 
86     write_excel_xlsx(res)

Current results

Postnote

After getting started, you find that if you use regular expressions and requests libraries for data crawling, it is critical to analyze the structure of HTML pages and regular expressions, and all you have to do is replace the url.

Supplement an example of parsing HTML construction rules

Cat's Eye Classic Sci-fi by Rating

Review the elements and we will see that each item is in <dd>**</dd>format

I want to get movie titles and ratings, first take out the HTML code and have a look

Try to construct a regular

'. *?<dd>. *? Movie-item-title. *? Title='(. *?)'>. *? Integer'>(. *?)<. *? Fraction'> (. *?)<. *?</dd>'(handwritten, unauthenticated)

 

Reference material

[Station B video 2018 latest Python 3.6 web crawler battle] https://www.bilibili.com/video/av19057145/?p=14

[Cat Eye Movie robots] https://maoyan.com/robots.txt (Better check it out before you climb, those that can't)

Posted by PhantomCube on Thu, 07 Nov 2019 06:53:49 -0800