This crawler uses Selenium to scrape the title, streamer name, live category and popularity of each room currently broadcasting on Douyu.

Keywords: Python, Selenium, crawler

Daily Share:

Never fall into self-denial just because others deny you. Life is a subjective process: whether people like you or not is really a matter of their world. So when someone doesn't like you, don't feel inferior and don't try to please them. Focus on being yourself.

Idea analysis:

  1. Build the URL of the target page
  2. Create the driver object
  3. Send a GET request
  4. Parse the data
  5. Save the data
  6. Turn the page

Steps 4, 5 and 6 repeat in a loop, and the loop breaks once the last page is reached.
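
Here is a minimal sketch of that loop before the full source. It swaps the fixed time.sleep calls my code uses for an explicit WebDriverWait (an alternative technique, not what the source below does); the XPaths and the "下一页" (next page) button text are taken from the source code further down.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()                        # step 2: create the driver object
driver.get('https://www.douyu.com/directory/all')  # steps 1 and 3: URL + GET request
wait = WebDriverWait(driver, 10)

while True:
    # Step 4: parse - wait for the room cards instead of sleeping a fixed time
    rooms = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, '//*[@id="listAll"]/section[2]/div[2]/ul/li/div')))
    # Step 5: save - write the parsed rooms out (omitted in this sketch)
    # Step 6: page flip - stop when the next-page button is disabled
    if driver.find_element(By.XPATH, '//li[@title="下一页"]').get_attribute('aria-disabled') == 'true':
        break
    driver.find_element(By.XPATH, '//span[contains(text(), "下一页")]').click()

driver.quit()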

Some other issues I ran into while writing the code:

  1. You can locate an element with XPath, but before you can click it the page must be scrolled down so that the element is actually in view (see the sketch after this list)
  2. I spent a lot of time trying to crawl the cover picture of each room and suspect the site has anti-crawling measures: crawling directly only yielded about three images, and even after adding sleep time I got fewer than ten images in 20 seconds. Since I'm still a beginner and haven't studied anti-crawling countermeasures yet, I gave up on the covers
  3. A time.sleep(1) must come before the scroll operation (the exact duration is up to you). I was stuck here for a long time because the page seemed not to scroll, which in turn made the "next page" element fail to click. Without the one-second sleep, a closer look shows the page is not failing to scroll: it scrolls down and then snaps straight back to the top, which may itself be a kind of anti-crawling measure (the sketch after this list also waits explicitly before clicking)
  4. Write the next-page XPath in a robust way, for example by matching text content. The XPath copied straight out of the browser's dev tools raised an error when turning to the fourth page, probably because the copied path could no longer be found
  5. The source code below is commented in detail
  6. Crawling everything (there were more than 200 pages at the time) takes a long time
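
For issues 1 and 3, the sketch below shows a common workaround instead of a pixel scroll plus a fixed sleep: scroll the target element itself into view, then wait until it is clickable. This is a minimal sketch, assuming the same "下一页" (next page) button markup as the full source below; it is not what my final code does.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.douyu.com/directory/all')

# Scroll the button itself into view instead of guessing a pixel offset like scrollTo(0,10000)
next_btn = driver.find_element(By.XPATH, '//span[contains(text(), "下一页")]')
driver.execute_script('arguments[0].scrollIntoView();', next_btn)

# Wait until the button is visible and enabled before clicking, instead of a fixed one-second sleep
WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//span[contains(text(), "下一页")]'))
).click()

driver.quit()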

The source code is as follows:

from selenium import webdriver
import time


class Dou_yu(object):

    def __init__(self):
        self.url = 'https://www.douyu.com/directory/all'
        self.driver = webdriver.Chrome()

    def parse_data(self):
        # This sleep is recommended: on a slow connection the page may not finish loading and the crawl fails
        time.sleep(3)
        # Store all rooms on each page in a list
        room_list = self.driver.find_elements('xpath', '//*[@id="listAll"]/section[2]/div[2]/ul/li/div')
        # Each page lists 120 rooms; printing the length confirms the XPath matches
        # print(len(room_list))
        # Create a list to temporarily hold the crawled information
        data_list = []
        for room in room_list:
            # Put the crawled information into the dictionary
            tmp = {}
            tmp['title'] = room.find_element('xpath', './a/div[2]/div[1]/h3').get_attribute('textContent')
            tmp['type'] = room.find_element('xpath', './a/div[2]/div[1]/span').get_attribute('textContent')
            tmp['owner'] = room.find_element('xpath', './a/div[2]/div[2]/h2/div').get_attribute('textContent')
            tmp['popularity'] = room.find_element('xpath', './a/div[2]/div[2]/span').get_attribute('textContent')
            data_list.append(tmp)
            # Cover-picture crawling was abandoned (see issue 2 above)
            # tmp['picture'] = room.find_element('xpath', './a/div[1]/div[1]/picture/img').get_attribute('src')
            # print(tmp)
        return data_list

    def save_data(self, data_list):
        # Convert to str first so the list can be written straight to the file
        data_list = str(data_list)
        with open('douyu.txt', 'a', encoding='utf-8') as f:
            f.write(data_list + '\n')

    def run(self):
        # Send a GET request for the room-list page
        self.driver.get(self.url)
        while True:
            # Parse the data on the current page
            data_list = self.parse_data()
            # Save the data
            self.save_data(data_list)
            # Turn the page. This sleep before the scroll is required (duration is up to you):
            # without it the page scrolls down and then snaps straight back to the top
            # (possibly an anti-crawling measure), and the "next page" element fails to click
            time.sleep(1)
            # Scroll by a large value so the page always reaches the bottom
            js = 'scrollTo(0,10000)'
            # Execute the js code
            self.driver.execute_script(js)
            # Comparing the first and last pages, the aria-disabled value of the li tag
            # holding the next-page button differs: "false" on ordinary pages and "true"
            # on the last page, so break out of the loop there
            if self.driver.find_element('xpath', '//li[@title="下一页"]').get_attribute('aria-disabled') == 'true':
                break
            # Find the next-page button and click it
            self.driver.find_element('xpath', '//span[contains(text(), "下一页")]').click()
        # Close the browser once the last page has been crawled
        self.driver.quit()


if __name__ == '__main__':
    dou_yu = Dou_yu()
    dou_yu.run()
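
A possible improvement over save_data: writing str(data_list) makes the file hard to load back. Below is a minimal sketch that appends one JSON object per room instead (the douyu.jsonl filename is my own choice, not from the original code):

import json

def save_data(data_list, path='douyu.jsonl'):
    # Append one JSON object per line; ensure_ascii=False keeps Chinese titles readable
    with open(path, 'a', encoding='utf-8') as f:
        for room in data_list:
            f.write(json.dumps(room, ensure_ascii=False) + '\n')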

Some results are as follows:

(screenshot of the crawled results)

Posted by RedRasper on Thu, 25 Nov 2021 10:05:49 -0800