Crawling "data analysis" job listings from Lagou with Selenium

Keywords: Python Selenium Programming

Abstract

This is the second article in the crawler project. It walks through crawling the "data analysis" job listings from Lagou with Selenium and preparing them for analysis.

Selenium

Selenium is a Python library for driving a browser and is mostly used for automated testing. It works by programmatically controlling a real browser, so there is no need to craft request headers or reverse-engineer JS rendering: whatever you can see in Chrome's Elements panel can be crawled. A driver such as ChromeDriver needs to be downloaded to drive the browser; this article uses Chrome, but other browsers work as well.
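For example, a minimal sketch of driving Chrome looks like the following (the chromedriver path is only a placeholder and assumes a Selenium 3-style setup; adjust it to your own installation, or omit it if chromedriver is already on your PATH):

from selenium import webdriver

# Launch Chrome via ChromeDriver and read the rendered page.
# The executable_path below is a placeholder, not a real path.
bronser = webdriver.Chrome(executable_path='/path/to/chromedriver')
bronser.get('https://www.lagou.com/')
print(bronser.title)              # title of the rendered page
print(len(bronser.page_source))   # the fully rendered HTML, ready for BeautifulSoup
bronser.close()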

How

This article uses Python packages such as selenium, BeautifulSoup, and pandas to crawl the job listings that Lagou returns for a "data analysis" search and store them in a CSV file. The specific steps are:

  • Getting the response.

Search for "data analysis" on the Lagou home page with the location set to nationwide, copy the resulting URL, build a webdriver browser object (named bronser in the code), and open the URL with it:

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import pandas as pd

# Launch Chrome and open the Lagou search results for "data analysis", nationwide
bronser = webdriver.Chrome()
bronser.get('https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?px=default&city=%E5%85%A8%E5%9B%BD#order')
  • Parsing the pages.

The outer level is a loop over the result pages.

For each page, first parse the HTML with BeautifulSoup, locate the job list with the soup.find() method, and use bronser to locate the next-page button and build an element object for it.

Then the details of each post in the job list are extracted into a job_info list, and each job_info list is appended to the job list. After a page has been crawled, check whether the next-page button is still clickable; if it is, click it with next_button.click() and wait 10 seconds before continuing, so that the verification / login pop-up is not triggered.

job = []
while True:
    # Parse the rendered page and grab the <li> job cards in the position list
    soup = BeautifulSoup(bronser.page_source, 'html.parser')
    list_of_position = soup.find('div', class_='s_position_list').find('ul').find_all('li')
    # Locate the next-page button with Selenium so it can be clicked later
    next_button = bronser.find_element_by_class_name('pager_next')

    for i in list_of_position:
        # Most fields are stored as data-* attributes on the <li> element
        company = i.get('data-company')
        companyid = i.get('data-companyid')
        hrid = i.get('data-hrid')
        positionid = i.get('data-positionid')
        positionname = i.get('data-positionname')
        salary = i.get('data-salary')
        tpladword = i.get('data-tpladword')
        # The remaining fields have to be dug out of the card's HTML
        location = i.find('em').string
        hr_position = i.find('input', class_='hr_position').get('value')
        position_tag = i.find('div', class_='li_b_l').text.split('\n')[-2]
        experience = position_tag.split('/')[0]
        education = position_tag.split('/')[1]
        company_tag = i.find('div', class_='industry').text.strip().split('/')
        industry = company_tag[0]
        financing = company_tag[1]
        company_scale = company_tag[2]
        position_describe = i.find('div', class_='list_item_bot').find('span').text
        company_describe = i.find('div', class_='list_item_bot').find('div', class_='li_b_r').text
        job_info = [positionid, positionname, company, companyid, hrid, hr_position, salary, tpladword,
                    location, experience, education, industry, financing, company_scale, position_describe, company_describe]
        job.append(job_info)

    # Stop when the next-page button is disabled; otherwise click it and
    # wait 10 seconds so the verification / login pop-up is not triggered
    if 'pager_next_disabled' in next_button.get_attribute('class'):
        break
    next_button.click()
    time.sleep(10)
  • Data storage.

Unlike before, the data is not written to a file while each page is being parsed; instead every record is saved into a list, which produces a two-dimensional list of all the data. This list is then converted to a DataFrame with pandas, the columns are renamed with df.rename(), and the result is exported to a CSV file with df.to_csv().

# Convert the two-dimensional job list to a DataFrame and name the columns
df = pd.DataFrame(job)
columns = ['positionid','positionname','company','companyid','hrid','hr_position','salary','tpladword','location','experience','education','industry','financing','company_scale','position_describe','company_describe']
df.rename(columns=dict(enumerate(columns)), inplace=True)
# Export to CSV and close the browser
df.to_csv('lagou_recruitment.csv')
bronser.close()
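As a side note (not part of the original script), the column names could also be passed directly to the DataFrame constructor instead of renaming afterwards; a small sketch reusing the job and columns lists defined above:

# Equivalent one-step construction: name the columns when building the DataFrame
df = pd.DataFrame(job, columns=columns)
# index=False drops the row index; utf-8-sig keeps Chinese text readable in Excel
df.to_csv('lagou_recruitment.csv', index=False, encoding='utf-8-sig')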

Result

Summary

This crawl went fairly smoothly, but at first the login box kept popping up because no time.sleep() had been set. Delays of 1s, 3s, and 5s were then tried, and the login box still appeared on some pages; a 10-second delay finally allowed all the data to be collected.
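A fixed sleep is the simplest throttle, but Selenium also offers explicit waits that pause only until an element has actually rendered. A minimal sketch (not used in the original script) that waits for the s_position_list container after a page load; the 15-second timeout is an assumed value, and the time.sleep(10) in the loop above is still useful purely to slow down requests and avoid the verification pop-up:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for the job list container to be present in the DOM,
# rather than guessing how long the page needs to render
wait = WebDriverWait(bronser, 15)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, 's_position_list')))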
