Abstract
This is the second article in the crawler project. It mainly introduces the process of crawling and analyzing job recruitment information from Lagou with Selenium.
Selenium
Selenium is a Python library for driving browsers, mostly used for automated testing. It works by controlling a real browser programmatically, so there is no need to craft request headers or to dig through how the page is rendered by JS: whatever is visible in Chrome's Elements panel can be seen and crawled. A driver such as ChromeDriver needs to be downloaded to drive the browser; this article uses Chrome, but other browsers can be used as well.
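As a minimal sketch of that workflow (separate from the Lagou crawl below, and assuming ChromeDriver is installed and on the PATH), the following opens a page, reads the rendered HTML, and quits:

```python
from selenium import webdriver

# Assumes ChromeDriver is installed and discoverable on the PATH
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # optional: run without a visible window

browser = webdriver.Chrome(options=options)
browser.get('https://example.com')
print(browser.title)            # title of the rendered page
html = browser.page_source      # full HTML after JS rendering, ready for BeautifulSoup
browser.quit()
```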
HOW
In this article, Python's selenium, BeautifulSoup, and pandas packages are used to crawl the job postings returned by a Lagou search for "data analysis" and store them in a CSV file. Specific steps:
- Getting the response.
Search for "data analysis" on the Lagou home page with the location set to nationwide, get the resulting URL, build a browser object, and open the URL with it.
```python
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import pandas as pd

browser = webdriver.Chrome()
browser.get('https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?px=default&city=%E5%85%A8%E5%9B%BD#order')
```
- Page parsing.
The outer level is a loop that pages through the results.
For each page, BeautifulSoup first parses the HTML: the soup.find() method locates the job list, and the browser object locates the next-page button.
Then the details of each post in the job list are extracted into a job_info list, and each job_info list is appended to the overall job list. After each page is crawled, the script checks whether the next-page button is disabled; if it is, the loop ends, otherwise next_button.click() moves to the next page and, to avoid triggering the verification popup, the script sleeps for 10 seconds before continuing.
```python
job = []
while True:
    # Parse the current results page
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    list_of_position = soup.find('div', class_='s_position_list').find('ul').find_all('li')
    next_button = browser.find_element_by_class_name('pager_next')
    for i in list_of_position:
        # Basic fields stored as data-* attributes on each <li>
        company = i.get('data-company')
        companyid = i.get('data-companyid')
        hrid = i.get('data-hrid')
        positionid = i.get('data-positionid')
        positionname = i.get('data-positionname')
        salary = i.get('data-salary')
        tpladword = i.get('data-tpladword')
        # Fields pulled out of the tag contents
        location = i.find('em').string
        hr_position = i.find('input', class_='hr_position').get('value')
        position_tag = i.find('div', class_='li_b_l').text.split('\n')[-2]
        experience = position_tag.split('/')[0]
        education = position_tag.split('/')[1]
        company_tag = i.find('div', class_='industry').text.strip().split('/')
        industry = company_tag[0]
        financing = company_tag[1]
        company_scale = company_tag[2]
        position_describe = i.find('div', class_='list_item_bot').find('span').text
        company_describe = i.find('div', class_='list_item_bot').find('div', class_='li_b_r').text
        job_info = [positionid, positionname, company, companyid, hrid, hr_position, salary, tpladword,
                    location, experience, education, industry, financing, company_scale,
                    position_describe, company_describe]
        job.append(job_info)
    # Stop when the next-page button is disabled; otherwise click it and
    # wait 10 seconds to avoid triggering the verification popup
    if 'pager_next_disabled' in next_button.get_attribute('class'):
        break
    next_button.click()
    time.sleep(10)
```
- Data storage.
Unlike previous articles, the data is saved into a list during parsing rather than written to a file as each page is processed, producing a two-dimensional list of all records. pandas then converts this list into a DataFrame, df.rename() renames the columns, and df.to_csv() exports the result to a CSV file.
```python
df = pd.DataFrame(job)
columns = ['positionid', 'positionname', 'company', 'companyid', 'hrid', 'hr_position',
           'salary', 'tpladword', 'location', 'experience', 'education', 'industry',
           'financing', 'company_scale', 'position_describe', 'company_describe']
# Map the default integer column labels to readable names
df.rename(columns=dict(enumerate(columns)), inplace=True)
df.to_csv('lagou_jobs.csv')
browser.close()
```
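As a quick sanity check, the exported file can be read back with pandas (the filename matches the one used above):

```python
import pandas as pd

# Reload the exported CSV and glance at the first few rows
df = pd.read_csv('lagou_jobs.csv', index_col=0)
print(df.shape)
print(df[['positionname', 'company', 'salary', 'location']].head())
```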
Result
Summary
The crawl itself went fairly smoothly, but at the beginning the login box kept popping up because time.sleep() was not set. Delays of 1 s, 3 s, and 5 s were then tried on different pages, and a 10-second delay finally allowed all the data to be retrieved.
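A fixed 10-second pause was enough here. As a possible refinement (a sketch only, not what this article used), Selenium's WebDriverWait can be combined with a randomized pause so that the click only happens once the button is ready and the request rhythm is less regular:

```python
import random
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 15 seconds for the next-page button to become clickable ...
next_button = WebDriverWait(browser, 15).until(
    EC.element_to_be_clickable((By.CLASS_NAME, 'pager_next'))
)
next_button.click()

# ... then pause for a randomized interval around 10 seconds, so successive
# page loads are less regular than a fixed sleep
time.sleep(random.uniform(8, 12))
```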